Path: csiph.com!usenet.pasdenom.info!news.albasani.net!newsfeed.freenet.ag!news2.euro.net!newsfeed.xs4all.nl!newsfeed4.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
To: python-list@python.org
From: rh <richard_hubbe11@lavabit.com>
Subject: Re: Curious to see alternate approach on a search/replace via regex
Date: Thu, 7 Feb 2013 21:47:03 -0800
References: <mailman.1425.1360186878.2939.python-list@python.org> <14f76032-8753-4e17-897c-447242180e69@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
User-Agent: dsodnetnin
Original-Received: from slem by 1.1 with local
Archive: no
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.1476.1360302430.2939.python-list@python.org>
Lines: 91
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:38407

On Thu, 7 Feb 2013 04:53:22 -0800 (PST)
Nick Mellor <thebalancepro@gmail.com> wrote:

> Hi RH,
> 
> translate methods might be faster (and a little easier to read) for
> your use case. Just precompute and re-use the translation table
> punct_flatten.
> 
> Note that the translate method has changed somewhat for Python 3 due
> to the separation of text from bytes. This is a Python 3 version.
> 
> from urllib.parse import urlparse
> 
> flattened_chars = "./&=?"
> punct_flatten = str.maketrans(flattened_chars, '_' * len(flattened_chars))
> parts = urlparse('http://alongnameofasite1234567.com/q?sports=run&a=1&b=1')
> unflattened = parts.netloc + parts.path + parts.query
> flattened = unflattened.translate(punct_flatten)
> print(flattened)

I like the idea of using a library, but since I'm learning Python I wanted
to try out the regex stuff. I haven't looked, but I'd think that urllib might
(should?) have a builtin so that one wouldn't have to specify the
flattened_chars list. I'm sure there's a name for those chars, but I don't know
it. Maybe just punctuation?
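
On the "punctuation" guess: I'm not aware of a urllib helper for this, but
the standard library's string.punctuation constant lists the ASCII
punctuation characters, so the table could be built from that instead of a
hand-written list. A minimal sketch, assuming it is acceptable to flatten
every ASCII punctuation character (PUNCT_TABLE and flatten are names made up
for illustration):

```python
import string

# Assumption: every ASCII punctuation character should become an underscore.
PUNCT_TABLE = str.maketrans(string.punctuation, "_" * len(string.punctuation))

def flatten(text):
    """Replace each ASCII punctuation character in text with '_'."""
    return text.translate(PUNCT_TABLE)

print(flatten("alongnameofasite1234567.com/q?sports=run&a=1&b=1"))
```

This is broader than the hand-written list: it also flattens characters such
as % and ~, which may or may not be wanted in a filename.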

Also, my version converts the ? into _, but urlparse treats it as the query
separator and drops it from the parts. Just pointing this out for
completeness' sake.

This would mimic what I did:
unflattened = parts.netloc + parts.path + '_' + parts.query
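
Put together, the urlparse approach with that adjustment might look like the
sketch below. The function name url_to_filename and the module-level table
are hypothetical, and I have added - and : to the character list so it
matches the set in my regex. Note that translate maps characters one by one,
so a run such as // becomes __ rather than a single _, and any #fragment is
dropped because it is not among the parts being joined.

```python
from urllib.parse import urlparse

# Same characters as the regex class [-:./?&=]; an assumption, adjust to taste.
FLATTENED_CHARS = "-:./?&="
PUNCT_FLATTEN = str.maketrans(FLATTENED_CHARS, "_" * len(FLATTENED_CHARS))

def url_to_filename(url):
    """Drop the scheme and flatten URL punctuation to underscores."""
    parts = urlparse(url)
    unflattened = parts.netloc + parts.path
    if parts.query:
        # urlparse removes the '?', so put a separator back before the query.
        unflattened += "_" + parts.query
    return unflattened.translate(PUNCT_FLATTEN)

print(url_to_filename("http://alongnameofasite1234567.com/q?sports=run&a=1&b=1"))
```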

> 
> Cheers,
> 
> Nick
> 
> On Thursday, 7 February 2013 08:41:05 UTC+11, rh  wrote:
> > I am curious to know if others would have done this differently.
> > And if so, how so?
> > 
> > This converts a URL to a more easily managed filename, stripping the
> > http protocol off.
> > 
> > This:
> > 
> > http://alongnameofasite1234567.com/q?sports=run&a=1&b=1
> > 
> > becomes this:
> > 
> > alongnameofasite1234567_com_q_sports_run_a_1_b_1
> > 
> > import re
> > 
> > def u2f(u):
> >     # Strip the leading http:// or https://
> >     nx = re.compile(r'https?://(.+)$')
> >     u = nx.search(u).group(1)
> >     # Replace each run of punctuation with a single underscore
> >     ux = re.compile(r'([-:./?&=]+)')
> >     return ux.sub('_', u)
> > 
> > One alternative is to not do the compile step. There must also be a
> > way to do it all at once, i.e. remove the protocol and replace the chars.
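
On the "all at once" question, one possible approach is a single pattern with
two alternatives and a replacement function that picks the substitution per
match. This is only a sketch; the names _FLATTEN_RE and u2f_single_pass are
made up here. Unlike the two-step version, it does not raise an error when
the URL lacks an http(s):// prefix; it just flattens whatever it is given.

```python
import re

# Group 1 captures a leading scheme; the other branch matches punctuation runs.
_FLATTEN_RE = re.compile(r'^(https?://)|[-:./?&=]+')

def u2f_single_pass(url):
    """Drop a leading http(s):// and turn each punctuation run into '_'."""
    # group(1) is set only when the scheme branch matched.
    return _FLATTEN_RE.sub(lambda m: '' if m.group(1) else '_', url)

print(u2f_single_pass("http://alongnameofasite1234567.com/q?sports=run&a=1&b=1"))
```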


--