Re: Curious to see alternate approach on a search/replace via regex

From	rh <richard_hubbe11@lavabit.com>
Subject	Re: Curious to see alternate approach on a search/replace via regex
Date	2013-02-07 21:47 -0800
References	<mailman.1425.1360186878.2939.python-list@python.org> <14f76032-8753-4e17-897c-447242180e69@googlegroups.com>
Newsgroups	comp.lang.python
Message-ID	<mailman.1476.1360302430.2939.python-list@python.org>

On Thu, 7 Feb 2013 04:53:22 -0800 (PST)
Nick Mellor <thebalancepro@gmail.com> wrote:

> Hi RH,
> 
> translate methods might be faster (and a little easier to read) for
> your use case. Just precompute and re-use the translation table
> punct_flatten.
> 
> Note that the translate method has changed somewhat for Python 3 due
> to the separation of text from bytes. This is a Python 3 version.
> 
> from urllib.parse import urlparse
> 
> flattened_chars = "./&=?"
> punct_flatten = str.maketrans(flattened_chars, '_' * len(flattened_chars))
> parts = urlparse('http://alongnameofasite1234567.com/q?sports=run&a=1&b=1')
> unflattened = parts.netloc + parts.path + parts.query
> flattened = unflattened.translate(punct_flatten)
> print(flattened)

I like the idea of using a library, but since I'm learning Python I wanted
to try out the regex stuff. I haven't looked, but I'd think that urllib might
(should?) have a builtin so that one wouldn't have to specify the
flattened_chars list. I'm sure there's a name for those chars but I don't know
it. Maybe just punctuation?

Also, my version converts the ? into _, but urllib treats it as the query
separator and removes it. Just pointing this out for completeness' sake.

This would mimic what I did:
unflattened = parts.netloc + parts.path + '_' + parts.query
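
Put together, that adjustment might look like the sketch below. The function
name url_to_filename and the constant names are made up for illustration, and
'-' and ':' are added to the character list on the assumption that the result
should match the character class in the regex version. Note that translate
maps each character one for one, so a run like "//" would become "__", whereas
the regex with + collapses a run into a single "_".

```python
from urllib.parse import urlparse

# Characters to flatten. '-' and ':' are added here (an assumption) so the
# set matches the character class used in the regex version.
FLATTENED_CHARS = "-:./?&="

# Build the translation table once and reuse it for every call.
PUNCT_FLATTEN = str.maketrans(FLATTENED_CHARS, "_" * len(FLATTENED_CHARS))


def url_to_filename(url):
    """Hypothetical helper: flatten a URL into a filename-friendly string."""
    parts = urlparse(url)
    unflattened = parts.netloc + parts.path
    # urlparse drops the '?' separator, so put a '_' back in its place.
    if parts.query:
        unflattened += "_" + parts.query
    return unflattened.translate(PUNCT_FLATTEN)


print(url_to_filename("http://alongnameofasite1234567.com/q?sports=run&a=1&b=1"))
```

For the example URL this gives the same filename as the regex version.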

> 
> Cheers,
> 
> Nick
> 
> On Thursday, 7 February 2013 08:41:05 UTC+11, rh  wrote:
> > I am curious to know if others would have done this differently.
> > And if so, how so?
> > 
> > This converts a url to a more easily managed filename, stripping the
> > http protocol off.
> > 
> > This:
> > 
> > http://alongnameofasite1234567.com/q?sports=run&a=1&b=1
> > 
> > becomes this:
> > 
> > alongnameofasite1234567_com_q_sports_run_a_1_b_1
> > 
> > import re
> > 
> > def u2f(u):
> >     nx = re.compile(r'https?://(.+)$')
> >     u = nx.search(u).group(1)
> >     ux = re.compile(r'([-:./?&=]+)')
> >     return ux.sub('_', u)
> > 
> > One alternate is to not do the compile step. There must also be a
> > way to do it all at once, i.e. remove the protocol and replace the chars.
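
One possible way to do it all at once is a single pattern with an alternation
and a replacement function. This is only a sketch, not something posted in the
thread: the name u2f_single_pass is made up, and unlike u2f it does not raise
when the http(s) prefix is missing, it just flattens whatever it is given.

```python
import re

# Either the leading scheme (captured in group 1) or a run of punctuation.
# Compiled once at module level so repeated calls reuse the same pattern.
_FLATTEN = re.compile(r'^(https?://)|[-:./?&=]+')


def u2f_single_pass(u):
    """Hypothetical one-pass variant: drop the scheme, flatten punctuation."""
    # Group 1 is only set when the scheme alternative matched; remove it.
    # Any other match is a punctuation run and becomes a single '_'.
    return _FLATTEN.sub(lambda m: '' if m.group(1) else '_', u)


print(u2f_single_pass('http://alongnameofasite1234567.com/q?sports=run&a=1&b=1'))
```

As for skipping the compile step: the module-level functions such as re.sub
keep a cache of recently compiled patterns, so calling re.sub with the pattern
string directly should behave the same here.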


--

Thread

Curious to see alternate approach on a search/replace via regex rh <richard_hubbe11@lavabit.com> - 2013-02-06 13:41 -0800
  Re: Curious to see alternate approach on a search/replace via regex Roy Smith <roy@panix.com> - 2013-02-06 16:54 -0500
  Re: Curious to see alternate approach on a search/replace via regex Nick Mellor <thebalancepro@gmail.com> - 2013-02-07 04:53 -0800
    Re: Curious to see alternate approach on a search/replace via regex rh <richard_hubbe11@lavabit.com> - 2013-02-07 21:47 -0800
      Re: Curious to see alternate approach on a search/replace via regex Nick Mellor <thebalancepro@gmail.com> - 2013-02-08 00:53 -0800