
Curious to see alternate approach on a search/replace via regex

Started by	rh <richard_hubbe11@lavabit.com>
First post	2013-02-06 13:41 -0800
Last post	2013-02-08 00:53 -0800
Articles	7 — 3 participants

  Curious to see alternate approach on a search/replace via regex rh <richard_hubbe11@lavabit.com> - 2013-02-06 13:41 -0800
    Re: Curious to see alternate approach on a search/replace via regex Roy Smith <roy@panix.com> - 2013-02-06 16:54 -0500
    Re: Curious to see alternate approach on a search/replace via regex Nick Mellor <thebalancepro@gmail.com> - 2013-02-07 04:53 -0800
      Re: Curious to see alternate approach on a search/replace via regex rh <richard_hubbe11@lavabit.com> - 2013-02-07 21:47 -0800
        Re: Curious to see alternate approach on a search/replace via regex Nick Mellor <thebalancepro@gmail.com> - 2013-02-08 00:53 -0800
        Re: Curious to see alternate approach on a search/replace via regex Nick Mellor <thebalancepro@gmail.com> - 2013-02-08 00:53 -0800
    Re: Curious to see alternate approach on a search/replace via regex Nick Mellor <thebalancepro@gmail.com> - 2013-02-07 04:53 -0800

#38302 — Curious to see alternate approach on a search/replace via regex

From	rh <richard_hubbe11@lavabit.com>
Date	2013-02-06 13:41 -0800
Subject	Curious to see alternate approach on a search/replace via regex
Message-ID	<mailman.1425.1360186878.2939.python-list@python.org>

I am curious to know whether others would have done this differently and,
if so, how.

This converts a URL to a more easily managed filename, stripping off the
http:// (or https://) prefix.

This:
 
http://alongnameofasite1234567.com/q?sports=run&a=1&b=1

becomes this:

alongnameofasite1234567_com_q_sports_run_a_1_b_1


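import re

def u2f(u):
    # Capture everything after the http:// or https:// prefix
    nx = re.compile(r'https?://(.+)$')
    u = nx.search(u).group(1)
    # Collapse each run of URL punctuation into a single underscore
    ux = re.compile(r'([-:./?&=]+)')
    return ux.sub('_', u)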

One alternative is to skip the compile step. There must also be a way to
do it all at once, i.e. remove the protocol and replace the characters.
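
Both ideas can be sketched as follows. This is only one possible reading of "all at once": the names u2f_onepass and u2f_nocompile are made up for illustration, and unlike u2f the scheme is anchored at the start of the string and treated as optional, so a URL without http:// is still converted instead of raising an error.

```python
import re

# Hypothetical one-pass variant: a single alternation matches either the
# leading scheme (named group "scheme") or a run of URL punctuation.
_PATTERN = re.compile(r'(?P<scheme>^https?://)|[-:./?&=]+')

def u2f_onepass(url):
    """Drop a leading http(s):// and turn each punctuation run into '_'."""
    # The scheme match is replaced by nothing, every other match by '_'.
    return _PATTERN.sub(lambda m: '' if m.group('scheme') else '_', url)

def u2f_nocompile(url):
    """Same result without an explicit compile step (re caches patterns)."""
    return re.sub(r'[-:./?&=]+', '_', re.sub(r'^https?://', '', url))

print(u2f_onepass('http://alongnameofasite1234567.com/q?sports=run&a=1&b=1'))
# alongnameofasite1234567_com_q_sports_run_a_1_b_1
```

The replacement function is what lets one pattern do two different substitutions in a single pass.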

#38303

From	Roy Smith <roy@panix.com>
Date	2013-02-06 16:54 -0500
Message-ID	<roy-4B81A4.16544906022013@news.panix.com>
In reply to	#38302

In article <mailman.1425.1360186878.2939.python-list@python.org>,
 rh <richard_hubbe11@lavabit.com> wrote:

> I am curious to know if others would have done this differently. And if so
> how so?
> 
> This converts a url to a more easily managed filename, stripping the
> http protocol off. 

I would have used the urlparse module.

http://docs.python.org/2/library/urlparse.html
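
For readers who have not used it, here is a minimal sketch of what that module gives you for the example URL. It uses the Python 3 import path (urllib.parse); in Python 2 the function lives in the urlparse module linked above.

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

parts = urlparse('http://alongnameofasite1234567.com/q?sports=run&a=1&b=1')

# The scheme and the '?' separator are split off by the parser,
# so no regex is needed to strip the protocol.
print(parts.scheme)  # http
print(parts.netloc)  # alongnameofasite1234567.com
print(parts.path)    # /q
print(parts.query)   # sports=run&a=1&b=1
```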

#38348

From	Nick Mellor <thebalancepro@gmail.com>
Date	2013-02-07 04:53 -0800
Message-ID	<14f76032-8753-4e17-897c-447242180e69@googlegroups.com>
In reply to	#38302

Hi RH,

The translate method might be faster (and a little easier to read) for your use case. Just precompute the translation table punct_flatten and reuse it.

Note that the translate method changed somewhat in Python 3 because of the separation of text from bytes. This is a Python 3 version.

from urllib.parse import urlparse

# Map every character in flattened_chars to '_'
flattened_chars = "./&=?"
punct_flatten = str.maketrans(flattened_chars, '_' * len(flattened_chars))
parts = urlparse('http://alongnameofasite1234567.com/q?sports=run&a=1&b=1')
unflattened = parts.netloc + parts.path + parts.query
flattened = unflattened.translate(punct_flatten)
print(flattened)
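
Whether translate really is faster is easy to measure. The following is a rough timing sketch, not a benchmark from this thread: the function names and the iteration count are arbitrary, and the regex here replaces one character at a time so that both approaches do the same work.

```python
import re
import timeit

text = 'alongnameofasite1234567.com/q?sports=run&a=1&b=1'

table = str.maketrans('./&=?', '_' * 5)  # built once, reused on every call
pattern = re.compile(r'[./&=?]')         # same five characters, no '+'

def with_translate():
    return text.translate(table)

def with_regex():
    return pattern.sub('_', text)

# Both approaches must agree before the timings mean anything.
assert with_translate() == with_regex()

print('translate:', timeit.timeit(with_translate, number=100000))
print('regex:    ', timeit.timeit(with_regex, number=100000))
```

The numbers depend on the machine and the Python version, so none are quoted here.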

Cheers,

Nick

#38407

From	rh <richard_hubbe11@lavabit.com>
Date	2013-02-07 21:47 -0800
Message-ID	<mailman.1476.1360302430.2939.python-list@python.org>
In reply to	#38348

On Thu, 7 Feb 2013 04:53:22 -0800 (PST)
Nick Mellor <thebalancepro@gmail.com> wrote:

> Hi RH,
> 
> translate methods might be faster (and a little easier to read) for
> your use case. Just precompute and re-use the translation table
> punct_flatten.
> 
> Note that the translate method has changed somewhat for Python 3 due
> to the separation of text from bytes. This is a Python 3 version.
> 
> from urllib.parse import urlparse
> 
> flattened_chars = "./&=?"
> punct_flatten = str.maketrans(flattened_chars, '_' * len(flattened_chars))
> parts = urlparse('http://alongnameofasite1234567.com/q?sports=run&a=1&b=1')
> unflattened = parts.netloc + parts.path + parts.query
> flattened = unflattened.translate(punct_flatten)
> print(flattened)

I like the idea of using a library, but since I'm learning Python I wanted
to try out the regex stuff. I haven't looked, but I'd think that urllib might
(should?) have a built-in so that one wouldn't have to specify the
flattened_chars list. I'm sure there's a name for those characters, but I
don't know it. Maybe just punctuation?
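
One ready-made list that does exist in the standard library is string.punctuation, though it is not part of urllib. The sketch below assumes it is acceptable to flatten every ASCII punctuation character, which is a broader set than the five characters in flattened_chars (it also covers '-', ':', '%', '#' and so on).

```python
import string

# string.punctuation holds all ASCII punctuation characters, so the
# list of characters to flatten does not have to be typed out by hand.
punct_flatten = str.maketrans(string.punctuation, '_' * len(string.punctuation))

name = 'alongnameofasite1234567.com/q?sports=run&a=1&b=1'.translate(punct_flatten)
print(name)  # alongnameofasite1234567_com_q_sports_run_a_1_b_1
```

As for a name: RFC 3986 calls characters such as ':', '/', '?', '&' and '=' "reserved characters", while '.' and '-' count as unreserved.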

Also, my version converts the ? into _, but urllib sees it as the query
separator and removes it. Just pointing this out for completeness' sake.

This would mimic what I did:
unflattened = parts.netloc + parts.path + '_' + parts.query
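
Put together as a function, that change could look like the sketch below. The name url_to_filename is made up for illustration, and joining only the non-empty pieces is an added assumption so that a URL without a query string does not end in a stray underscore.

```python
from urllib.parse import urlparse

FLATTENED_CHARS = './&=?'
PUNCT_FLATTEN = str.maketrans(FLATTENED_CHARS, '_' * len(FLATTENED_CHARS))

def url_to_filename(url):
    """Flatten netloc, path and query into one underscore-separated name."""
    parts = urlparse(url)
    # Put '_' where urlparse removed the '?', but only if there is a query.
    pieces = [p for p in (parts.netloc + parts.path, parts.query) if p]
    return '_'.join(pieces).translate(PUNCT_FLATTEN)

print(url_to_filename('http://alongnameofasite1234567.com/q?sports=run&a=1&b=1'))
# alongnameofasite1234567_com_q_sports_run_a_1_b_1
```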

#38431

From	Nick Mellor <thebalancepro@gmail.com>
Date	2013-02-08 00:53 -0800
Message-ID	<d325e0c9-977d-415f-8bde-1e2e38ef1f19@googlegroups.com>
In reply to	#38407

Hi RH,

It's essential to know about regex, of course, but often there's a better, easier-to-read way to do things in Python.

One of Python's aims is clarity and ease of reading.

Regex is complex, potentially inefficient and hard to read (although sometimes it is the only reasonable way to do things).

Best,

Nick
