Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #50900 > unrolled thread

Find and Replace Simplification

Started byDevyn Collier Johnson <devyncjohnson@gmail.com>
First post2013-07-19 09:22 -0400
Last post2013-07-21 13:49 +0100
Articles 20 on this page of 23 — 8 participants

Back to article view | Back to comp.lang.python


Contents

  Find and Replace Simplification Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-19 09:22 -0400
    Re: Find and Replace Simplification Novocastrian_Nomad <gregory.j.baker@gmail.com> - 2013-07-19 06:38 -0700
    Re: Find and Replace Simplification John Gordon <gordon@panix.com> - 2013-07-19 14:28 +0000
    Re: Find and Replace Simplification Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-19 16:22 +0000
      Re: Find and Replace Simplification Serhiy Storchaka <storchaka@gmail.com> - 2013-07-19 20:29 +0300
      Re: Find and Replace Simplification Skip Montanaro <skip@pobox.com> - 2013-07-19 13:08 -0500
      Re: Find and Replace Simplification Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-19 17:44 -0400
      Re: Find and Replace Simplification Dave Angel <davea@davea.name> - 2013-07-19 18:45 -0400
      Re: Find and Replace Simplification Joshua Landau <joshua@landau.ws> - 2013-07-20 12:16 +0100
      Re: Find and Replace Simplification Serhiy Storchaka <storchaka@gmail.com> - 2013-07-20 14:48 +0300
      Re: Find and Replace Simplification Serhiy Storchaka <storchaka@gmail.com> - 2013-07-20 14:57 +0300
      Re: Find and Replace Simplification Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-20 08:41 -0400
      Re: Find and Replace Simplification Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-20 08:50 -0400
      Re: Find and Replace Simplification Joshua Landau <joshua@landau.ws> - 2013-07-20 18:03 +0100
      Re: Find and Replace Simplification Dave Angel <davea@davea.name> - 2013-07-20 14:04 -0400
      Re: Find and Replace Simplification Joshua Landau <joshua@landau.ws> - 2013-07-20 19:37 +0100
      Re: Find and Replace Simplification Joshua Landau <joshua@landau.ws> - 2013-07-20 19:41 +0100
      Re: Find and Replace Simplification Dave Angel <davea@davea.name> - 2013-07-20 17:56 -0400
      Re: Find and Replace Simplification Joshua Landau <joshua@landau.ws> - 2013-07-20 23:33 +0100
      Re: Find and Replace Simplification Serhiy Storchaka <storchaka@gmail.com> - 2013-07-21 10:44 +0300
      Re: Find and Replace Simplification Joshua Landau <joshua@landau.ws> - 2013-07-21 12:29 +0100
      Re: Find and Replace Simplification Serhiy Storchaka <storchaka@gmail.com> - 2013-07-21 15:28 +0300
      Re: Find and Replace Simplification Joshua Landau <joshua@landau.ws> - 2013-07-21 13:49 +0100

Page 1 of 2  [1] 2  Next page →


#50900 — Find and Replace Simplification

FromDevyn Collier Johnson <devyncjohnson@gmail.com>
Date2013-07-19 09:22 -0400
SubjectFind and Replace Simplification
Message-ID<mailman.4865.1374240179.3114.python-list@python.org>
I have some code that I want to simplify. I know that a for-loop would 
work well, but can I make re.sub perform all of the below tasks at once, 
or can I write this in a way that is more efficient than using a for-loop?

DATA = re.sub(',', '', 'DATA')
DATA = re.sub('\'', '', 'DATA')
DATA = re.sub('(', '', 'DATA')
DATA = re.sub(')', '', 'DATA')

[toc] | [next] | [standalone]


#50903

FromNovocastrian_Nomad <gregory.j.baker@gmail.com>
Date2013-07-19 06:38 -0700
Message-ID<ec0b6a8b-ea8c-4137-b2a3-931bf401e0bd@googlegroups.com>
In reply to#50900
On Friday, July 19, 2013 7:22:48 AM UTC-6, Devyn Collier Johnson wrote:
> I have some code that I want to simplify. I know that a for-loop would 
> 
> work well, but can I make re.sub perform all of the below tasks at once, 
> or can I write this in a way that is more efficient than using a for-loop?
> 
> DATA = re.sub(',', '', 'DATA')
> DATA = re.sub('\'', '', 'DATA')
> DATA = re.sub('(', '', 'DATA')
> DATA = re.sub(')', '', 'DATA')


Try DATA = re.sub(r'[(,\\)]', '', 'DATA')

[toc] | [prev] | [next] | [standalone]


#50907

FromJohn Gordon <gordon@panix.com>
Date2013-07-19 14:28 +0000
Message-ID<ksbiep$qh0$1@reader2.panix.com>
In reply to#50900
In <mailman.4865.1374240179.3114.python-list@python.org> Devyn Collier Johnson <devyncjohnson@gmail.com> writes:

> I have some code that I want to simplify. I know that a for-loop would 
> work well, but can I make re.sub perform all of the below tasks at once, 
> or can I write this in a way that is more efficient than using a for-loop?

> DATA = re.sub(',', '', 'DATA')
> DATA = re.sub('\'', '', 'DATA')
> DATA = re.sub('(', '', 'DATA')
> DATA = re.sub(')', '', 'DATA')

If your actual use-case is this simple, you might want to use one of the
built-in string functions such as strip() or translate().

-- 
John Gordon                   A is for Amy, who fell down the stairs
gordon@panix.com              B is for Basil, assaulted by bears
                                -- Edward Gorey, "The Gashlycrumb Tinies"

[toc] | [prev] | [next] | [standalone]


#50914

FromSteven D'Aprano <steve+comp.lang.python@pearwood.info>
Date2013-07-19 16:22 +0000
Message-ID<51e967bb$0$29971$c3e8da3$5496439d@news.astraweb.com>
In reply to#50900
On Fri, 19 Jul 2013 09:22:48 -0400, Devyn Collier Johnson wrote:

> I have some code that I want to simplify. I know that a for-loop would
> work well, but can I make re.sub perform all of the below tasks at once,
> or can I write this in a way that is more efficient than using a
> for-loop?
> 
> DATA = re.sub(',', '', 'DATA')
> DATA = re.sub('\'', '', 'DATA')
> DATA = re.sub('(', '', 'DATA')
> DATA = re.sub(')', '', 'DATA')


I don't think you intended to put DATA in quotes on the right hand side. 
That makes it literally the string D A T A, so all those replacements are 
no-ops, and you could simplify it to:

DATA = 'DATA'

But that's probably not what you wanted.

My prediction is that this will be by far the most efficient way to do 
what you are trying to do:

py> DATA = "Hello, 'World'()"
py> DATA.translate(dict.fromkeys(ord(c) for c in ",'()"))
'Hello World'

That's in Python 3 -- in Python 2, using translate will still probably be 
the fastest, but you'll need to call it like this:

import string
DATA.translate(string.maketrans("", ""), ",'()")

I also expect that the string replace() method will be second fastest, 
and re.sub will be the slowest, by a very long way.

As a general rule, you should avoiding using regexes unless the text you 
are searching for actually contains a regular expression of some kind. If 
it's merely a literal character or substring, standard string methods 
will probably be faster.


Oh, and a tip for you:

- don't escape quotes unless you don't need to, use the other quote.

s = '\''  # No, don't do this!
s = "'"  # Better!

and vice versa.




-- 
Steven

[toc] | [prev] | [next] | [standalone]


#50917

FromSerhiy Storchaka <storchaka@gmail.com>
Date2013-07-19 20:29 +0300
Message-ID<mailman.4882.1374254975.3114.python-list@python.org>
In reply to#50914
19.07.13 19:22, Steven D'Aprano написав(ла):
> I also expect that the string replace() method will be second fastest,
> and re.sub will be the slowest, by a very long way.

The string replace() method is fastest (at least in Python 3.3+). See 
implementation of html.escape() etc.

[toc] | [prev] | [next] | [standalone]


#50919

FromSkip Montanaro <skip@pobox.com>
Date2013-07-19 13:08 -0500
Message-ID<mailman.4883.1374257324.3114.python-list@python.org>
In reply to#50914
Serhiy> The string replace() method is fastest (at least in Python 3.3+). See
Serhiy> implementation of html.escape() etc.

I trust everybody knows by now that when you want to use regular
expressions you should shell out to Perl for the best performance. :-)

Skip

[toc] | [prev] | [next] | [standalone]


#50922

FromDevyn Collier Johnson <devyncjohnson@gmail.com>
Date2013-07-19 17:44 -0400
Message-ID<mailman.4884.1374270271.3114.python-list@python.org>
In reply to#50914
On 07/19/2013 12:22 PM, Steven D'Aprano wrote:
> On Fri, 19 Jul 2013 09:22:48 -0400, Devyn Collier Johnson wrote:
>
>> I have some code that I want to simplify. I know that a for-loop would
>> work well, but can I make re.sub perform all of the below tasks at once,
>> or can I write this in a way that is more efficient than using a
>> for-loop?
>>
>> DATA = re.sub(',', '', 'DATA')
>> DATA = re.sub('\'', '', 'DATA')
>> DATA = re.sub('(', '', 'DATA')
>> DATA = re.sub(')', '', 'DATA')
>
> I don't think you intended to put DATA in quotes on the right hand side.
> That makes it literally the string D A T A, so all those replacements are
> no-ops, and you could simplify it to:
>
> DATA = 'DATA'
>
> But that's probably not what you wanted.
>
> My prediction is that this will be by far the most efficient way to do
> what you are trying to do:
>
> py> DATA = "Hello, 'World'()"
> py> DATA.translate(dict.fromkeys(ord(c) for c in ",'()"))
> 'Hello World'
>
> That's in Python 3 -- in Python 2, using translate will still probably be
> the fastest, but you'll need to call it like this:
>
> import string
> DATA.translate(string.maketrans("", ""), ",'()")
>
> I also expect that the string replace() method will be second fastest,
> and re.sub will be the slowest, by a very long way.
>
> As a general rule, you should avoiding using regexes unless the text you
> are searching for actually contains a regular expression of some kind. If
> it's merely a literal character or substring, standard string methods
> will probably be faster.
>
>
> Oh, and a tip for you:
>
> - don't escape quotes unless you don't need to, use the other quote.
>
> s = '\''  # No, don't do this!
> s = "'"  # Better!
>
> and vice versa.
>
>
>
>
Thanks for finding that error; DATA should not be in quotes. I cannot 
believe I missed that. Good eye Steven!

Using the replace command is a brilliant idea; I will implement that 
where ever I can. I am wanting to perform all of the replaces at once. 
Is that possible?

Mahalo,
DCJ

[toc] | [prev] | [next] | [standalone]


#50926

FromDave Angel <davea@davea.name>
Date2013-07-19 18:45 -0400
Message-ID<mailman.4888.1374273947.3114.python-list@python.org>
In reply to#50914
On 07/19/2013 05:44 PM, Devyn Collier Johnson wrote:
>
> On 07/19/2013 12:22 PM, Steven D'Aprano wrote:
>> On Fri, 19 Jul 2013 09:22:48 -0400, Devyn Collier Johnson wrote:
>>
>>> I have some code that I want to simplify. I know that a for-loop would
>>> work well, but can I make re.sub perform all of the below tasks at once,
>>> or can I write this in a way that is more efficient than using a
>>> for-loop?
>>>
>>> DATA = re.sub(',', '', 'DATA')
>>> DATA = re.sub('\'', '', 'DATA')
>>> DATA = re.sub('(', '', 'DATA')
>>> DATA = re.sub(')', '', 'DATA')
>>
>> I don't think you intended to put DATA in quotes on the right hand side.
>> That makes it literally the string D A T A, so all those replacements are
>> no-ops, and you could simplify it to:
>>
>> DATA = 'DATA'
>>
>> But that's probably not what you wanted.
>>
>> My prediction is that this will be by far the most efficient way to do
>> what you are trying to do:
>>
>> py> DATA = "Hello, 'World'()"
>> py> DATA.translate(dict.fromkeys(ord(c) for c in ",'()"))
>> 'Hello World'
>>
>> That's in Python 3 -- in Python 2, using translate will still probably be
>> the fastest, but you'll need to call it like this:
>>
>> import string
>> DATA.translate(string.maketrans("", ""), ",'()")
>>
>> I also expect that the string replace() method will be second fastest,
>> and re.sub will be the slowest, by a very long way.
>>
>> As a general rule, you should avoiding using regexes unless the text you
>> are searching for actually contains a regular expression of some kind. If
>> it's merely a literal character or substring, standard string methods
>> will probably be faster.
>>
>>
>> Oh, and a tip for you:
>>
>> - don't escape quotes unless you don't need to, use the other quote.
>>
>> s = '\''  # No, don't do this!
>> s = "'"  # Better!
>>
>> and vice versa.
>>
>>
>>
>>
> Thanks for finding that error; DATA should not be in quotes. I cannot
> believe I missed that. Good eye Steven!
>
> Using the replace command is a brilliant idea; I will implement that
> where ever I can. I am wanting to perform all of the replaces at once.
> Is that possible?
>

Read what you're quoting from.  The translate() method does just that. 
And maketrans() is the way to build a translate table.

On an Intel processor, the xlat instruction does a translate for one 
character, and adding a REP in front of it does it for an entire buffer. 
  No idea if Python takes advantage of that, however.




-- 
DaveA

[toc] | [prev] | [next] | [standalone]


#50959

FromJoshua Landau <joshua@landau.ws>
Date2013-07-20 12:16 +0100
Message-ID<mailman.4915.1374319043.3114.python-list@python.org>
In reply to#50914
On 19 July 2013 18:29, Serhiy Storchaka <storchaka@gmail.com> wrote:
> 19.07.13 19:22, Steven D'Aprano написав(ла):
>
>> I also expect that the string replace() method will be second fastest,
>> and re.sub will be the slowest, by a very long way.
>
>
> The string replace() method is fastest (at least in Python 3.3+). See
> implementation of html.escape() etc.

def escape(s, quote=True):
    if quote:
        return s.translate(_escape_map_full)
    return s.translate(_escape_map)

I fail to see how this supports the assertion that str.replace() is
faster. However, some quick timing shows that translate has a very
high penalty for missing characters and is a tad slower any way.

Really, though, there should be no reason for .translate() to be
slower than replace -- at worst it should just be "reduce(lambda s,
ab: s.replace(*ab), mapping.items()¹, original_str)" and end up the
*same* speed as iterated replace. But the fact that it doesn't have to
re-build the string every replace means that theoretically it should
be a lot faster.

¹ I realise this won't actually work for several reasons, and doesn't
support things like passing in lists as mappings, but you could
trivially support the important builtin types² and fall back to the
original for others, where the pure-python __getitem__ is going to be
the slowest part anyway.

² List, tuple, dict, str, bytes -- so basically just mappings and
ordered iterables

[toc] | [prev] | [next] | [standalone]


#50960

FromSerhiy Storchaka <storchaka@gmail.com>
Date2013-07-20 14:48 +0300
Message-ID<mailman.4916.1374320911.3114.python-list@python.org>
In reply to#50914
19.07.13 21:08, Skip Montanaro написав(ла):
> Serhiy> The string replace() method is fastest (at least in Python 3.3+). See
> Serhiy> implementation of html.escape() etc.
> I trust everybody knows by now that when you want to use regular
> expressions you should shell out to Perl for the best performance. :-)

If you want to use regular expressions Python is not the best choice. 
But if you want to use Python regular expressions sometimes are the best 
choice.

[toc] | [prev] | [next] | [standalone]


#50961

FromSerhiy Storchaka <storchaka@gmail.com>
Date2013-07-20 14:57 +0300
Message-ID<mailman.4917.1374321439.3114.python-list@python.org>
In reply to#50914
20.07.13 14:16, Joshua Landau написав(ла):
> On 19 July 2013 18:29, Serhiy Storchaka <storchaka@gmail.com> wrote:
>> The string replace() method is fastest (at least in Python 3.3+). See
>> implementation of html.escape() etc.
>
> def escape(s, quote=True):
>      if quote:
>          return s.translate(_escape_map_full)
>      return s.translate(_escape_map)
>
> I fail to see how this supports the assertion that str.replace() is
> faster.

And now look at Python 3.4 sources.

> However, some quick timing shows that translate has a very
> high penalty for missing characters and is a tad slower any way.
>
> Really, though, there should be no reason for .translate() to be
> slower than replace -- at worst it should just be "reduce(lambda s,
> ab: s.replace(*ab), mapping.items()¹, original_str)" and end up the
> *same* speed as iterated replace.

It doesn't work such way. Consider 
'ab'.translate({ord('a'):'b',ord('b'):'a'}).

[toc] | [prev] | [next] | [standalone]


#50966

FromDevyn Collier Johnson <devyncjohnson@gmail.com>
Date2013-07-20 08:41 -0400
Message-ID<mailman.4922.1374324108.3114.python-list@python.org>
In reply to#50914
On 07/20/2013 07:16 AM, Joshua Landau wrote:
> On 19 July 2013 18:29, Serhiy Storchaka <storchaka@gmail.com> wrote:
>> 19.07.13 19:22, Steven D'Aprano написав(ла):
>>
>>> I also expect that the string replace() method will be second fastest,
>>> and re.sub will be the slowest, by a very long way.
>>
>> The string replace() method is fastest (at least in Python 3.3+). See
>> implementation of html.escape() etc.
> def escape(s, quote=True):
>      if quote:
>          return s.translate(_escape_map_full)
>      return s.translate(_escape_map)
>
> I fail to see how this supports the assertion that str.replace() is
> faster. However, some quick timing shows that translate has a very
> high penalty for missing characters and is a tad slower any way.
>
> Really, though, there should be no reason for .translate() to be
> slower than replace -- at worst it should just be "reduce(lambda s,
> ab: s.replace(*ab), mapping.items()¹, original_str)" and end up the
> *same* speed as iterated replace. But the fact that it doesn't have to
> re-build the string every replace means that theoretically it should
> be a lot faster.
>
> ¹ I realise this won't actually work for several reasons, and doesn't
> support things like passing in lists as mappings, but you could
> trivially support the important builtin types² and fall back to the
> original for others, where the pure-python __getitem__ is going to be
> the slowest part anyway.
>
> ² List, tuple, dict, str, bytes -- so basically just mappings and
> ordered iterables
Thanks Joshua Landau! str.replace() does appear to be best, so that is 
the suggestion that I will implement.

Mahalo,

DCJ

[toc] | [prev] | [next] | [standalone]


#50968

FromDevyn Collier Johnson <devyncjohnson@gmail.com>
Date2013-07-20 08:50 -0400
Message-ID<mailman.4924.1374324623.3114.python-list@python.org>
In reply to#50914
On 07/20/2013 07:48 AM, Serhiy Storchaka wrote:
> 19.07.13 21:08, Skip Montanaro написав(ла):
>> Serhiy> The string replace() method is fastest (at least in Python 
>> 3.3+). See
>> Serhiy> implementation of html.escape() etc.
>> I trust everybody knows by now that when you want to use regular
>> expressions you should shell out to Perl for the best performance. :-)
>
> If you want to use regular expressions Python is not the best choice. 
> But if you want to use Python regular expressions sometimes are the 
> best choice.
>
>
That is an interesting concept. (^u^)

Mahalo,
DCJ

[toc] | [prev] | [next] | [standalone]


#50970

FromJoshua Landau <joshua@landau.ws>
Date2013-07-20 18:03 +0100
Message-ID<mailman.4926.1374339885.3114.python-list@python.org>
In reply to#50914
On 20 July 2013 12:57, Serhiy Storchaka <storchaka@gmail.com> wrote:
> 20.07.13 14:16, Joshua Landau написав(ла):
>>
>> On 19 July 2013 18:29, Serhiy Storchaka <storchaka@gmail.com> wrote:
>>>
>>> The string replace() method is fastest (at least in Python 3.3+). See
>>> implementation of html.escape() etc.
>>
>>
>> def escape(s, quote=True):
>>      if quote:
>>          return s.translate(_escape_map_full)
>>      return s.translate(_escape_map)
>>
>> I fail to see how this supports the assertion that str.replace() is
>> faster.
>
>
> And now look at Python 3.4 sources.

I'll just trust you ;).

>> However, some quick timing shows that translate has a very
>> high penalty for missing characters and is a tad slower any way.
>>
>> Really, though, there should be no reason for .translate() to be
>> slower than replace -- at worst it should just be "reduce(lambda s,
>> ab: s.replace(*ab), mapping.items()¹, original_str)" and end up the
>> *same* speed as iterated replace.
>
>
> It doesn't work such way. Consider
> 'ab'.translate({ord('a'):'b',ord('b'):'a'}).

*sad*

Still, it seems to me that it should be optimizable for sensible
builtin types such that .translate is significantly faster, as there's
no theoretical extra work that .translate *has* to do that .replace
does not, and .replace also has to rebuild the string a lot of times.

[toc] | [prev] | [next] | [standalone]


#50971

FromDave Angel <davea@davea.name>
Date2013-07-20 14:04 -0400
Message-ID<mailman.4927.1374343508.3114.python-list@python.org>
In reply to#50914
On 07/20/2013 01:03 PM, Joshua Landau wrote:
> On 20 July 2013 12:57, Serhiy Storchaka <storchaka@gmail.com> wrote:
>> 20.07.13 14:16, Joshua Landau написав(ла):
>>>

     <snip>

>>> However, some quick timing shows that translate has a very
>>> high penalty for missing characters and is a tad slower any way.
>>>
>>> Really, though, there should be no reason for .translate() to be
>>> slower than replace -- at worst it should just be "reduce(lambda s,
>>> ab: s.replace(*ab), mapping.items()¹, original_str)" and end up the
>>> *same* speed as iterated replace.
>>
>>
>> It doesn't work such way. Consider
>> 'ab'.translate({ord('a'):'b',ord('b'):'a'}).
>
> *sad*
>
> Still, it seems to me that it should be optimizable for sensible
> builtin types such that .translate is significantly faster, as there's
> no theoretical extra work that .translate *has* to do that .replace
> does not, and .replace also has to rebuild the string a lot of times.
>

translate is going to be faster (than replace) for Unicode if it has a 
"large" table.  For example, to translate from ASCII to EBCDIC, where 
every character in the string is replaced by a new one.  I have no idea 
what the cutoff is.  But of course, for a case like ASCII to EBCDIC, it 
would be very tricky to do it with replaces, probably taking much more 
than the expected 96 passes.

translate for byte strings is undoubtedly tons faster.  For byte 
strings, the translation table is 256 bytes, and the inner loop is a 
simple lookup.  But for Unicode, the table is a dict (or something very 
like it, I looked at the C code, not the Python code).

So for every character in the input string, it does a dict-type lookup, 
before it can even decide if the character is going to change.

Just for reference, the two files I was looking at were:

objects/unicodeobject.c
objects/bytesobject.c

Extracted from the bz2 downloaded from the page:
     http://hg.python.org/cpython


-- 
DaveA

[toc] | [prev] | [next] | [standalone]


#50972

FromJoshua Landau <joshua@landau.ws>
Date2013-07-20 19:37 +0100
Message-ID<mailman.4928.1374345496.3114.python-list@python.org>
In reply to#50914
On 20 July 2013 19:04, Dave Angel <davea@davea.name> wrote:
> On 07/20/2013 01:03 PM, Joshua Landau wrote:
>>
>> Still, it seems to me that it should be optimizable for sensible
>> builtin types such that .translate is significantly faster, as there's
>> no theoretical extra work that .translate *has* to do that .replace
>> does not, and .replace also has to rebuild the string a lot of times.
>>
>
> translate is going to be faster (than replace) for Unicode if it has a
> "large" table.  For example, to translate from ASCII to EBCDIC, where every
> character in the string is replaced by a new one.  I have no idea what the
> cutoff is.  But of course, for a case like ASCII to EBCDIC, it would be very
> tricky to do it with replaces, probably taking much more than the expected
> 96 passes.

My timings showed that for ".upper()", doing the full 26 passes "a" ->
"A", it was *way* slower to use .translate than .replace, unless you
used a list or equiv. with much faster lookup. Even then, it was
slower to use .translate.

I agree that for large tables it's obviously going to swing the other
way, but by the time you're running .replace 26 times you wouldn't (at
least I wouldn't) expect it still to be screamingly faster than
.translate.

> translate for byte strings is undoubtedly tons faster.  For byte strings,
> the translation table is 256 bytes, and the inner loop is a simple lookup.

For my above test, .translate is about 10x faster than iterated .replace.

> But for Unicode, the table is a dict (or something very like it, I looked at
> the C code, not the Python code).
>
> So for every character in the input string, it does a dict-type lookup,
> before it can even decide if the character is going to change.

The problem can be solved, I'd imagine, for builtin types. Just build
an internal representation upon calling .translate that's faster. It's
especially easy in the list case -- just build a C array¹ at the start
mapping int -> int and then have really fast C mapping speeds.

For dictionaries, you can do the same thing -- you just have to make
sure you're not breaking any memory barriers.

¹ I don't do C or other low level languages, so my knowledge in this
area is embarrassingly bad

> Just for reference, the two files I was looking at were:
>
> objects/unicodeobject.c
> objects/bytesobject.c
>
> Extracted from the bz2 downloaded from the page:
>     http://hg.python.org/cpython

I didn't look at bytes first time, I might take a look later.

[toc] | [prev] | [next] | [standalone]


#50973

FromJoshua Landau <joshua@landau.ws>
Date2013-07-20 19:41 +0100
Message-ID<mailman.4929.1374345746.3114.python-list@python.org>
In reply to#50914
On 20 July 2013 19:37, Joshua Landau <joshua@landau.ws> wrote:
> mapping int -> int

Well, on second thought it's not quite this unless it's a 1:1 mapping.
Point remains valid, though, I think.

[toc] | [prev] | [next] | [standalone]


#50979

FromDave Angel <davea@davea.name>
Date2013-07-20 17:56 -0400
Message-ID<mailman.4932.1374357402.3114.python-list@python.org>
In reply to#50914
On 07/20/2013 02:37 PM, Joshua Landau wrote:
> On 20 July 2013 19:04, Dave Angel <davea@davea.name> wrote:
>> On 07/20/2013 01:03 PM, Joshua Landau wrote:
>>>
>>> Still, it seems to me that it should be optimizable for sensible
>>> builtin types such that .translate is significantly faster, as there's
>>> no theoretical extra work that .translate *has* to do that .replace
>>> does not, and .replace also has to rebuild the string a lot of times.
>>>
>>
>> translate is going to be faster (than replace) for Unicode if it has a
>> "large" table.  For example, to translate from ASCII to EBCDIC, where every
>> character in the string is replaced by a new one.  I have no idea what the
>> cutoff is.  But of course, for a case like ASCII to EBCDIC, it would be very
>> tricky to do it with replaces, probably taking much more than the expected
>> 96 passes.
>
> My timings showed that for ".upper()", doing the full 26 passes "a" ->
> "A", it was *way* slower to use .translate than .replace, unless you
> used a list or equiv. with much faster lookup. Even then, it was
> slower to use .translate.
>
> I agree that for large tables it's obviously going to swing the other
> way, but by the time you're running .replace 26 times you wouldn't (at
> least I wouldn't) expect it still to be screamingly faster than
> .translate.
>
>> translate for byte strings is undoubtedly tons faster.  For byte strings,
>> the translation table is 256 bytes, and the inner loop is a simple lookup.
>
> For my above test, .translate is about 10x faster than iterated .replace.
>
>> But for Unicode, the table is a dict (or something very like it, I looked at
>> the C code, not the Python code).
>>
>> So for every character in the input string, it does a dict-type lookup,
>> before it can even decide if the character is going to change.
>
> The problem can be solved, I'd imagine, for builtin types. Just build
> an internal representation upon calling .translate that's faster. It's
> especially easy in the list case

What "list case"?  list doesn't have a replace() method or translate() 
method.


> -- just build a C array¹ at the start
> mapping int -> int and then have really fast C mapping speeds.

As long as you can afford to have a list with a billion or so entries in 
it.  We are talking about strings and version 3.3, aren't we?  Of 
course, one could always examine the mapping object (table) and see what 
the max value was, and only build a "C array" if it was smaller than say 
50,000.

>
> For dictionaries, you can do the same thing -- you just have to make
> sure you're not breaking any memory barriers.
>
> ¹ I don't do C or other low level languages, so my knowledge in this
> area is embarrassingly bad
>
>> Just for reference, the two files I was looking at were:
>>
>> objects/unicodeobject.c
>> objects/bytesobject.c
>>
>> Extracted from the bz2 downloaded from the page:
>>      http://hg.python.org/cpython
>
> I didn't look at bytes first time, I might take a look later.
>


-- 
DaveA

[toc] | [prev] | [next] | [standalone]


#50981

FromJoshua Landau <joshua@landau.ws>
Date2013-07-20 23:33 +0100
Message-ID<mailman.4933.1374359653.3114.python-list@python.org>
In reply to#50914
On 20 July 2013 22:56, Dave Angel <davea@davea.name> wrote:
> On 07/20/2013 02:37 PM, Joshua Landau wrote:
>>
>> The problem can be solved, I'd imagine, for builtin types. Just build
>> an internal representation upon calling .translate that's faster. It's
>> especially easy in the list case
>
> What "list case"?  list doesn't have a replace() method or translate()
> method.

I mean some_str.translate(some_list).

>> -- just build a C array¹ at the start
>> mapping int -> int and then have really fast C mapping speeds.
>
>
> As long as you can afford to have a list with a billion or so entries in it.
> We are talking about strings and version 3.3, aren't we?  Of course, one
> could always examine the mapping object (table) and see what the max value
> was, and only build a "C array" if it was smaller than say 50,000.

When talking about some_str.translate(some_list), this doesn't apply
very much -- they've already gotten a much bigger Python list.

In the dict case² I don't actually want to jump to the conclusion that
one should do array-based mappings because I can see the obvious
downsides and it's obviously not good to have 100 cases in there,
*but* I still think that there's a solution.

Here are some ideas:
· Latin and ASCII can obviously be done with a C array, and I imagine
that covers at least a fair portion of use-cases.
· If you only have a few characters in the mapping (so sys.getsizeof
is small) then it'll be a lot faster to just iterate through a C list
instead of checking the dict.
· Other cases are:
    · Full-character-set or equiv. mappings, which are already faster
than .replace(). Those should really be re-made into lists so that the
list optimisation can take place, and lists are much faster even in
versions without these hypothetical optimizations, too.
    · Custom objects. There's nothing we can do here.

I realise that this is a lot more code, so it's not something I'm
going to try to force. However, I think it's useful if it stops people
using .replace in a loop ;).

² some_str.translate(some_dict)

[toc] | [prev] | [next] | [standalone]


#50999

FromSerhiy Storchaka <storchaka@gmail.com>
Date2013-07-21 10:44 +0300
Message-ID<mailman.4945.1374392677.3114.python-list@python.org>
In reply to#50914
20.07.13 20:03, Joshua Landau написав(ла):
> Still, it seems to me that it should be optimizable for sensible
> builtin types such that .translate is significantly faster, as there's
> no theoretical extra work that .translate *has* to do that .replace
> does not, and .replace also has to rebuild the string a lot of times.

You should analyze overall mapping and reorder items in right order (if 
it possible), i.e. '&' should be replaced before '<' in html.escape. 
This extra work is too large for most real input.

[toc] | [prev] | [next] | [standalone]


Page 1 of 2  [1] 2  Next page →

Back to top | Article view | comp.lang.python


csiph-web