| Started by | Alex Willmer <alex@moreati.org.uk> |
|---|---|
| First post | 2014-08-18 12:16 -0700 |
| Last post | 2014-08-19 12:00 +0200 |
| Articles | 6 — 5 participants |
Coding challenge: Optimise a custom string encoding Alex Willmer <alex@moreati.org.uk> - 2014-08-18 12:16 -0700
Re: Coding challenge: Optimise a custom string encoding Terry Reedy <tjreedy@udel.edu> - 2014-08-18 16:16 -0400
Re: Coding challenge: Optimise a custom string encoding Alex Willmer <alex@moreati.org.uk> - 2014-08-18 14:27 -0700
Re: Coding challenge: Optimise a custom string encoding Peter Otten <__peter__@web.de> - 2014-08-19 01:35 +0200
Re: Coding challenge: Optimise a custom string encoding Chris Angelico <rosuav@gmail.com> - 2014-08-19 09:28 +1000
Re: Coding challenge: Optimise a custom string encoding Lele Gaifax <lele@metapensiero.it> - 2014-08-19 12:00 +0200
| From | Alex Willmer <alex@moreati.org.uk> |
|---|---|
| Date | 2014-08-18 12:16 -0700 |
| Subject | Coding challenge: Optimise a custom string encoding |
| Message-ID | <6e869040-98e9-437b-b024-4ffe7abc3054@googlegroups.com> |
A challenge, just for fun. Can you speed up this function?
import string

charset = set(string.ascii_letters + string.digits + '@_-')
byteseq = [chr(i) for i in xrange(256)]
bytemap = {byte: byte if byte in charset else '+' + byte.encode('hex')
           for byte in byteseq}

def plus_encode(s):
    """Encode a unicode string with only ascii letters, digits, _, -, @, +
    """
    bytemap_ = bytemap
    s_utf8 = s.encode('utf-8')
    return ''.join([bytemap[byte] for byte in s_utf8])
On my machine (Ubuntu 14.04, CPython 2.7.6, PyPy 2.2.1) this gets
alex@martha:~$ python -m timeit -s 'import plus_encode' 'plus_encode.plus_encode(u"""qwertyuiop1234567890!"£$%^&*()€""")'
100000 loops, best of 3: 2.96 usec per loop
alex@martha:~$ pypy -m timeit -s 'import plus_encode' 'plus_encode.plus_encode(u"""qwertyuiop1234567890!"£$%^&*()€""")'
1000000 loops, best of 3: 1.24 usec per loop
Back story:
Last week we needed a custom encoding to store unicode usernames in a config file that only allowed mixed case ascii, digits, underscore, dash, at-sign and plus sign. We also wanted to keep the encoded usernames somewhat human readable.
My design was utf-8 and a variant of %-escaping, using the plus symbol. So u'alic€123' would be encoded as b'alic+e2+82+ac123'. This evening as a learning exercise I've tried to make it fast. This is the result.
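For readers on Python 3, the design described above can be sketched roughly as follows. This is an illustrative port, not the code from the post: the names SAFE_BYTES and BYTE_MAP are assumptions introduced here, and plus_decode is a hypothetical inverse that does not appear anywhere in the thread.

```python
import string

# Bytes that pass through unchanged (same character set as the post).
SAFE_BYTES = frozenset((string.ascii_letters + string.digits + '@_-').encode('ascii'))

# One precomputed output string for each possible byte value.
BYTE_MAP = [chr(b) if b in SAFE_BYTES else '+%02x' % b for b in range(256)]

def plus_encode(s):
    """Encode text using only ASCII letters, digits, _, -, @ and +."""
    return ''.join([BYTE_MAP[b] for b in s.encode('utf-8')])

def plus_decode(encoded):
    """Hypothetical inverse: turn each +xx escape back into one byte."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == '+':
            # The two characters after '+' are the hex value of the byte.
            out.append(int(encoded[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(ord(encoded[i]))
            i += 1
    return out.decode('utf-8')
```

Because '+' itself is not in the safe set, it is always escaped as +2b, so a decoder of this shape can treat every '+' as the start of an escape.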
This challenge is just for fun. The chosen solution ended up being
def name_encode(s):
    return '%s_%s' % (s.encode('utf-8').encode('hex'),
                      re.sub('[A-Za-z0-9]', '', s))
Regards, Alex
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2014-08-18 16:16 -0400 |
| Message-ID | <mailman.13113.1408393206.18130.python-list@python.org> |
| In reply to | #76505 |
On 8/18/2014 3:16 PM, Alex Willmer wrote:
> A challenge, just for fun. Can you speed up this function?
You should give a specification here, with examples. You should perhaps
be using .maketrans and .translate.
> import string
>
> charset = set(string.ascii_letters + string.digits + '@_-')
> byteseq = [chr(i) for i in xrange(256)]
> bytemap = {byte: byte if byte in charset else '+' + byte.encode('hex')
>            for byte in byteseq}
>
> def plus_encode(s):
>     """Encode a unicode string with only ascii letters, digits, _, -, @, +
>     """
>     bytemap_ = bytemap
>     s_utf8 = s.encode('utf-8')
>     return ''.join([bytemap[byte] for byte in s_utf8])
>
> On my machine (Ubuntu 14.04, CPython 2.7.6, PyPy 2.2.1) this gets
>
> alex@martha:~$ python -m timeit -s 'import plus_encode' 'plus_encode.plus_encode(u"""qwertyuiop1234567890!"£$%^&*()€""")'
> 100000 loops, best of 3: 2.96 usec per loop
>
> alex@martha:~$ pypy -m timeit -s 'import plus_encode' 'plus_encode.plus_encode(u"""qwertyuiop1234567890!"£$%^&*()€""")'
> 1000000 loops, best of 3: 1.24 usec per loop
>
> Back story:
> Last week we needed a custom encoding to store unicode usernames in a config file that only allowed mixed case ascii, digits, underscore, dash, at-sign and plus sign. We also wanted to keep the encoded usernames somewhat human readable.
>
> My design was utf-8 and a variant of %-escaping, using the plus symbol. So u'alic€123' would be encoded as b'alic+e2+82+ac123'. This evening as a learning exercise I've tried to make it fast. This is the result.
>
> This challenge is just for fun. The chosen solution ended up being
>
> def name_encode(s):
>     return '%s_%s' % (s.encode('utf-8').encode('hex'),
>                       re.sub('[A-Za-z0-9]', '', s))
>
> Regards, Alex
>
--
Terry Jan Reedy
| From | Alex Willmer <alex@moreati.org.uk> |
|---|---|
| Date | 2014-08-18 14:27 -0700 |
| Message-ID | <ca7d388f-fd15-47bb-a500-b3aa10b707c6@googlegroups.com> |
| In reply to | #76508 |
On Monday, 18 August 2014 21:16:26 UTC+1, Terry Reedy wrote:
> On 8/18/2014 3:16 PM, Alex Willmer wrote:
> > A challenge, just for fun. Can you speed up this function?
>
> You should give a specification here, with examples. You should perhaps

Sorry, the (informal) spec was further down.

> > a custom encoding to store unicode usernames in a config file that only allowed mixed case ascii, digits, underscore, dash, at-sign and plus sign. We also wanted to keep the encoded usernames somewhat human readable.
> > My design was utf-8 and a variant of %-escaping, using the plus symbol. So u'alic€123' would be encoded as b'alic+e2+82+ac123'.

Other examples:
>>> plus_encode(u'alice')
'alice'
>>> plus_encode(u'Bacon & eggs only $19.95')
'Bacon+20+26+20eggs+20only+20+2419+2e95'
>>> plus_encode(u'üｎïｃｏԁë')
'+c3+bc+ef+bd+8e+c3+af+ef+bd+83+ef+bd+8f+d4+81+c3+ab'

> You should perhaps be using .maketrans and .translate.

That wouldn't work: maketrans() can only map single bytes to other single bytes. To encode 256 possible source bytes with 66 possible symbols requires a multi-symbol expansion of some or all source bytes.
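The counting argument in this reply can be checked with a few lines. The sketch below assumes Python 3 (the thread itself uses Python 2.7) and only illustrates the one-byte-to-one-byte restriction of bytes.maketrans(); the variable names are introduced here for illustration.

```python
import string

# The allowed alphabet: letters, digits, and the four punctuation symbols.
alphabet = set(string.ascii_letters + string.digits + '@_-+')
print(len(alphabet))  # 66, far fewer than the 256 possible byte values

# maketrans() insists on equal-length arguments: one byte maps to one byte,
# so a space cannot be mapped to the three-byte escape "+20".
try:
    bytes.maketrans(b' ', b'+20')
    maketrans_ok = True
except ValueError:
    maketrans_ok = False
print(maketrans_ok)  # False
```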
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2014-08-19 01:35 +0200 |
| Message-ID | <mailman.13122.1408404927.18130.python-list@python.org> |
| In reply to | #76516 |
Alex Willmer wrote:
> On Monday, 18 August 2014 21:16:26 UTC+1, Terry Reedy wrote:
>> On 8/18/2014 3:16 PM, Alex Willmer wrote:
>> > A challenge, just for fun. Can you speed up this function?
>>
>> You should give a specification here, with examples. You should perhaps
>
> Sorry, the (informal) spec was further down.
>
>> > a custom encoding to store unicode usernames in a config file that only
>> > allowed mixed case ascii, digits, underscore, dash, at-sign and plus
>> > sign. We also wanted to keep the encoded usernames somewhat human
>> > readable.
>
>> > My design was utf-8 and a variant of %-escaping, using the plus symbol.
>> > So u'alic€123' would be encoded as b'alic+e2+82+ac123'.
>
> Other examples:
> >>> plus_encode(u'alice')
> 'alice'
> >>> plus_encode(u'Bacon & eggs only $19.95')
> 'Bacon+20+26+20eggs+20only+20+2419+2e95'
> >>> plus_encode(u'üｎïｃｏԁë')
> '+c3+bc+ef+bd+8e+c3+af+ef+bd+83+ef+bd+8f+d4+81+c3+ab'
>
>> You should perhaps be using .maketrans and .translate.
>
> That wouldn't work, maketrans() can only map single bytes to other single
> bytes. To encode 256 possible source bytes with 66 possible symbols
> requires a multi-symbol expansion of some or all source bytes.
You can do the translation in unicode, but you have to cope with a big
translation table and the speed-up doesn't seem to be worthwhile:
$ cat plus_encode.py
# -*- coding: utf-8 -*-
import string

charset = set(string.ascii_letters + string.digits + '@_-')
byteseq = [chr(i) for i in xrange(256)]
bytemap = {byte: byte if byte in charset else '+' + byte.encode('hex')
           for byte in byteseq}

def plus_encode(s):
    """Encode a unicode string with only ascii letters, digits, _, -, @, +
    """
    bytemap_ = bytemap
    s_utf8 = s.encode('utf-8')
    return ''.join([bytemap[byte] for byte in s_utf8])

import sys
from itertools import imap as map

MAXUNICODE = 9000 #should be sys.maxunicode
ucharset = set(c.decode("ascii") for c in charset)
xmap = [u if u in ucharset else
        u"".join("+" + c.encode("hex") for c in u.encode("utf-8"))
        for u in map(unichr, xrange(MAXUNICODE))]

def plus_encode2(s):
    return s.translate(xmap).encode("ascii")

if __name__ == "__main__":
    sample = u"".join(map(unichr, range(MAXUNICODE))) + u"€"
    assert plus_encode(sample) == plus_encode2(sample)
$ python plus_encode.py
$ python -m timeit -s 'import plus_encode' 'plus_encode.plus_encode(u"""qwertyuiop1234567890!"£$%^&*()\n{EURO-SIGN}""")'
100000 loops, best of 3: 10.6 usec per loop
$ python -m timeit -s 'import plus_encode' 'plus_encode.plus_encode2(u"""qwertyuiop1234567890!"£$%^&*()\n{EURO-SIGN}""")'
100000 loops, best of 3: 3.74 usec per loop
A smaller table is possible, but costs time:
ymap = [u if u in ucharset else
        u"+" + u.encode("latin1").encode("hex").decode("ascii")
        for u in map(unichr, xrange(256))]

def plus_encode3(s):
    return s.encode("utf-8").decode("latin1").translate(ymap).encode("ascii")
$ python -m timeit -s 'import plus_encode' 'plus_encode.plus_encode3(u"""qwertyuiop1234567890!"£$%^&*()\n{EURO-SIGN}""")'
100000 loops, best of 3: 5.91 usec per loop
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-08-19 09:28 +1000 |
| Message-ID | <mailman.13121.1408404495.18130.python-list@python.org> |
| In reply to | #76505 |
On Tue, Aug 19, 2014 at 5:16 AM, Alex Willmer <alex@moreati.org.uk> wrote:
> Back story:
> Last week we needed a custom encoding to store unicode usernames in a config file that only allowed mixed case ascii, digits, underscore, dash, at-sign and plus sign. We also wanted to keep the encoded usernames somewhat human readable.
>
If you can drop the "somewhat human readable" requirement, this fits
perfectly into a Base 64 encoding. All you need to do is this:
>>> import base64
>>> base64.b64encode("alic€123".encode(),b"+@").replace(b'=',b'-')
b'YWxpY+KCrDEyMw--'
The second argument specifies that the last two symbols of the alphabet
are + and @ rather than the usual + and /. (The last step is because
Python's b64encode doesn't allow customization of the padding
character. Alternatively, you could simply rstrip() them, and
reinstate them when decoding by padding the length back up to a
multiple of four characters.)
Decoding is, obviously, the reverse:
>>> base64.b64decode(_.replace(b'-',b'='),b"+@").decode()
'alic€123'
This is done in Python 3, not Python 2. But I expect it'll work the
same way in 2.7.
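The rstrip() alternative mentioned above might look like this in Python 3. It is a sketch only: the function names b64_name_encode and b64_name_decode and the constant ALTCHARS are assumptions made for illustration.

```python
import base64

# '+' and '@' stand in for the usual '+' and '/' of the Base 64 alphabet.
ALTCHARS = b'+@'

def b64_name_encode(name):
    raw = base64.b64encode(name.encode('utf-8'), ALTCHARS)
    # Drop the '=' padding, which is not in the allowed character set.
    return raw.rstrip(b'=').decode('ascii')

def b64_name_decode(token):
    # Restore the padding: Base 64 text is always a multiple of 4 characters.
    padded = token + '=' * (-len(token) % 4)
    return base64.b64decode(padded.encode('ascii'), ALTCHARS).decode('utf-8')
```

Dropping the padding keeps '-' out of the output entirely, at the cost of recomputing the padding on the way back in.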
ChrisA
| From | Lele Gaifax <lele@metapensiero.it> |
|---|---|
| Date | 2014-08-19 12:00 +0200 |
| Message-ID | <mailman.13140.1408442459.18130.python-list@python.org> |
| In reply to | #76505 |
Alex Willmer <alex@moreati.org.uk> writes:
> def plus_encode(s):
> """Encode a unicode string with only ascii letters, digits, _, -, @, +
> """
> bytemap_ = bytemap
> s_utf8 = s.encode('utf-8')
> return ''.join([bytemap[byte] for byte in s_utf8])
Minor nit: you defined a local alias for bytemap for faster access, but
didn't actually use it.
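For illustration, this is roughly how the function might read with the alias actually used. It is ported to Python 3 as an assumption (bytes iterate as integers there, so the table is indexed by integer), and the names SAFE_BYTES and BYTEMAP are hypothetical. Whether the alias is measurably faster is not something this sketch demonstrates.

```python
import string

SAFE_BYTES = frozenset((string.ascii_letters + string.digits + '@_-').encode('ascii'))
BYTEMAP = [chr(b) if b in SAFE_BYTES else '+%02x' % b for b in range(256)]

def plus_encode(s):
    """Encode text using only ASCII letters, digits, _, -, @ and +."""
    bytemap_ = BYTEMAP  # local alias, looked up once per call
    s_utf8 = s.encode('utf-8')
    # The comprehension now reads the alias instead of the module global.
    return ''.join([bytemap_[byte] for byte in s_utf8])
```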
ciao, lele.
--
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
lele@metapensiero.it | -- Fortunato Depero, 1929.