
Coding challenge: Optimise a custom string encoding

Started by	Alex Willmer <alex@moreati.org.uk>
First post	2014-08-18 12:16 -0700
Last post	2014-08-19 12:00 +0200
Articles	6 — 5 participants


  Coding challenge: Optimise a custom string encoding Alex Willmer <alex@moreati.org.uk> - 2014-08-18 12:16 -0700
    Re: Coding challenge: Optimise a custom string encoding Terry Reedy <tjreedy@udel.edu> - 2014-08-18 16:16 -0400
      Re: Coding challenge: Optimise a custom string encoding Alex Willmer <alex@moreati.org.uk> - 2014-08-18 14:27 -0700
        Re: Coding challenge: Optimise a custom string encoding Peter Otten <__peter__@web.de> - 2014-08-19 01:35 +0200
    Re: Coding challenge: Optimise a custom string encoding Chris Angelico <rosuav@gmail.com> - 2014-08-19 09:28 +1000
    Re: Coding challenge: Optimise a custom string encoding Lele Gaifax <lele@metapensiero.it> - 2014-08-19 12:00 +0200

#76505 — Coding challenge: Optimise a custom string encoding

From	Alex Willmer <alex@moreati.org.uk>
Date	2014-08-18 12:16 -0700
Subject	Coding challenge: Optimise a custom string encoding
Message-ID	<6e869040-98e9-437b-b024-4ffe7abc3054@googlegroups.com>

A challenge, just for fun. Can you speed up this function?

import string

charset = set(string.ascii_letters + string.digits + '@_-')
byteseq = [chr(i) for i in xrange(256)]
bytemap = {byte: byte if byte in charset else '+' + byte.encode('hex')
           for byte in byteseq}

def plus_encode(s):
    """Encode a unicode string with only ascii letters, digits, _, -, @, +
    """
    bytemap_ = bytemap
    s_utf8 = s.encode('utf-8')
    return ''.join([bytemap[byte] for byte in s_utf8])

On my machine (Ubuntu 14.04, CPython 2.7.6, PyPy 2.2.1) this gets

alex@martha:~$ python -m timeit -s 'import plus_encode' 'plus_encode.plus_encode(u"""qwertyuiop1234567890!"£$%^&*()€""")'
100000 loops, best of 3: 2.96 usec per loop

alex@martha:~$ pypy -m timeit -s 'import plus_encode' 'plus_encode.plus_encode(u"""qwertyuiop1234567890!"£$%^&*()€""")'
1000000 loops, best of 3: 1.24 usec per loop

Back story:
Last week we needed a custom encoding to store unicode usernames in a config file that only allowed mixed case ascii, digits, underscore, dash, at-sign and plus sign. We also wanted to keep the encoded usernames somewhat human readable.

My design was utf-8 and a variant of %-escaping, using the plus symbol. So u'alic€123' would be encoded as b'alic+e2+82+ac123'. This evening as a learning exercise I've tried to make it fast. This is the result.
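
The design described above can be sketched in Python 3 roughly as follows. This is an illustrative port, not the code posted in this thread (which targets Python 2); the names SAFE_BYTES, BYTE_TABLE and plus_encode_py3 are made up for the example.

```python
import string

# Byte values that pass through unchanged; '+' is reserved as the escape marker.
SAFE_BYTES = frozenset((string.ascii_letters + string.digits + '@_-').encode('ascii'))

# One entry per byte value: the character itself if safe, else '+' and two hex digits.
BYTE_TABLE = [chr(b) if b in SAFE_BYTES else '+%02x' % b for b in range(256)]


def plus_encode_py3(s):
    """Encode a str using only ASCII letters, digits, '_', '-', '@' and '+'."""
    # Iterating over bytes in Python 3 yields ints, so the table is indexed directly.
    return ''.join([BYTE_TABLE[b] for b in s.encode('utf-8')])


print(plus_encode_py3('alic\u20ac123'))  # alic+e2+82+ac123
```

Because '+' itself is not in the safe set, a literal plus sign in a username is escaped as +2b, which keeps the encoding unambiguous.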

This challenge is just for fun. The chosen solution ended up being

import re

def name_encode(s):
    # re.sub, since the re module has no replace(); the pattern is kept as posted
    return '%s_%s' % (s.encode('utf-8').encode('hex'),
                      re.sub('[A-Za-z0-9]', '', s))

Regards, Alex


#76508

From	Terry Reedy <tjreedy@udel.edu>
Date	2014-08-18 16:16 -0400
Message-ID	<mailman.13113.1408393206.18130.python-list@python.org>
In reply to	#76505

On 8/18/2014 3:16 PM, Alex Willmer wrote:
> A challenge, just for fun. Can you speed up this function?

You should give a specification here, with examples. You should perhaps 
be using .maketrans and .translate.

> import string
>
> charset = set(string.ascii_letters + string.digits + '@_-')
> byteseq = [chr(i) for i in xrange(256)]
> bytemap = {byte: byte if byte in charset else '+' + byte.encode('hex')
>             for byte in byteseq}
>
> def plus_encode(s):
>      """Encode a unicode string with only ascii letters, digits, _, -, @, +
>      """
>      bytemap_ = bytemap
>      s_utf8 = s.encode('utf-8')
>      return ''.join([bytemap[byte] for byte in s_utf8])
>
> On my machine (Ubuntu 14.04, CPython 2.7.6, PyPy 2.2.1) this gets
>
> alex@martha:~$ python -m timeit -s 'import plus_encode' 'plus_encode.plus_encode(u"""qwertyuiop1234567890!"£$%^&*()€""")'
> 100000 loops, best of 3: 2.96 usec per loop
>
> alex@martha:~$ pypy -m timeit -s 'import plus_encode' 'plus_encode.plus_encode(u"""qwertyuiop1234567890!"£$%^&*()€""")'
> 1000000 loops, best of 3: 1.24 usec per loop
>
> Back story:
> Last week we needed a custom encoding to store unicode usernames in a config file that only allowed mixed case ascii, digits, underscore, dash, at-sign and plus sign. We also wanted to keep the encoded usernames somewhat human readable.
>
> My design was utf-8 and a variant of %-escaping, using the plus symbol. So u'alic€123' would be encoded as b'alic+e2+82+ac123'. This evening as a learning exercise I've tried to make it fast. This is the result.
>
> This challenge is just for fun. The chosen solution ended up being
>
> def name_encode(s):
>      return %s_%s' % (s.encode('utf-8').encode('hex'),
>                       re.replace('[A-Za-z0-9]', '', s))
>
> Regards, Alex
>


-- 
Terry Jan Reedy


#76516

From	Alex Willmer <alex@moreati.org.uk>
Date	2014-08-18 14:27 -0700
Message-ID	<ca7d388f-fd15-47bb-a500-b3aa10b707c6@googlegroups.com>
In reply to	#76508

On Monday, 18 August 2014 21:16:26 UTC+1, Terry Reedy  wrote:
> On 8/18/2014 3:16 PM, Alex Willmer wrote:
> > A challenge, just for fun. Can you speed up this function?
> 
> You should give a specification here, with examples. You should perhaps 

Sorry, the (informal) spec was further down.

> > a custom encoding to store unicode usernames in a config file that only allowed mixed case ascii, digits, underscore, dash, at-sign and plus sign. We also wanted to keep the encoded usernames somewhat human readable.

> > My design was utf-8 and a variant of %-escaping, using the plus symbol. So u'alic€123' would be encoded as b'alic+e2+82+ac123'.

Other examples:
>>> plus_encode(u'alice')
'alice'
>>> plus_encode(u'Bacon & eggs only $19.95')
'Bacon+20+26+20eggs+20only+20+2419+2e95'
>>> plus_encode(u'üｎïｃｏԁë')
'+c3+bc+ef+bd+8e+c3+af+ef+bd+83+ef+bd+8f+d4+81+c3+ab'

> You should perhaps be using .maketrans and .translate.

That wouldn't work: maketrans() can only map single bytes to other single bytes. Encoding 256 possible source bytes with 66 possible symbols requires a multi-symbol expansion of some or all source bytes.
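
Both halves of that argument can be checked quickly. The sketch below uses Python 3, where bytes.maketrans is taken as the counterpart of Python 2's string.maketrans; the helper name try_expanding_maketrans is made up for illustration.

```python
import string

# The permitted output alphabet: 52 letters + 10 digits + '@', '_', '-', '+'.
alphabet = set(string.ascii_letters + string.digits + '@_-+')
print(len(alphabet))  # 66


def try_expanding_maketrans():
    """Return the error raised when one byte is mapped to a longer sequence."""
    try:
        # A space would need to become the three bytes '+20'.
        bytes.maketrans(b' ', b'+20')
    except ValueError as exc:
        return exc
    return None


print(type(try_expanding_maketrans()).__name__)  # ValueError
```

With only 66 output symbols for 256 input byte values, a one-to-one table cannot be reversible, so at least some bytes have to expand.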


#76524

From	Peter Otten <__peter__@web.de>
Date	2014-08-19 01:35 +0200
Message-ID	<mailman.13122.1408404927.18130.python-list@python.org>
In reply to	#76516

Alex Willmer wrote:

> On Monday, 18 August 2014 21:16:26 UTC+1, Terry Reedy  wrote:
>> On 8/18/2014 3:16 PM, Alex Willmer wrote:
>> > A challenge, just for fun. Can you speed up this function?
>> 
>> You should give a specification here, with examples. You should perhaps
> 
> Sorry, the (informal) spec was further down.
> 
>> > a custom encoding to store unicode usernames in a config file that only
>> > allowed mixed case ascii, digits, underscore, dash, at-sign and plus
>> > sign. We also wanted to keep the encoded usernames somewhat human
>> > readable.
> 
>> > My design was utf-8 and a variant of %-escaping, using the plus symbol.
>> > So u'alic€123' would be encoded as b'alic+e2+82+ac123'.
> 
> Other examples:
> >>> plus_encode(u'alice')
> 'alice'
> >>> plus_encode(u'Bacon & eggs only $19.95')
> 'Bacon+20+26+20eggs+20only+20+2419+2e95'
> >>> plus_encode(u'üｎïｃｏԁë')
> '+c3+bc+ef+bd+8e+c3+af+ef+bd+83+ef+bd+8f+d4+81+c3+ab'
> 
>> You should perhaps be using .maketrans and .translate.
> 
> That wouldn't work, maketrans() can only map single bytes to other single
> bytes. To encode 256 possible source bytes with 66 possible symbols
> requires a multi-symbol expansion of some or all source bytes.

You can do the translation in unicode, but you have to cope with a big 
translation table and the speed-up doesn't seem to be worthwhile:

$ cat plus_encode.py
# -*- coding: utf-8 -*-
import string

charset = set(string.ascii_letters + string.digits + '@_-')
byteseq = [chr(i) for i in xrange(256)]
bytemap = {byte: byte if byte in charset else '+' + byte.encode('hex')
           for byte in byteseq}

def plus_encode(s):
    """Encode a unicode string with only ascii letters, digits, _, -, @, +
    """
    bytemap_ = bytemap
    s_utf8 = s.encode('utf-8')
    return ''.join([bytemap[byte] for byte in s_utf8])


import sys
from itertools import imap as map

MAXUNICODE = 9000 #should be sys.maxunicode
ucharset = set(c.decode("ascii") for c in charset)
xmap = [u if u in ucharset else 
        u"".join("+" + c.encode("hex") for c in u.encode("utf-8"))
        for u in map(unichr, xrange(MAXUNICODE))]

def plus_encode2(s):
    return s.translate(xmap).encode("ascii")

if __name__ == "__main__":
    sample = u"".join(map(unichr, range(MAXUNICODE))) + u"€"
    assert plus_encode(sample) == plus_encode2(sample)
$ python plus_encode.py
$ python -m timeit -s 'import plus_encode' 'plus_encode.plus_encode(u"""qwertyuiop1234567890!"£$%^&*()\n{EURO-SIGN}""")'
100000 loops, best of 3: 10.6 usec per loop
$ python -m timeit -s 'import plus_encode' 'plus_encode.plus_encode2(u"""qwertyuiop1234567890!"£$%^&*()\n{EURO-SIGN}""")'
100000 loops, best of 3: 3.74 usec per loop

A smaller table is possible, but costs time:

ymap = [u if u in ucharset else
        u"+" + u.encode("latin1").encode("hex").decode("ascii")
        for u in map(unichr, xrange(256))]
def plus_encode3(s):
    return s.encode("utf-8").decode("latin1").translate(ymap).encode("ascii")

$ python -m timeit -s 'import plus_encode' 'plus_encode.plus_encode3(u"""qwertyuiop1234567890!"£$%^&*()\n{EURO-SIGN}""")'
100000 loops, best of 3: 5.91 usec per loop
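
For readers on Python 3, roughly the same small-table idea can be written with a dict passed to str.translate, which accepts replacement strings of any length. This is a sketch only and has not been benchmarked; ESCAPES and plus_encode_translate are names invented for the example.

```python
import string

SAFE_CHARS = set(string.ascii_letters + string.digits + '@_-')

# Map each unsafe code point below 256 to '+' plus two hex digits.
# Code points missing from the dict are left untouched by str.translate.
ESCAPES = {i: '+%02x' % i for i in range(256) if chr(i) not in SAFE_CHARS}


def plus_encode_translate(s):
    # Decoding the UTF-8 bytes as latin-1 gives one character per byte,
    # so a 256-entry table covers every possible input.
    return s.encode('utf-8').decode('latin-1').translate(ESCAPES)


print(plus_encode_translate('alic\u20ac123'))  # alic+e2+82+ac123
```

The latin-1 round trip is what keeps the table small: every byte value 0-255 maps to exactly one code point in that range.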


#76523

From	Chris Angelico <rosuav@gmail.com>
Date	2014-08-19 09:28 +1000
Message-ID	<mailman.13121.1408404495.18130.python-list@python.org>
In reply to	#76505

On Tue, Aug 19, 2014 at 5:16 AM, Alex Willmer <alex@moreati.org.uk> wrote:
> Back story:
> Last week we needed a custom encoding to store unicode usernames in a config file that only allowed mixed case ascii, digits, underscore, dash, at-sign and plus sign. We also wanted to keep the encoded usernames somewhat human readable.
>

If you can drop the "somewhat human readable" requirement, this fits
perfectly into a Base 64 encoding. All you need to do is this:

>>> import base64
>>> base64.b64encode("alic€123".encode(),b"+@").replace(b'=',b'-')
b'YWxpY+KCrDEyMw--'

The second argument replaces the last two characters of the alphabet,
normally + and /, with + and @. (The final replace() is needed because
Python's b64encode doesn't allow customization of the padding
character. Alternatively, you could simply rstrip() the padding and
reinstate it when decoding by rounding the encoded length up to a
multiple of four.)

Decoding is, obviously, the reverse:

>>> base64.b64decode(_.replace(b'-',b'='),b"+@").decode()
'alic€123'

This is done in Python 3, not Python 2. But I expect it'll work the
same way in 2.7.
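
The rstrip() alternative mentioned above might look like this in Python 3. It is a sketch under the assumption that dropping the padding is acceptable; the function names b64_name_encode and b64_name_decode are invented for illustration.

```python
import base64

ALTCHARS = b'+@'  # keep '+', use '@' in place of '/'


def b64_name_encode(name):
    """Base64-encode a str with the custom alphabet and no '=' padding."""
    encoded = base64.b64encode(name.encode('utf-8'), ALTCHARS)
    return encoded.rstrip(b'=').decode('ascii')


def b64_name_decode(text):
    """Reverse b64_name_encode, restoring the stripped padding first."""
    # Pad the encoded text back up to a multiple of four characters.
    padded = text + '=' * (-len(text) % 4)
    return base64.b64decode(padded.encode('ascii'), ALTCHARS).decode('utf-8')


print(b64_name_encode('alic\u20ac123'))   # YWxpY+KCrDEyMw
print(b64_name_decode('YWxpY+KCrDEyMw'))  # alic€123
```

Stripping the padding also leaves '-' out of the output entirely, so the result uses only letters, digits, '+' and '@'.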

ChrisA


#76555

From	Lele Gaifax <lele@metapensiero.it>
Date	2014-08-19 12:00 +0200
Message-ID	<mailman.13140.1408442459.18130.python-list@python.org>
In reply to	#76505

Alex Willmer <alex@moreati.org.uk> writes:

> def plus_encode(s):
>     """Encode a unicode string with only ascii letters, digits, _, -, @, +
>     """
>     bytemap_ = bytemap
>     s_utf8 = s.encode('utf-8')
>     return ''.join([bytemap[byte] for byte in s_utf8])

Minor nit: you defined a local alias for bytemap for faster access, but
didn't actually use it.
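
To see whether the alias makes any measurable difference, something like the following Python 3 sketch could be used. It is not the posted code: the table is rebuilt for Python 3, and encode_global, encode_local and compare are names made up for this example. No timings are claimed here, since they depend on the interpreter and machine.

```python
import string
import timeit

SAFE = frozenset((string.ascii_letters + string.digits + '@_-').encode('ascii'))
bytemap = [chr(b) if b in SAFE else '+%02x' % b for b in range(256)]


def encode_global(s):
    # Looks up the module-level table on every iteration.
    return ''.join([bytemap[b] for b in s.encode('utf-8')])


def encode_local(s):
    # Binds the table to a local name once and actually uses the alias.
    bytemap_ = bytemap
    return ''.join([bytemap_[b] for b in s.encode('utf-8')])


def compare(sample='qwertyuiop1234567890!"\u00a3$%^&*()\u20ac', number=100000):
    """Return (global_seconds, local_seconds) for `number` calls of each variant."""
    return (timeit.timeit(lambda: encode_global(sample), number=number),
            timeit.timeit(lambda: encode_local(sample), number=number))
```

Calling compare() returns the two totals in seconds, so the variants can be compared on whichever interpreter is at hand.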

ciao, lele.
-- 
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
lele@metapensiero.it  |                 -- Fortunato Depero, 1929.

