Re: Chardet, file, ... and the Flexible String Representation

From random832@fastmail.us
References <4ce85ea8-4a4c-46cf-a546-ad999576a5f7@googlegroups.com> <m2a9jqq7g9.fsf@cochabamba.vanoostrum.org> <04abbe99-ca1e-40b5-86c7-64b0e5d9de9c@googlegroups.com>
Date 2013-09-10 11:36 -0400
Newsgroups comp.lang.python
Message-ID <mailman.220.1378827397.5461.python-list@python.org>

On Mon, Sep 9, 2013, at 10:28, wxjmfauth@gmail.com wrote:
*time performance differences*
> 
> Comment: Such differences never happen with utf.

Why is this bad? Keeping in mind that otherwise they would all be almost
as slow as the UCS-4 case.

> >>> sys.getsizeof('a')
> 26
> >>> sys.getsizeof('€')
> 40
> >>> sys.getsizeof('\U0001d11e')
> 44
> 
> Comment: 18 bytes more than latin-1
> 
> Comment: With utf, a char (in string or not) never exceed 4 

A string is an object and needs to store the length, along with any
overhead relating to object headers. I believe there is also an appended
null character. Also, ASCII strings are stored differently from Latin-1
strings.

>>> sys.getsizeof('a'*999)
1048 = 49 bytes overhead, 1 byte per character.
>>> sys.getsizeof('\xa4'*999)
1072 = 73 bytes overhead, 1 byte per character.
>>> sys.getsizeof('\u20ac'*999)
2072 = 74 bytes overhead, 2 bytes per character.
>>> sys.getsizeof('\U0001d11e'*999)
4072 = 76 bytes overhead, 4 bytes per character.

(I bet sys.getsizeof('\xa4') will return 38 on your system, so 44 is
only six bytes more, not 18)
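
For anyone who wants to reproduce these figures on their own build, here
is a minimal sketch, assuming CPython 3.3 or later (where the FSR
applies). The helper name measure is mine, not part of the standard
library. It compares two lengths so that the fixed overhead cancels out.
The overhead it reports will differ between 32-bit and 64-bit builds and
between Python versions.

```python
import sys

def measure(ch, n=999):
    """Return (bytes_per_char, overhead_bytes) for strings made of ch.

    Hypothetical helper: assumes CPython 3.3+ and that a string's size
    grows linearly with its length.
    """
    short = sys.getsizeof(ch * n)
    long_ = sys.getsizeof(ch * (2 * n))
    per_char = (long_ - short) // n   # the fixed overhead cancels out
    overhead = short - per_char * n   # whatever is left is overhead
    return per_char, overhead

for ch in ('a', '\xa4', '\u20ac', '\U0001d11e'):
    per_char, overhead = measure(ch)
    print('U+%04X: %d byte(s) per character, %d bytes overhead'
          % (ord(ch), per_char, overhead))
```

The per-character widths should come out as 1, 1, 2 and 4; the overhead
column is the part that depends on the build.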

If we did not have the FSR, everything would be 4 bytes per character.
We might have less overhead, but a string only has to be 25 characters
long before the savings from the shorter representation outweigh even
having _no_ overhead, and every four bytes of overhead reduces that
number by one. And you have a 32-bit Python build, which has less
overhead than mine - in yours, strings only have to be seven characters
long for the FSR to be worth it. Assume the minimum possible overhead is
two words for the object header, a size, and a pointer - i.e. sixteen
bytes, compared to the 25 you've demonstrated for ASCII, and strings
only need to be _two_ characters long for the FSR to be a better deal
than always using UCS-4 strings.
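
The break-even arithmetic can be checked with a few lines. This is only
a sketch under stated assumptions: break_even is a made-up helper, the
size model is simply overhead plus width times length, and the overhead
figures you feed it are whatever sys.getsizeof shows on your own build.

```python
def break_even(fsr_overhead, fixed_overhead=0, fsr_width=1, fixed_width=4):
    """Smallest length at which the narrower string is strictly smaller.

    Hypothetical model: size = overhead + width * length, in bytes.
    """
    if fsr_width >= fixed_width:
        raise ValueError('the narrower width must be below the fixed width')
    n = 0
    # Keep growing the string until the narrow form wins outright.
    while fsr_overhead + fsr_width * n >= fixed_overhead + fixed_width * n:
        n += 1
    return n

# 73 bytes of overhead (1072 - 999 in the measurement above) at 1 byte
# per character, against a hypothetical 4-bytes-per-character string
# that has no overhead at all.
print(break_even(73))
```

With those inputs the result is 25, the figure given above for a 64-bit
build. Other thresholds depend on which overhead values are plugged in.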

The need for four-byte-per-character strings would not go away by
eliminating the FSR, so you're basically saying that everything should
be constrained to the worst-case performance scenario.
