Re: Blog "about python 3"

References (16 earlier) <3519f85e-0909-4f5a-9a6e-09b6fd4c312d@googlegroups.com> <mailman.4915.1388875627.18130.python-list@python.org> <d8438ee4-1429-4855-9d78-b833f4f2748f@googlegroups.com> <mailman.4976.1388960067.18130.python-list@python.org> <2fbf4f89-caaa-4fab-8d7e-ff7ef84029a2@googlegroups.com>
Date 2014-01-08 09:38 +1100
Subject Re: Blog "about python 3"
From Tim Delaney <timothy.c.delaney@gmail.com>
Newsgroups comp.lang.python
Message-ID <mailman.5152.1389134341.18130.python-list@python.org> (permalink)


On 8 January 2014 00:34, <wxjmfauth@gmail.com> wrote:

>
> Point 2: This Flexible String Representation does no
> "effectuate" any memory optimization. It only succeeds
> to do the opposite of what a corrrect usage of utf*
> do.
>

UTF-8 is a variable-width encoding that uses fewer bytes for code points
with lower numerical values, on a per-character basis: a code point
<= U+007F is encoded in a single byte; <= U+07FF in two bytes; <= U+FFFF
in three bytes; and anything up to U+10FFFF (the highest Unicode code
point) in four bytes. The original UTF-8 design allowed sequences of up to
6 bytes for values >= U+4000000, but RFC 3629 restricts UTF-8 to the
Unicode range.
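
As a small illustration of the per-character rule, here is a sketch that
assumes the modern definition of UTF-8 (RFC 3629, which stops at U+10FFFF
and therefore at 4 bytes). The function name utf8_length is made up for
this example; the result is cross-checked against Python's own encoder.

```python
def utf8_length(code_point):
    """Bytes UTF-8 (as restricted by RFC 3629) needs for one code point."""
    if code_point < 0:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x10FFFF:
        return 4
    raise ValueError("not a Unicode code point")


# Sample code points, one from each width class (surrogates are avoided
# because they cannot be encoded on their own).
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    encoded = chr(cp).encode("utf-8")
    assert utf8_length(cp) == len(encoded)
    print("U+%04X -> %d byte(s)" % (cp, utf8_length(cp)))
```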

FSR is a variable-width memory structure that uses the width of the code
point with the highest numerical value in the string e.g. if all code
points in the string are <= U+00FF a single byte will be used per
character; if all code points are <= U+FFFF two bytes will be used per
character; and in all other cases 4 bytes will be used per character.
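
The per-string rule can be sketched in a few lines. This is a model of the
rule as described above, not CPython's actual implementation; the names
fsr_bytes_per_char and measured_bytes_per_char are made up for the example,
and the sys.getsizeof cross-check assumes CPython 3.3 or later.

```python
import sys


def fsr_bytes_per_char(s):
    """Width the FSR would pick for s, following the rule described above."""
    highest = max(map(ord, s)) if s else 0
    if highest <= 0xFF:
        return 1
    if highest <= 0xFFFF:
        return 2
    return 4


def measured_bytes_per_char(s):
    """Estimate the real width by doubling the string (CPython-specific).

    s and s + s share the same widest code point, so the fixed header
    cancels out and only the per-character cost remains.
    """
    return (sys.getsizeof(s + s) - sys.getsizeof(s)) // len(s)


for sample in ("abc", "caf\u00e9", "\u0101bc", "abc\U0001F600"):
    print(ascii(sample), fsr_bytes_per_char(sample),
          measured_bytes_per_char(sample))
```

Note that a single wide character is enough to widen every character in
the string, which is the cost the FSR pays for fixed-width indexing.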

In terms of memory usage the difference is that UTF-8 varies its width
per-character, whereas the FSR varies its width per-string. For any
particular string, UTF-8 may well result in using less memory than the FSR,
but in other (quite common) cases the FSR will use less memory than UTF-8
e.g. if the string contains only code points <= U+00FF, but some are
between U+0080 and U+00FF (inclusive): each of those takes two bytes in
UTF-8 but only one in the FSR.
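
To make the comparison concrete, the following sketch computes the payload
(character data only, ignoring object overhead) under both schemes. The
FSR side is a simplified model of the rule described above, and the helper
names are invented for this example.

```python
def fsr_payload(s):
    # Assumed model of the FSR: one width for the whole string,
    # chosen from the highest code point it contains.
    top = max(map(ord, s)) if s else 0
    width = 1 if top <= 0xFF else 2 if top <= 0xFFFF else 4
    return len(s) * width


def utf8_payload(s):
    # UTF-8 picks a width for each character independently.
    return len(s.encode("utf-8"))


samples = [
    ("ASCII only", "a" * 10),
    ("Latin-1 only", "\u00e9" * 10),
    ("ASCII plus one U+0101", "a" * 9 + "\u0101"),
]
for label, s in samples:
    print("%-22s FSR=%2d  UTF-8=%2d" % (label, fsr_payload(s), utf8_payload(s)))
```

The second sample is the case where the FSR wins (10 bytes against 20);
the third is the case where UTF-8 wins (11 bytes against 20).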

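In most cases the FSR uses the same or less memory than earlier versions of
Python 3 and correctly handles all code points (just like UTF-8). In the
cases where the FSR uses more memory than previously, the previous
behaviour was incorrect: narrow builds (the default on Windows) stored
strings as 16-bit code units, so a code point above U+FFFF became a
surrogate pair that len() and indexing treated as two characters.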
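
A rough way to see the old problem on a current interpreter is to count
UTF-16 code units, which is what a narrow build of Python 3.2 or earlier
would have reported as the length. This is only an emulation (it assumes
Python 3.3 or later and does not run an old build).

```python
s = "\U0001F600"  # a single code point outside the Basic Multilingual Plane

# With the FSR, the string holds exactly one code point.
fsr_length = len(s)

# Emulate a narrow build: strings were sequences of 16-bit code units,
# so this character was stored as a surrogate pair.
narrow_length = len(s.encode("utf-16-le")) // 2

print("FSR length:", fsr_length)
print("narrow-build length (emulated):", narrow_length)
```

The two code units cost 4 bytes, the same as the FSR's 4-byte character,
but every other character in the string stayed at 2 bytes, which is where
the old representation saved memory at the price of wrong answers.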

No matter which representation is used, there will be a certain amount of
overhead (which is the majority of what most of your examples have shown).
Here are examples which demonstrate cases where UTF-8 uses less memory,
cases where the FSR uses less memory, and cases where they use the same
amount of memory (accounting for the minimum amount of overhead required
for each).

Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>>
>>> fsr = u""
>>> utf8 = fsr.encode("utf-8")
>>> min_fsr_overhead = sys.getsizeof(fsr)
>>> min_utf8_overhead = sys.getsizeof(utf8)
>>> min_fsr_overhead
49
>>> min_utf8_overhead
33
>>>
>>> fsr = u"\u0001" * 1000
>>> utf8 = fsr.encode("utf-8")
>>> sys.getsizeof(fsr) - min_fsr_overhead
1000
>>> sys.getsizeof(utf8) - min_utf8_overhead
1000
>>>
>>> fsr = u"\u0081" * 1000
>>> utf8 = fsr.encode("utf-8")
>>> sys.getsizeof(fsr) - min_fsr_overhead
1024
>>> sys.getsizeof(utf8) - min_utf8_overhead
2000
>>>
>>> fsr = u"\u0001\u0081" * 1000
>>> utf8 = fsr.encode("utf-8")
>>> sys.getsizeof(fsr) - min_fsr_overhead
2024
>>> sys.getsizeof(utf8) - min_utf8_overhead
3000
>>>
>>> fsr = u"\u0101" * 1000
>>> utf8 = fsr.encode("utf-8")
>>> sys.getsizeof(fsr) - min_fsr_overhead
2025
>>> sys.getsizeof(utf8) - min_utf8_overhead
2000
>>>
>>> fsr = u"\u0101\u0081" * 1000
>>> utf8 = fsr.encode("utf-8")
>>> sys.getsizeof(fsr) - min_fsr_overhead
4025
>>> sys.getsizeof(utf8) - min_utf8_overhead
4000

Indexing a character in UTF-8 is O(N) - you have to traverse the string
up to the character being indexed. Indexing a character in the FSR is O(1).
In all cases the FSR has better performance characteristics for indexing
and slicing than UTF-8.
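
The traversal can be sketched as follows. The function utf8_char_at is a
made-up helper for illustration, not part of any library: it has to walk
the bytes from the start, counting the bytes that begin a character, before
it can return the one requested.

```python
def utf8_char_at(data, index):
    """Return the character at position `index` in UTF-8 encoded bytes."""
    if index < 0:
        raise IndexError(index)
    seen = -1      # index of the most recent character start found
    start = None   # byte offset where the requested character begins
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:   # not a continuation byte: a character starts
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(index)
    return data[start:].decode("utf-8")   # requested character was the last


text = "a\u00e9\u20ac\U0001F600"
data = text.encode("utf-8")
for i in range(len(text)):
    # str indexing is a direct lookup; the UTF-8 version scans from byte 0.
    assert utf8_char_at(data, i) == text[i]
```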

There are tradeoffs with both UTF-8 and the FSR. The Python developers
decided the priorities for Unicode handling in Python were:

1. Correctness
  a. all code points must be handled correctly;
  b. it must not be possible to obtain part of a code point (e.g. the
     first byte only of a multi-byte code point);

2. No change in the Big O characteristics of string operations e.g.
indexing must remain O(1);

3. Reduced memory use in most cases.

It is impossible for UTF-8 to meet both criteria 1b and 2 without
additional auxiliary data (which uses more memory and increases complexity
of the implementation). The FSR meets all 3 criteria.
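
The conflict between 1b and 2 can be shown directly. In this sketch, O(1)
byte indexing into UTF-8 lands inside a code point, and the workaround is
a table of character start offsets. The offset table is just one possible
form of auxiliary data, chosen for illustration; the names are invented.

```python
text = "na\u00efve"            # the third character needs two bytes in UTF-8
data = text.encode("utf-8")

# Constant-time indexing on the raw bytes violates 1b: this slice is only
# the first byte of a two-byte sequence and cannot be decoded.
fragment = data[2:3]
try:
    fragment.decode("utf-8")
    got_partial = False
except UnicodeDecodeError:
    got_partial = True

# Auxiliary data: the byte offset at which every character starts, plus an
# end marker. Building it is O(N) and it costs one entry per character.
offsets = [i for i, b in enumerate(data) if b & 0xC0 != 0x80]
offsets.append(len(data))


def char_at(i):
    """O(1) character lookup, paid for with the offset table."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


print(got_partial, char_at(2) == text[2])
```

Here the table holds one integer per character, which is more memory than
the text itself takes in most encodings.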

Tim Delaney
