

Groups > comp.lang.python > #77479 > unrolled thread

How to turn a string into a list of integers?

Started by: cl@isbd.net
First post: 2014-09-03 13:27 +0100
Last post: 2014-09-03 07:30 -0700
Articles 20 on this page of 35 — 14 participants



Contents

  How to turn a string into a list of integers? cl@isbd.net - 2014-09-03 13:27 +0100
    Re: How to turn a string into a list of integers? Peter Otten <__peter__@web.de> - 2014-09-03 14:52 +0200
      Re: How to turn a string into a list of integers? cl@isbd.net - 2014-09-03 15:48 +0100
        Re: How to turn a string into a list of integers? Joshua Landau <joshua@landau.ws> - 2014-09-04 22:06 +0100
          Re: How to turn a string into a list of integers? cl@isbd.net - 2014-09-05 09:42 +0100
            Re: How to turn a string into a list of integers? Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2014-09-05 19:56 +0200
              Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-06 15:47 +1000
                Re: How to turn a string into a list of integers? Peter Otten <__peter__@web.de> - 2014-09-06 10:22 +0200
                  Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-06 21:17 +1000
                Re: How to turn a string into a list of integers? Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2014-09-06 14:15 +0200
                  Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-07 04:19 +1000
                    Re: How to turn a string into a list of integers? Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2014-09-06 21:28 +0200
                      Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-07 11:47 +1000
                        Re: How to turn a string into a list of integers? MRAB <python@mrabarnett.plus.com> - 2014-09-07 15:52 +0100
                          Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-08 03:02 +1000
                            Re: How to turn a string into a list of integers? Rustom Mody <rustompmody@gmail.com> - 2014-09-07 10:53 -0700
                              Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-08 04:08 +1000
                                Re: How to turn a string into a list of integers? Rustom Mody <rustompmody@gmail.com> - 2014-09-07 11:34 -0700
                                  Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-08 10:14 +1000
                                    Re: How to turn a string into a list of integers? Marko Rauhamaa <marko@pacujo.net> - 2014-09-08 08:44 +0300
                                      Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-08 15:53 +1000
                                      Re: How to turn a string into a list of integers? Terry Reedy <tjreedy@udel.edu> - 2014-09-08 03:41 -0400
                        Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-08 01:04 +1000
                          Re: How to turn a string into a list of integers? Roy Smith <roy@panix.com> - 2014-09-07 11:40 -0400
                            Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-08 04:00 +1000
                            Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-08 10:12 +1000
                Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-06 22:23 +1000
            Re: How to turn a string into a list of integers? Chris “Kwpolska” Warrick <kwpolska@gmail.com> - 2014-09-05 20:25 +0200
            Re: How to turn a string into a list of integers? Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2014-09-05 21:16 +0200
            Re: How to turn a string into a list of integers? Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2014-09-05 22:41 +0200
        Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-05 10:12 +1000
        Re: How to turn a string into a list of integers? Ian Kelly <ian.g.kelly@gmail.com> - 2014-09-04 20:09 -0600
        Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-05 12:15 +1000
          Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-06 14:27 +1000
    Re: How to turn a string into a list of integers? obedrios@gmail.com - 2014-09-03 07:30 -0700



#77479 — How to turn a string into a list of integers?

From: cl@isbd.net
Date: 2014-09-03 13:27 +0100
Subject: How to turn a string into a list of integers?
Message-ID: <h2ejdb-mdk.ln1@chris.zbmc.eu>
I know I can get a list of the characters in a string by simply doing:-

    listOfCharacters = list("This is a string")

... but how do I get a list of integers?

-- 
Chris Green


#77480

From: Peter Otten <__peter__@web.de>
Date: 2014-09-03 14:52 +0200
Message-ID: <mailman.13738.1409748804.18130.python-list@python.org>
In reply to: #77479
cl@isbd.net wrote:

> I know I can get a list of the characters in a string by simply doing:-
> 
>     listOfCharacters = list("This is a string")
> 
> ... but how do I get a list of integers?
> 

>>> [ord(c) for c in "This is a string"]
[84, 104, 105, 115, 32, 105, 115, 32, 97, 32, 115, 116, 114, 105, 110, 103]

There are other ways, but you have to describe the use case and your Python 
version for us to recommend the most appropriate.
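
As an illustration of the version caveat, here is a minimal sketch assuming Python 3 and ASCII-only text. It compares the ord() comprehension with iterating over an encoded bytes object; the variable names are only illustrative.

```python
text = "This is a string"

# Same approach as above: one integer per character (code point).
via_ord = [ord(c) for c in text]

# Python 3 only: iterating over a bytes object yields integers directly.
# encode("ascii") raises UnicodeEncodeError if text is not pure ASCII.
via_bytes = list(text.encode("ascii"))

print(via_ord == via_bytes)  # True for ASCII-only text
```

On Python 2, iterating over a str yields one-character strings instead, so the ord() form is the portable one.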



#77483

From: cl@isbd.net
Date: 2014-09-03 15:48 +0100
Message-ID: <1amjdb-p3n.ln1@chris.zbmc.eu>
In reply to: #77480
Peter Otten <__peter__@web.de> wrote:
> cl@isbd.net wrote:
> 
> > I know I can get a list of the characters in a string by simply doing:-
> > 
> >     listOfCharacters = list("This is a string")
> > 
> > ... but how do I get a list of integers?
> > 
> 
> >>> [ord(c) for c in "This is a string"]
> [84, 104, 105, 115, 32, 105, 115, 32, 97, 32, 115, 116, 114, 105, 110, 103]
> 
> There are other ways, but you have to describe the use case and your Python 
> version for us to recommend the most appropriate.
> 
That looks OK to me.  It's just for outputting a string to the block
write command in python-smbus which expects an integer array.

Thanks.

-- 
Chris Green


#77562

From: Joshua Landau <joshua@landau.ws>
Date: 2014-09-04 22:06 +0100
Message-ID: <mailman.13776.1409864831.18130.python-list@python.org>
In reply to: #77483
On 3 September 2014 15:48,  <cl@isbd.net> wrote:
> Peter Otten <__peter__@web.de> wrote:
>> >>> [ord(c) for c in "This is a string"]
>> [84, 104, 105, 115, 32, 105, 115, 32, 97, 32, 115, 116, 114, 105, 110, 103]
>>
>> There are other ways, but you have to describe the use case and your Python
>> version for us to recommend the most appropriate.
>>
> That looks OK to me.  It's just for outputting a string to the block
> write command in python-smbus which expects an integer array.

Just be careful about Unicode characters.



#77582

From: cl@isbd.net
Date: 2014-09-05 09:42 +0100
Message-ID: <1k9odb-1qs.ln1@chris.zbmc.eu>
In reply to: #77562
Joshua Landau <joshua@landau.ws> wrote:
> On 3 September 2014 15:48,  <cl@isbd.net> wrote:
> > Peter Otten <__peter__@web.de> wrote:
> >> >>> [ord(c) for c in "This is a string"]
> >> [84, 104, 105, 115, 32, 105, 115, 32, 97, 32, 115, 116, 114, 105, 110, 103]
> >>
> >> There are other ways, but you have to describe the use case and your Python
> >> version for us to recommend the most appropriate.
> >>
> > That looks OK to me.  It's just for outputting a string to the block
> > write command in python-smbus which expects an integer array.
> 
> Just be careful about Unicode characters.

I have to avoid them completely because I'm sending the string to a
character LCD with a limited 8-bit only character set.
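
One way to enforce that restriction, sketched under assumptions: Python 3, and an LCD whose character table matches Latin-1 for the characters in use (many character LCDs have their own table, so the datasheet decides). to_lcd_codes is a hypothetical helper, not part of python-smbus.

```python
def to_lcd_codes(text, encoding="latin-1"):
    """Return a list of ints in range(256), or raise if a character does not fit."""
    # encode() raises UnicodeEncodeError for characters outside the 8-bit set.
    return list(text.encode(encoding))


print(to_lcd_codes("Temp: 21\u00b0C"))  # degree sign fits in Latin-1

try:
    to_lcd_codes("price: 5\u20ac")  # the euro sign is not in Latin-1
except UnicodeEncodeError as exc:
    print("cannot send to LCD:", exc)
```

Failing loudly on an unsupported character is a design choice; passing errors="replace" to encode() would substitute "?" instead.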

-- 
Chris Green


#77603

From: Kurt Mueller <kurt.alfred.mueller@gmail.com>
Date: 2014-09-05 19:56 +0200
Message-ID: <mailman.13801.1409939785.18130.python-list@python.org>
In reply to: #77582
On 05.09.2014 at 10:42, cl@isbd.net wrote:

> Joshua Landau <joshua@landau.ws> wrote:
>> On 3 September 2014 15:48,  <cl@isbd.net> wrote:
>>> Peter Otten <__peter__@web.de> wrote:
>>>> >>> [ord(c) for c in "This is a string"]
>>>> [84, 104, 105, 115, 32, 105, 115, 32, 97, 32, 115, 116, 114, 105, 110, 103]
>>>> 
>>>> There are other ways, but you have to describe the use case and your Python
>>>> version for us to recommend the most appropriate.
>>>> 
>>> That looks OK to me.  It's just for outputting a string to the block
>>> write command in python-smbus which expects an integer array.
>> 
>> Just be careful about Unicode characters.
> 
> I have to avoid them completely because I'm sending the string to a
> character LCD with a limited 8-bit only character set.


Could someone please explain the following behavior to me:
Python 2.7.7, MacOS 10.9 Mavericks

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> [ord(c) for c in 'AÄ']
[65, 195, 132]
>>> [ord(c) for c in u'AÄ']
[65, 196]

My obviously wrong understanding:
'AÄ' in 'ascii' are two characters
     one with ord A=65 and
     one with ord Ä=196 ISO8859-1 <depends on code table>
     --> why [65, 195, 132]
u'AÄ' is a Unicode string
     --> why [65, 196]

It is just the other way round from what I would expect.



Thank you
-- 
Kurt Mueller, kurt.alfred.mueller@gmail.com



#77636

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2014-09-06 15:47 +1000
Message-ID: <540aa002$0$29968$c3e8da3$5496439d@news.astraweb.com>
In reply to: #77603
Kurt Mueller wrote:

> Could someone please explain the following behavior to me:
> Python 2.7.7, MacOS 10.9 Mavericks
> 
> >>> import sys
> >>> sys.getdefaultencoding()
> 'ascii'

That's technically known as a "lie", since if it were *really* ASCII it
would refuse to deal with characters with the high-bit set. But it doesn't,
it treats them in an unpredictable and implementation-dependent manner.

> >>> [ord(c) for c in 'AÄ']
> [65, 195, 132]

In this case, it looks like your terminal is using UTF-8, so the character Ä
is represented in memory by bytes 195, 132:

py> u'Ä'.encode('utf-8')
'\xc3\x84'
py> for c in u'Ä'.encode('utf-8'):
...     print ord(c)
...
195
132

If your terminal was set to use a different encoding, you probably would
have got different results. When you type whatever key combination you used
to get Ä, your terminal receives the bytes 195, 132, and displays Ä. But
when Python processes those bytes, it's not expecting arbitrary Unicode
characters, it's expecting ASCII-ish bytes, and so treats it as two bytes
rather than a single character:

py> 'AÄ'
'A\xc3\x84'

That's not *really* ASCII, because ASCII doesn't include anything above 127,
but we can pretend that "ASCII plus arbitrary bytes between 128 and 255" is
just called ASCII. The important thing here is that although your terminal
is interpreting those two bytes \xc3\x84 (decimal 195, 132) as the
character Ä, it isn't anything of the sort. It's just two arbitrary bytes.

> >>> [ord(c) for c in u'AÄ']
> [65, 196]

Here, you have a proper Unicode string, so Python is expecting to receive
arbitrary Unicode characters and can treat the two bytes 195, 132 as Ä, and
that character has ordinal value 196:

py> ord(u"Ä")
196



> My obviously wrong understanding:
> 'AÄ' in 'ascii' are two characters
>      one with ord A=65 and
>      one with ord Ä=196 ISO8859-1 <depends on code table>

As soon as you start talking about code tables, *it isn't ASCII anymore*.
(Technically, ASCII *is* a code table, but it's one that only covers 128
different characters.)

When you type AÄ on your keyboard, or paste them, or however they were
entered, the *actual bytes* the terminal receives will vary, but regardless
of how they vary, the terminal *almost certainly* will interpret the first
byte (or possibly more than one byte, who knows?) as the ASCII character A.

(Most, but not all, code pages agree that byte 65 is A, 66 is B, and so on.)

The second (third? fifth?) byte, and possibly subsequent bytes, will
*probably* be displayed by the terminal as Ä, but Python only sees the raw
bytes. The important thing here is that unless you have some bizarre and
broken configuration, Python can correctly interpret the A as A, but what
you get for the Ä depends on the interaction of keyboard, OS, terminal and
the phase of the moon.

>      --> why [65, 195, 132]

Since Python is expecting to interpret those bytes as an ASCII-ish byte
string, it grabs the raw bytes and ends up (in your case) with 65, 195,
132, or 'A\xc3\x84', even though your terminal displays it as AÄ.

This does not happen with Unicode strings.

> u'AÄ' is a Unicode string
>      --> why [65, 196]

In this case, Python knows that you are dealing with a Unicode string, and Ä
is a valid character in Unicode. Python deals with the internal details of
converting from whatever-damn-bytes your terminal sends it, and ends up
with a string of characters A followed by Ä.
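
The same distinction can be reproduced without depending on a terminal. This is a minimal sketch assuming Python 3, where str is always Unicode and the byte view has to be requested explicitly with encode():

```python
text = "A\u00c4"  # "AÄ"

# A str is a sequence of code points: Ä is the single code point 196 (U+00C4).
code_points = [ord(c) for c in text]

# Encoded as UTF-8, Ä becomes the two bytes 195, 132 (0xC3 0x84).
utf8_bytes = list(text.encode("utf-8"))

# Encoded as Latin-1, Ä is the single byte 196.
latin1_bytes = list(text.encode("latin-1"))

print(code_points)   # [65, 196]
print(utf8_bytes)    # [65, 195, 132]
print(latin1_bytes)  # [65, 196]
```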

If you could peer under the hood, and see what implementation Python uses to
store that string, you would see something version dependent. In Python
2.7, you would see an object more or less something vaguely like this:

[object header containing various fields]
[length = 2]
[array of bytes = 0x0041 0x00C4]


That's for a so-called "narrow build" of Python. If you have a "wide build",
it will be something like this:

[object header containing various fields]
[length = 2]
[array of bytes = 0x00000041 0x000000C4]

In Python 3.3, "narrow builds" and "wide builds" are gone, and you'll have
something conceptually like this:

[object header containing various fields]
[length = 2]
[tag = one byte per character]
[array of bytes = 0x41 0xC4]

Some other implementations of Python could use UTF-8 internally:

[object header containing various fields]
[length = 2]
[array of bytes = 0x41 0xC3 0x84]


or even something more complex. But the important thing is, regardless of
the internal implementation, Python guarantees that a Unicode string is
treated as a fixed array of code points. Each code point has a value
between 0 and, not 127, not 255, not 65535, but 1114111.
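
That upper limit can be checked from Python itself. A small sketch, assuming Python 3.3 or later (on a Python 2 narrow build sys.maxunicode would be 65535 and chr would be spelled unichr):

```python
import sys

# On Python 3.3 and later this is always 1114111 (0x10FFFF).
print(sys.maxunicode)

# chr() accepts the whole range of code points...
last = chr(0x10FFFF)
print(len(last), ord(last))

# ...and rejects anything above it.
try:
    chr(0x110000)
except ValueError as exc:
    print("rejected:", exc)
```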


-- 
Steven



#77644

From: Peter Otten <__peter__@web.de>
Date: 2014-09-06 10:22 +0200
Message-ID: <mailman.13826.1409991776.18130.python-list@python.org>
In reply to: #77636
Steven D'Aprano wrote:

>> >>> import sys
>> >>> sys.getdefaultencoding()
>> 'ascii'
> 
> That's technically known as a "lie", since if it were *really* ASCII it
> would refuse to deal with characters with the high-bit set. But it
> doesn't, it treats them in an unpredictable and implementation-dependent
> manner.

It's not a lie, it just doesn't control the unicode-to-bytes conversion when 
printing:

$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> print u"äöü"
äöü
>>> str(u"äöü")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding("latin1")
>>> print u"äöü"
äöü
>>> str(u"äöü")
'\xe4\xf6\xfc'
>>> sys.setdefaultencoding("utf-8")
>>> print u"äöü"
äöü
>>> str(u"äöü")
'\xc3\xa4\xc3\xb6\xc3\xbc'

You can enforce ascii-only printing:

$ LANG=C python
Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print unichr(228)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)

To find out the encoding that is used:

$ python -c 'import locale; print locale.getpreferredencoding()'
UTF-8
$ LANG=C python -c 'import locale; print locale.getpreferredencoding()'
ANSI_X3.4-1968

"""
Help on function getpreferredencoding in module locale:

getpreferredencoding(do_setlocale=True)
    Return the charset that the user is likely using,
    according to the system configuration.
"""



#77649

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2014-09-06 21:17 +1000
Message-ID: <540aed58$0$29985$c3e8da3$5496439d@news.astraweb.com>
In reply to: #77644
Peter Otten wrote:

> Steven D'Aprano wrote:
> 
>>> >>> import sys
>>> >>> sys.getdefaultencoding()
>>> 'ascii'
>> 
>> That's technically known as a "lie", since if it were *really* ASCII it
>> would refuse to deal with characters with the high-bit set. But it
>> doesn't, it treats them in an unpredictable and implementation-dependent
>> manner.
> 
> It's not a lie, it just doesn't control the unicode-to-bytes conversion
> when printing:

That's not what I'm referring to. I'm referring to this:

py> s
'\xff'


There is no such ASCII character (or code point, to steal terminology from
Unicode). ASCII is a 7-bit encoding, and includes 128 characters, with
ordinal values 0 through 127. Once you accept arbitrary bytes 128 through
255, it's no longer ASCII, it's ASCII plus undefined stuff.

(Historical note: the committee that designed ASCII *explicitly* rejected
making it an 8-bit code. They also considered, but rejected, using a 6-bit
code with a "shift" function.)


-- 
Steven



#77650

From: Kurt Mueller <kurt.alfred.mueller@gmail.com>
Date: 2014-09-06 14:15 +0200
Message-ID: <mailman.13833.1410005730.18130.python-list@python.org>
In reply to: #77636
On 06.09.2014 at 07:47, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> Kurt Mueller wrote:
>> Could someone please explain the following behavior to me:
>> Python 2.7.7, MacOS 10.9 Mavericks

[snip]
Thanks for the detailed explanation. I think I understand a bit better now.


Now the part of the two Python builds is still somewhat unclear to me.

> If you could peer under the hood, and see what implementation Python uses to
> store that string, you would see something version dependent. In Python
> 2.7, you would see an object more or less something vaguely like this:
> 
> [object header containing various fields]
> [length = 2]
> [array of bytes = 0x0041 0x00C4]
> 
> 
> That's for a so-called "narrow build" of Python. If you have a "wide build",
> it will be something like this:
> 
> [object header containing various fields]
> [length = 2]
> [array of bytes = 0x00000041 0x000000C4]
> 
> In Python 3.3, "narrow builds" and "wide builds" are gone, and you'll have
> something conceptually like this:
> 
> [object header containing various fields]
> [length = 2]
> [tag = one byte per character]
> [array of bytes = 0x41 0xC4]
> 
> Some other implementations of Python could use UTF-8 internally:
> 
> [object header containing various fields]
> [length = 2]
> [array of bytes = 0x41 0xC3 0x84]
> 
> 
> or even something more complex. But the important thing is, regardless of
> the internal implementation, Python guarantees that a Unicode string is
> treated as a fixed array of code points. Each code point has a value
> between 0 and, not 127, not 255, not 65535, but 1114111.



In Python 2.7:

As I learned from the ord() manual:
If a unicode argument is given and Python was built with UCS2 Unicode,
(I suppose this is the narrow build in your terms),
then the character’s code point must be in the range [0..65535] inclusive;

I understand: In a UCS2 build each character of a Unicode string uses
16 bits and can represent code points from U-0000..U-FFFF.



From the unichr(i) manual I learn:
The valid range for the argument depends how Python was configured
– it may be either UCS2 [0..0xFFFF] or UCS4 [0..0x10FFFF].

I understand: narrow build is UCS2, wide build is UCS4
- In a UCS2 build each character of a Unicode string uses 16 bits and has 
  code points from U-0000..U-FFFF (0..65535)
- In a UCS4 build each character of a Unicode string uses 32 bits and has 
  code points from U-00000000..U-0010FFFF (0..1114111)


Am I right?
-- 
Kurt Mueller, kurt.alfred.mueller@gmail.com



#77661

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2014-09-07 04:19 +1000
Message-ID: <540b504a$0$29974$c3e8da3$5496439d@news.astraweb.com>
In reply to: #77650
Kurt Mueller wrote:

[...]
> Now the part of the two Python builds is still somewhat unclear to me.
[...]
> In Python 2.7:
> 
> As I learned from the ord() manual:
> If a unicode argument is given and Python was built with UCS2 Unicode,

Where does the manual mention UCS-2? As far as I know, no version of Python
uses that.


> (I suppose this is the narrow build in your terms),

Mostly right, but not quite. "Narrow build" means that Python uses UTF-16,
not UCS-2, although the two are very similar. See below for further
details. But to make it more confusing, *parts* of Python (like the unichr
function) assume UCS-2, and refuse to accept values over 0xFFFF.


> then the character’s code point must be in the range [0..65535] inclusive;

Half-right. Unicode code points are always in the range U+0000 to U+10FFFF,
or in decimal, [0...1114111]. But, Python "narrow builds" don't quite
handle that correctly, and only half-support code points from
[65536...1114111]. The reasons are complicated, but see below.

UCS-2 is an implementation of an early, obsolete version of Unicode which is
limited to just 65536 characters (technically: "code points") instead of
the full range of 1114112 characters supported by Unicode.

UCS-2 is very similar to UTF-16. Both use a 16-bit "code unit" to represent
characters. In UCS-2, each character is represented by precisely 1 code
unit, numbered between 0 and 65535 (0x0000 and 0xFFFF in hex). In UTF-16,
the most common characters (the Basic Multilingual Plane) are likewise
represented by 1 code unit, between 0 and 65535, but there are a range
of "characters" (actually code points) which are reserved for use as
so-called "surrogate pairs". Using hex:

Code points U+0000 to U+D7FF:
    - represent the same character in UCS-2 and UTF-16;

Code points U+D800 to U+DFFF:
    - represent reserved but undefined characters in UCS-2;
    - represent surrogates in UTF-16 (see below);

Code points U+E000 to U+FFFF:
    - represent the same character in UCS-2 and UTF-16;

Code points U+010000 to U+10FFFF:
    - impossible to represent in UCS-2;
    - represented by TWO surrogates in UTF-16.

For example, the Unicode code point U+1D11E (MUSICAL SYMBOL G CLEF) cannot
be represented at all in UCS-2, because it is past U+FFFF. In UTF-16, it
cannot be represented as a single 16-bit code unit, instead it is
represented as two code-units, 0xD834 0xDD1E. That is called a "surrogate
pair".

The problem with Python's narrow builds is that, although characters are
variable width (the most common are 1 code unit, 16 bits, the rest are 2
code units), the Python implementation assumes that all characters are a
fixed 16 bits. So if your string is a single character like U+1D11E,
instead of treating it as a string of length one with ordinal value
0x1D11E, Python will treat it as a string of length *two* with ordinal
values 0xD834 and 0xDD1E.

(In other words, Python narrow builds fail to deal with surrogate pairs
correctly.)
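
The surrogate-pair arithmetic for the G clef example can be sketched as follows, assuming Python 3. to_surrogate_pair is a hypothetical helper written for illustration; the built-in UTF-16 codec is used only as a cross-check.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))  # 0xd834 0xdd1e

# Cross-check against the built-in UTF-16 codec.
print("\U0001D11E".encode("utf-16-be").hex())  # d834dd1e
```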

Although you cannot create that string using unichr, you can create it using
the \U notation:

py> unichr(0x1D11E)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
py> u'\U0001D11E'
u'\U0001d11e'


> I understand: In a UCS2 build each character of a Unicode string uses
> 16 Bits and can represent code points from U-0000..U-FFFF.

That is correct. So UCS-2 can only represent a small subset of Unicode.


> From the unichr(i) manual I learn:
> The valid range for the argument depends how Python was configured
> – it may be either UCS2 [0..0xFFFF] or UCS4 [0..0x10FFFF].
> I understand: narrow build is UCS2, wide build is UCS4

UCS-4 is exactly the same as UTF-32, and wide builds use a fixed 32 bits for
every code point, so that's correct.


> - In a UCS2 build each character of an Unicode string uses 16 Bits and has
>   code points from U-0000..U-FFFF (0..65535)

As I said, it's not strictly correct, Python is actually using UTF-16, but
it's a buggy or incomplete UTF-16, with parts of the system assuming UCS-2.


> - In a UCS4 build each character of an Unicode string uses 32 Bits and has
>   code points from U-00000000..U-0010FFFF (0..1114111)

Correct. Remember that UCS-4 and UTF-32 are exactly the same: every code
point from U+0000 to U+10FFFF is represented by a single 32-bit value. So
our earlier example, U+1D11E (MUSICAL SYMBOL G CLEF) would be represented
as 0x0001D11E in UTF-32 and UCS-4.

Remember, though, these internal representations are (nearly) irrelevant to
Python code. In Python code, you just consider that a Unicode string is an
array of ordinal values from 0x0 to 0x10FFFF, each representing a single
code point U+0000 to U+10FFFF. The only reason I say "nearly" is that
narrow builds don't *quite* work right if the string contains surrogate
pairs.



-- 
Steven



#77664

From: Kurt Mueller <kurt.alfred.mueller@gmail.com>
Date: 2014-09-06 21:28 +0200
Message-ID: <mailman.13842.1410031704.18130.python-list@python.org>
In reply to: #77661
On 06.09.2014 at 20:19, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> Kurt Mueller wrote:
> [...]
>> Now the part of the two Python builds is still somewhat unclear to me.
> [...]
>> In Python 2.7:
>> As I learned from the ord() manual:
>> If a unicode argument is given and Python was built with UCS2 Unicode,
> Where does the manual mention UCS-2? As far as I know, no version of Python
> uses that.

https://docs.python.org/2/library/functions.html?highlight=ord#ord


[snip] very detailed explanation of narrow/wide build, UCS-2/UCS-4, UTF-16/UTF-32



> Remember, though, these internal representations are (nearly) irrelevant to
> Python code. In Python code, you just consider that a Unicode string is an
> array of ordinal values from 0x0 to 0x10FFFF, each representing a single
> code point U+0000 to U+10FFFF. The only reason I say "nearly" is that
> narrow builds don't *quite* work right if the string contains surrogate
> pairs.

So I can interpret your last section:
Processing any Unicode string will work with small and wide
python 2.7 builds and also with python >3.3?
( parts of small build python will not work with values over 0xFFFF )
( strings with surrogate pairs will not work correctly on small build python )



Many thanks for your detailed answer!
-- 
Kurt Mueller, kurt.alfred.mueller@gmail.com



#77666

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2014-09-07 11:47 +1000
Message-ID: <540bb91c$0$29969$c3e8da3$5496439d@news.astraweb.com>
In reply to: #77664
Kurt Mueller wrote:

> Processing any Unicode string will work with small and wide
> python 2.7 builds and also with python >3.3?
> ( parts of small build python will not work with values over 0xFFFF )
> ( strings with surrogate pairs will not work correctly on small build
> python )


If you limit yourself to code points in the Basic Multilingual Plane, U+0000
to U+FFFF, then Python's Unicode handling works fine no matter what version
or implementation is used. Since most people use only the BMP, you may not
notice any problems.

(Of course, there are performance and memory-usage differences from one
version to the next, but the functionality works correctly.)

If you use characters from the supplementary planes ("astral characters"),
then:

* wide builds will behave correctly;
* narrow builds will wrongly treat astral characters as two 
  independent characters, which means functions like len() 
  and string slicing will do the wrong thing;
* Python 3.3 doesn't use narrow and wide builds any more,
  and also behaves correctly with astral characters.


So there are three strategies for correct Unicode support in Python:

* avoid astral characters (and trust your users will also avoid them);

* use a wide build;

* use Python 3.3 or higher.
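
The len() difference can be imitated on a modern interpreter. A sketch assuming Python 3.3 or later; counting UTF-16 code units is used here only to mimic what a narrow build would have reported:

```python
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

# Python 3.3+ counts code points.
print(len(clef))  # 1

# Counting 16-bit code units reproduces the narrow-build answer.
utf16_units = len(clef.encode("utf-16-le")) // 2
print(utf16_units)  # 2
```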


In case you are wondering what Python 3.3 does differently, when it builds a
string, it works out the largest code point in the string. If the largest
code point is no greater than U+00FF, it stores the string in Latin 1 using
8 bits per character; if the largest code point is no greater than U+FFFF,
then it uses UTF-16 (or UCS-2, since with the BMP they are functionally the
same); if the string contains any astral characters, then it uses UTF-32.
So regardless of the string, each character uses a single code unit. Only
the size of the code unit varies.
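
The varying width can be observed indirectly. A rough sketch, assuming CPython 3.3 or later (sys.getsizeof results are implementation-specific, and bytes_per_char is a hypothetical helper):

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate per-character storage by comparing two string sizes."""
    # The fixed header cancels out because both strings use the same width.
    return (sys.getsizeof(ch * (n + 10)) - sys.getsizeof(ch * 10)) // n


for ch in ("a", "\xc4", "\u03be", "\U0001d11e"):
    print("U+%04X" % ord(ch), bytes_per_char(ch))
```

On CPython this should report 1, 1, 2 and 4 bytes per character respectively; other implementations may store strings differently.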



-- 
Steven



#77672

From: MRAB <python@mrabarnett.plus.com>
Date: 2014-09-07 15:52 +0100
Message-ID: <mailman.13849.1410101559.18130.python-list@python.org>
In reply to: #77666
On 2014-09-07 02:47, Steven D'Aprano wrote:
> Kurt Mueller wrote:
>
>> Processing any Unicode string will work with small and wide
>> python 2.7 builds and also with python >3.3?
>> ( parts of small build python will not work with values over 0xFFFF )
>> ( strings with surrogate pairs will not work correctly on small build
>> python )
>
>
> If you limit yourself to code points in the Basic Multilingual Plane, U+0000
> to U+FFFF, then Python's Unicode handling works fine no matter what version
> or implementation is used. Since most people use only the BMP, you may not
> notice any problems.
>
> (Of course, there are performance and memory-usage differences from one
> version to the next, but the functionality works correctly.)
>
> If you use characters from the supplementary planes ("astral characters"),
> then:
>
> * wide builds will behave correctly;
> * narrow builds will wrongly treat astral characters as two
>    independent characters, which means functions like len()
>    and string slicing will do the wrong thing;
> * Python 3.3 doesn't use narrow and wide builds any more,
>    and also behaves correctly with astral characters.
>
>
> So there are three strategies for correct Unicode support in Python:
>
> * avoid astral characters (and trust your users will also avoid them);
>
> * use a wide build;
>
> * use Python 3.3 or higher.
>
>
> In case you are wondering what Python 3.3 does differently, when it builds a
> string, it works out the largest code point in the string. If the largest
> code point is no greater than U+00FF, it stores the string in Latin 1 using
> 8 bits per character; if the largest code point is no greater than U+FFFF,
> then it uses UTF-16 (or UCS-2, since with the BMP they are functionally the
> same); if the string contains any astral characters, then it uses UTF-32.
> So regardless of the string, each character uses a single code unit. Only
> the size of the code unit varies.
>
I don't think you should be saying that it stores the string in Latin-1
or UTF-16 because that might suggest that they are encoded. They aren't.



#77675

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2014-09-08 03:02 +1000
Message-ID: <540c8fc4$0$29973$c3e8da3$5496439d@news.astraweb.com>
In reply to: #77672
MRAB wrote:

> I don't think you should be saying that it stores the string in Latin-1
> or UTF-16 because that might suggest that they are encoded. They aren't.

Of course they are encoded. Memory consists of bytes, not Unicode code
points, which are abstract numbers representing characters (and other
things). You can't store "ξ" (U+03BE) in memory, you can only store a
particular representation of that "ξ" in bytes, and that representation is
called an encoding. Of course you can create whatever representation you
like, or you can use an established encoding rather than re-invent the
wheel. Here are four established encodings which support that code point,
and the bytes that are used:

py> u'ξ'.encode('iso-8859-7')
'\xee'
py> u'ξ'.encode('utf-8')
'\xce\xbe'
py> u'ξ'.encode('utf-16be')
'\x03\xbe'
py> u'ξ'.encode('utf-32be')
'\x00\x00\x03\xbe'
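
The session above is from Python 2. A minimal sketch of the same four encodings, assuming Python 3, where encode() returns a bytes object and the result can be turned into a list of integers directly (the names `xi` and `encoded` are illustrative only):

```python
# Assumes Python 3: str.encode() returns bytes, and iterating bytes yields ints.
xi = "\u03be"  # GREEK SMALL LETTER XI

encoded = {
    name: xi.encode(name)
    for name in ("iso-8859-7", "utf-8", "utf-16-be", "utf-32-be")
}

for name, data in encoded.items():
    # list(data) gives the byte values as integers.
    print(name, data, list(data))

# The abstract code point itself, independent of any encoding.
print(ord(xi))
```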



-- 
Steven



#77676

From: Rustom Mody <rustompmody@gmail.com>
Date: 2014-09-07 10:53 -0700
Message-ID: <8b80fe39-4aea-4a17-a1a6-a44f0b42fb7b@googlegroups.com>
In reply to: #77675
On Sunday, September 7, 2014 10:33:26 PM UTC+5:30, Steven D'Aprano wrote:
> MRAB wrote:

> > I don't think you should be saying that it stores the string in Latin-1
> > or UTF-16 because that might suggest that they are encoded. They aren't.

> Of course they are encoded. Memory consists of bytes, not Unicode code
> points, which are abstract numbers representing characters (and other
> things). You can't store "ξ" (U+03BE) in memory, you can only store a
> particular representation of that "ξ" in bytes, and that representation is
> called an encoding. Of course you can create whatever representation you
> like, or you can use an established encoding rather than re-invent the
> wheel. Here are four established encodings which support that code point,
> and the bytes that are used:

> py> u'ξ'.encode('iso-8859-7')
> '\xee'
> py> u'ξ'.encode('utf-8')
> '\xce\xbe'
> py> u'ξ'.encode('utf-16be')
> '\x03\xbe'
> py> u'ξ'.encode('utf-32be')
> '\x00\x00\x03\xbe'


Dunno about philosophical questions -- especially Unicode :-)
What I can see (Python 3), which I guess is what MRAB was pointing out:

>>> "".encode
<built-in method encode of str object at 0x7f3955da3848>

>>> "".decode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'

>>> b"".decode
<built-in method decode of bytes object at 0x7f39549fda08>

>>> b"".encode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'encode'
>>> 



#77678

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2014-09-08 04:08 +1000
Message-ID: <540c9f19$0$29999$c3e8da3$5496439d@news.astraweb.com>
In reply to: #77676
Rustom Mody wrote:

> On Sunday, September 7, 2014 10:33:26 PM UTC+5:30, Steven D'Aprano wrote:
>> MRAB wrote:
> 
>> > I don't think you should be saying that it stores the string in Latin-1
>> > or UTF-16 because that might suggest that they are encoded. They
>> > aren't.
> 
>> Of course they are encoded. Memory consists of bytes, not Unicode code
>> points, [...]

> Dunno about philosophical questions -- especially Unicode :-)
> What I can see (Python 3), which I guess is what MRAB was pointing out:
> 
>>>> "".encode
> <built-in method encode of str object at 0x7f3955da3848>
> 
>>>> "".decode
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> AttributeError: 'str' object has no attribute 'decode'

What's your point? I'm talking about the implementation of how strings are
stored in memory, not what methods the str class provides.


-- 
Steven



#77679

From: Rustom Mody <rustompmody@gmail.com>
Date: 2014-09-07 11:34 -0700
Message-ID: <c6fd8f40-06aa-4332-8f96-8801b8792f49@googlegroups.com>
In reply to: #77678
On Sunday, September 7, 2014 11:38:41 PM UTC+5:30, Steven D'Aprano wrote:
> Rustom Mody wrote:

> > On Sunday, September 7, 2014 10:33:26 PM UTC+5:30, Steven D'Aprano wrote:
> >> MRAB wrote:
> >> > I don't think you should be saying that it stores the string in Latin-1
> >> > or UTF-16 because that might suggest that they are encoded. They
> >> > aren't.
> >> Of course they are encoded. Memory consists of bytes, not Unicode code
> >> points, [...]

> > Dunno about philosophical questions -- especially Unicode :-)
> > What I can see (Python 3), which I guess is what MRAB was pointing out:
> >>>> "".encode
> >>>> "".decode
> > Traceback (most recent call last):
> > AttributeError: 'str' object has no attribute 'decode'

> What's your point? I'm talking about the implementation of how strings are
> stored in memory, not what methods the str class provides.

The methods that are (un)available reflect which operations are (in)valid
on the type:

Strings

The items of a string object are Unicode code units.  Conversion from
and to other encodings are possible through the string method
encode().

Bytes

A bytes object is an immutable array. The items are 8-bit bytes,
represented by integers in the range 0 <= x < 256. Bytes literals
(like b'abc') and the built-in function bytes() can be used to
construct bytes objects. Also, bytes objects can be decoded to
strings via the decode() method.

From https://docs.python.org/3.1/reference/datamodel.html#the-standard-type-hierarchy



IOW, MRAB's statement, as I interpret it, is that strings should not be thought
of as encoded because they consist of abstract code points. That seems to me
(a Unicode ignoramus!) a reasonable outlook.
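
The str/bytes split described in the docs excerpt above can be checked with a minimal sketch, assuming Python 3 (the names `text` and `data` are illustrative only). It round-trips a string through bytes and shows the two different "lists of integers" one can get from it:

```python
# Assumes Python 3, where str and bytes are distinct types.
text = "\u03be"  # the Greek letter xi used earlier in the thread

# str -> bytes requires naming an encoding; bytes -> str reverses it.
data = text.encode("utf-8")
round_tripped = data.decode("utf-8")

# str offers encode() only, bytes offers decode() only.
print(hasattr(text, "decode"), hasattr(data, "encode"))

# Code points of the string versus byte values of one encoding of it.
code_points = [ord(ch) for ch in text]
byte_values = list(data)
print(code_points, byte_values)
```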



#77691

From: Chris Angelico <rosuav@gmail.com>
Date: 2014-09-08 10:14 +1000
Message-ID: <mailman.13859.1410135272.18130.python-list@python.org>
In reply to: #77679
On Mon, Sep 8, 2014 at 4:34 AM, Rustom Mody <rustompmody@gmail.com> wrote:
> IOW, MRAB's statement, as I interpret it, is that strings should not be thought
> of as encoded because they consist of abstract code points. That seems to me
> (a Unicode ignoramus!) a reasonable outlook.

The original question was regarding storage - how PEP 393 says that
strings will be encoded in memory in any of three ways (Latin-1,
UCS-2/UTF-16, or UCS-4/UTF-32). But even in our world, that is not
what a string *is*, but only what it is made of.
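
The three storage widths mentioned above can be observed indirectly with a rough sketch, assuming CPython 3.3 or later. sys.getsizeof() reports the size of the whole object including its header, so the helper below (`bytes_per_char`, a hypothetical name) compares two lengths of the same repeated character to cancel the header out. Other implementations may give different numbers:

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate storage per character by differencing two string sizes.

    Assumes CPython 3.3+ (PEP 393); this is an implementation detail,
    not something the language guarantees.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


for label, ch in [("ASCII", "a"), ("BMP", "\u03be"), ("astral", "\U0001F600")]:
    print(label, bytes_per_char(ch))
```

Whatever the width, indexing and len() behave the same from the Python side, which fits the point that the storage is only what the string is made of.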

ChrisA



#77693

From: Marko Rauhamaa <marko@pacujo.net>
Date: 2014-09-08 08:44 +0300
Message-ID: <8738c2ekex.fsf@elektro.pacujo.net>
In reply to: #77691
Chris Angelico <rosuav@gmail.com>:

> The original question was regarding storage - how PEP 393 says that
> strings will be encoded in memory in any of three ways (Latin-1,
> UCS-2/UTF-16, or UCS-4/UTF-32). But even in our world, that is not
> what a string *is*, but only what it is made of.

I'm a bit surprised that kind of CPython implementation detail would go
into a PEP. I had thought PEPs codified Python independently of CPython.

But maybe CPython is to Python what England is to the UK: even the
government is having a hard time making a distinction.


Marko
