Groups > comp.lang.python > #77479 > unrolled thread
| Started by | cl@isbd.net |
|---|---|
| First post | 2014-09-03 13:27 +0100 |
| Last post | 2014-09-03 07:30 -0700 |
| Articles | 20 on this page of 35 — 14 participants |
How to turn a string into a list of integers? cl@isbd.net - 2014-09-03 13:27 +0100
Re: How to turn a string into a list of integers? Peter Otten <__peter__@web.de> - 2014-09-03 14:52 +0200
Re: How to turn a string into a list of integers? cl@isbd.net - 2014-09-03 15:48 +0100
Re: How to turn a string into a list of integers? Joshua Landau <joshua@landau.ws> - 2014-09-04 22:06 +0100
Re: How to turn a string into a list of integers? cl@isbd.net - 2014-09-05 09:42 +0100
Re: How to turn a string into a list of integers? Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2014-09-05 19:56 +0200
Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-06 15:47 +1000
Re: How to turn a string into a list of integers? Peter Otten <__peter__@web.de> - 2014-09-06 10:22 +0200
Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-06 21:17 +1000
Re: How to turn a string into a list of integers? Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2014-09-06 14:15 +0200
Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-07 04:19 +1000
Re: How to turn a string into a list of integers? Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2014-09-06 21:28 +0200
Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-07 11:47 +1000
Re: How to turn a string into a list of integers? MRAB <python@mrabarnett.plus.com> - 2014-09-07 15:52 +0100
Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-08 03:02 +1000
Re: How to turn a string into a list of integers? Rustom Mody <rustompmody@gmail.com> - 2014-09-07 10:53 -0700
Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-08 04:08 +1000
Re: How to turn a string into a list of integers? Rustom Mody <rustompmody@gmail.com> - 2014-09-07 11:34 -0700
Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-08 10:14 +1000
Re: How to turn a string into a list of integers? Marko Rauhamaa <marko@pacujo.net> - 2014-09-08 08:44 +0300
Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-08 15:53 +1000
Re: How to turn a string into a list of integers? Terry Reedy <tjreedy@udel.edu> - 2014-09-08 03:41 -0400
Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-08 01:04 +1000
Re: How to turn a string into a list of integers? Roy Smith <roy@panix.com> - 2014-09-07 11:40 -0400
Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-08 04:00 +1000
Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-08 10:12 +1000
Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-06 22:23 +1000
Re: How to turn a string into a list of integers? Chris “Kwpolska” Warrick <kwpolska@gmail.com> - 2014-09-05 20:25 +0200
Re: How to turn a string into a list of integers? Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2014-09-05 21:16 +0200
Re: How to turn a string into a list of integers? Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2014-09-05 22:41 +0200
Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-05 10:12 +1000
Re: How to turn a string into a list of integers? Ian Kelly <ian.g.kelly@gmail.com> - 2014-09-04 20:09 -0600
Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-05 12:15 +1000
Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-06 14:27 +1000
Re: How to turn a string into a list of integers? obedrios@gmail.com - 2014-09-03 07:30 -0700
| From | cl@isbd.net |
|---|---|
| Date | 2014-09-03 13:27 +0100 |
| Subject | How to turn a string into a list of integers? |
| Message-ID | <h2ejdb-mdk.ln1@chris.zbmc.eu> |
I know I can get a list of the characters in a string by simply doing:-
listOfCharacters = list("This is a string")
... but how do I get a list of integers?
--
Chris Green
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2014-09-03 14:52 +0200 |
| Message-ID | <mailman.13738.1409748804.18130.python-list@python.org> |
| In reply to | #77479 |
cl@isbd.net wrote:
> I know I can get a list of the characters in a string by simply doing:-
>
> listOfCharacters = list("This is a string")
>
> ... but how do I get a list of integers?
>
>>> [ord(c) for c in "This is a string"]
[84, 104, 105, 115, 32, 105, 115, 32, 97, 32, 115, 116, 114, 105, 110, 103]
There are other ways, but you have to describe the use case and your Python
version for us to recommend the most appropriate.
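One of the "other ways" can be sketched as follows, assuming Python 3 (the question does not say which version is in use). Encoding the string first gives a bytes object, and iterating that yields integers directly. The variable names are only illustrative.

```python
text = "This is a string"

# One integer per character: the Unicode code point of each.
code_points = [ord(c) for c in text]

# One integer per byte: in Python 3, iterating a bytes object
# yields ints, so list() is enough once the text is encoded.
byte_values = list(text.encode("ascii"))

# For pure-ASCII text the two lists are identical.
print(code_points == byte_values)  # True
```

The two approaches only diverge once the text contains non-ASCII characters, at which point the choice of encoding starts to matter.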
[toc] | [prev] | [next] | [standalone]
| From | cl@isbd.net |
|---|---|
| Date | 2014-09-03 15:48 +0100 |
| Message-ID | <1amjdb-p3n.ln1@chris.zbmc.eu> |
| In reply to | #77480 |
Peter Otten <__peter__@web.de> wrote:
> cl@isbd.net wrote:
>
> > I know I can get a list of the characters in a string by simply doing:-
> >
> > listOfCharacters = list("This is a string")
> >
> > ... but how do I get a list of integers?
> >
>
> >>> [ord(c) for c in "This is a string"]
> [84, 104, 105, 115, 32, 105, 115, 32, 97, 32, 115, 116, 114, 105, 110, 103]
>
> There are other ways, but you have to describe the use case and your Python
> version for us to recommend the most appropriate.
>
That looks OK to me. It's just for outputting a string to the block
write command in python-smbus which expects an integer array.
Thanks.
--
Chris Green
| From | Joshua Landau <joshua@landau.ws> |
|---|---|
| Date | 2014-09-04 22:06 +0100 |
| Message-ID | <mailman.13776.1409864831.18130.python-list@python.org> |
| In reply to | #77483 |
On 3 September 2014 15:48, <cl@isbd.net> wrote:
> Peter Otten <__peter__@web.de> wrote:
>> >>> [ord(c) for c in "This is a string"]
>> [84, 104, 105, 115, 32, 105, 115, 32, 97, 32, 115, 116, 114, 105, 110, 103]
>>
>> There are other ways, but you have to describe the use case and your Python
>> version for us to recommend the most appropriate.
>>
> That looks OK to me. It's just for outputting a string to the block
> write command in python-smbus which expects an integer array.
Just be careful about Unicode characters.
| From | cl@isbd.net |
|---|---|
| Date | 2014-09-05 09:42 +0100 |
| Message-ID | <1k9odb-1qs.ln1@chris.zbmc.eu> |
| In reply to | #77562 |
Joshua Landau <joshua@landau.ws> wrote:
> On 3 September 2014 15:48, <cl@isbd.net> wrote:
> > Peter Otten <__peter__@web.de> wrote:
> >> >>> [ord(c) for c in "This is a string"]
> >> [84, 104, 105, 115, 32, 105, 115, 32, 97, 32, 115, 116, 114, 105, 110, 103]
> >>
> >> There are other ways, but you have to describe the use case and your Python
> >> version for us to recommend the most appropriate.
> >>
> > That looks OK to me. It's just for outputting a string to the block
> > write command in python-smbus which expects an integer array.
>
> Just be careful about Unicode characters.
I have to avoid them completely because I'm sending the string to a
character LCD with a limited 8-bit only character set.
--
Chris Green
·
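For an 8-bit-only device, one way to avoid stray characters is to encode explicitly and let Python reject anything that does not fit. This is a minimal Python 3 sketch; `to_lcd_bytes` is a hypothetical helper, and Latin-1 is only an assumed stand-in for the LCD's real character set, which the thread does not specify. The call that hands the list to python-smbus is deliberately left out.

```python
def to_lcd_bytes(text, encoding="latin-1"):
    """Return a list of ints in range(256) suitable for an 8-bit device.

    The encoding is an assumption; a real LCD may need its own mapping.
    """
    # encode() raises UnicodeEncodeError for characters outside the
    # chosen 8-bit character set, rather than passing them through.
    return list(text.encode(encoding))


print(to_lcd_bytes("Hi \u00c4"))  # [72, 105, 32, 196]

try:
    to_lcd_bytes("\u03be")  # Greek xi is not in Latin-1
except UnicodeEncodeError as exc:
    print("cannot send:", exc.reason)
```

Failing loudly here seems preferable to sending multi-byte sequences that the display would render as garbage.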
| From | Kurt Mueller <kurt.alfred.mueller@gmail.com> |
|---|---|
| Date | 2014-09-05 19:56 +0200 |
| Message-ID | <mailman.13801.1409939785.18130.python-list@python.org> |
| In reply to | #77582 |
On 05.09.2014 at 10:42, cl@isbd.net wrote:
> Joshua Landau <joshua@landau.ws> wrote:
>> On 3 September 2014 15:48, <cl@isbd.net> wrote:
>>> Peter Otten <__peter__@web.de> wrote:
>>>> >>> [ord(c) for c in "This is a string"]
>>>> [84, 104, 105, 115, 32, 105, 115, 32, 97, 32, 115, 116, 114, 105, 110, 103]
>>>>
>>>> There are other ways, but you have to describe the use case and your Python
>>>> version for us to recommend the most appropriate.
>>>>
>>> That looks OK to me. It's just for outputting a string to the block
>>> write command in python-smbus which expects an integer array.
>>
>> Just be careful about Unicode characters.
>
> I have to avoid them completely because I'm sending the string to a
> character LCD with a limited 8-bit only character set.
Could someone please explain the following behavior to me:
Python 2.7.7, MacOS 10.9 Mavericks
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> [ord(c) for c in 'AÄ']
[65, 195, 132]
>>> [ord(c) for c in u'AÄ']
[65, 196]
My obviously wrong understanding:
'AÄ' in 'ascii' are two characters
one with ord A=65 and
one with ord Ä=196 ISO8859-1 <depends on code table>
--> why [65, 195, 132]
u'AÄ' is a Unicode string
--> why [65, 196]
It is just the other way round from what I would expect.
Thank you
--
Kurt Mueller, kurt.alfred.mueller@gmail.com
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-09-06 15:47 +1000 |
| Message-ID | <540aa002$0$29968$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #77603 |
Kurt Mueller wrote:
> Could someone please explain the following behavior to me:
> Python 2.7.7, MacOS 10.9 Mavericks
>
> >>> import sys
> >>> sys.getdefaultencoding()
> 'ascii'
That's technically known as a "lie", since if it were *really* ASCII it
would refuse to deal with characters with the high-bit set. But it doesn't,
it treats them in an unpredictable and implementation-dependent manner.
> >>> [ord(c) for c in 'AÄ']
> [65, 195, 132]
In this case, it looks like your terminal is using UTF-8, so the character Ä
is represented in memory by bytes 195, 132:
py> u'Ä'.encode('utf-8')
'\xc3\x84'
py> for c in u'Ä'.encode('utf-8'):
... print ord(c)
...
195
132
If your terminal was set to use a different encoding, you probably would
have got different results. When you type whatever key combination you used
to get Ä, your terminal receives the bytes 195, 132, and displays Ä. But
when Python processes those bytes, it's not expecting arbitrary Unicode
characters, it's expecting ASCII-ish bytes, and so treats it as two bytes
rather than a single character:
py> 'AÄ'
'A\xc3\x84'
That's not *really* ASCII, because ASCII doesn't include anything above 127,
but we can pretend that "ASCII plus arbitrary bytes between 128 and 255" is
just called ASCII. The important thing here is that although your terminal
is interpreting those two bytes \xc3\x84 (decimal 195, 132) as the
character Ä, it isn't anything of the sort. It's just two arbitrary bytes.
> >>> [ord(c) for c in u'AÄ']
> [65, 196]
Here, you have a proper Unicode string, so Python is expecting to receive
arbitrary Unicode characters and can treat the two bytes 195, 132 as Ä, and
that character has ordinal value 196:
py> ord(u"Ä")
196
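The same distinction can be reproduced in Python 3, where bytes and text are separate types. This is a sketch rather than a transcript of the original session, and it assumes a UTF-8 terminal was what produced the bytes 195, 132 above.

```python
text = "A\u00c4"  # "AÄ", written with an escape to avoid source-encoding issues

# Code points: what Python 2's u'AÄ' held.
print([ord(c) for c in text])        # [65, 196]

# UTF-8 bytes: what Python 2's plain 'AÄ' held on a UTF-8 terminal.
print(list(text.encode("utf-8")))    # [65, 195, 132]

# A terminal using Latin-1 would have delivered different bytes.
print(list(text.encode("latin-1")))  # [65, 196]
```

That Latin-1 happens to give 196 for Ä is a coincidence of that encoding matching the first 256 code points, not evidence that bytes and code points are the same thing.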
> My obviously wrong understanding:
> 'AÄ' in 'ascii' are two characters
> one with ord A=65 and
> one with ord Ä=196 ISO8859-1 <depends on code table>
As soon as you start talking about code tables, *it isn't ASCII anymore*.
(Technically, ASCII *is* a code table, but it's one that only covers 128
different characters.)
When you type AÄ on your keyboard, or paste them, or however they were
entered, the *actual bytes* the terminal receives will vary, but regardless
of how they vary, the terminal *almost certainly* will interpret the first
byte (or possibly more than one byte, who knows?) as the ASCII character A.
(Most, but not all, code pages agree that byte 65 is A, 66 is B, and so on.)
The second (third? fifth?) byte, and possibly subsequent bytes, will
*probably* be displayed by the terminal as Ä, but Python only sees the raw
bytes. The important thing here is that unless you have some bizarre and
broken configuration, Python can correctly interpret the A as A, but what
you get for the Ä depends on the interaction of keyboard, OS, terminal and
the phase of the moon.
> --> why [65, 195, 132]
Since Python is expecting to interpret those bytes as an ASCII-ish byte
string, it grabs the raw bytes and ends up (in your case) with 65, 195,
132, or 'A\xc3\x84', even though your terminal displays it as AÄ.
This does not happen with Unicode strings.
> u'AÄ' is a Unicode string
> --> why [65, 196]
In this case, Python knows that you are dealing with a Unicode string, and Ä
is a valid character in Unicode. Python deals with the internal details of
converting from whatever-damn-bytes your terminal sends it, and ends up
with a string of characters A followed by Ä.
If you could peer under the hood, and see what implementation Python uses to
store that string, you would see something version dependent. In Python
2.7, you would see an object more or less something vaguely like this:
[object header containing various fields]
[length = 2]
[array of bytes = 0x0041 0x00C4]
That's for a so-called "narrow build" of Python. If you have a "wide build",
it will be something like this:
[object header containing various fields]
[length = 2]
[array of bytes = 0x00000041 0x000000C4]
In Python 3.3, "narrow builds" and "wide builds" are gone, and you'll have
something conceptually like this:
[object header containing various fields]
[length = 2]
[tag = one byte per character]
[array of bytes = 0x41 0xC4]
Some other implementations of Python could use UTF-8 internally:
[object header containing various fields]
[length = 2]
[array of bytes = 0x41 0xC3 0x84]
or even something more complex. But the important thing is, regardless of
the internal implementation, Python guarantees that a Unicode string is
treated as a fixed array of code points. Each code point has a value
between 0 and, not 127, not 255, not 65535, but 1114111.
--
Steven
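The "array of code points" view can be checked from Python code without looking at the internal layout. This is a minimal sketch assuming Python 3.3 or later, where `sys.maxunicode` is always 1114111.

```python
import sys

s = "A\u00c4"  # "AÄ"

# Length is counted in code points, not bytes.
print(len(s))                    # 2
print([hex(ord(c)) for c in s])  # ['0x41', '0xc4']

# The upper bound on a code point.
print(sys.maxunicode)            # 1114111
print(hex(sys.maxunicode))       # 0x10ffff

# chr() accepts the whole range and rejects anything beyond it.
print(len(chr(0x10FFFF)))        # 1
try:
    chr(0x110000)
except ValueError as exc:
    print("rejected:", exc)
```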
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2014-09-06 10:22 +0200 |
| Message-ID | <mailman.13826.1409991776.18130.python-list@python.org> |
| In reply to | #77636 |
Steven D'Aprano wrote:
>> >>> import sys
>> >>> sys.getdefaultencoding()
>> 'ascii'
>
> That's technically known as a "lie", since if it were *really* ASCII it
> would refuse to deal with characters with the high-bit set. But it
> doesn't, it treats them in an unpredictable and implementation-dependent
> manner.
It's not a lie, it just doesn't control the unicode-to-bytes conversion when
printing:
$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> print u"äöü"
äöü
>>> str(u"äöü")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2:
ordinal not in range(128)
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding("latin1")
>>> print u"äöü"
äöü
>>> str(u"äöü")
'\xe4\xf6\xfc'
>>> sys.setdefaultencoding("utf-8")
>>> print u"äöü"
äöü
>>> str(u"äöü")
'\xc3\xa4\xc3\xb6\xc3\xbc'
You can enforce ascii-only printing:
$ LANG=C python
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print unichr(228)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position
0: ordinal not in range(128)
To find out the encoding that is used:
$ python -c 'import locale; print locale.getpreferredencoding()'
UTF-8
$ LANG=C python -c 'import locale; print locale.getpreferredencoding()'
ANSI_X3.4-1968
"""
Help on function getpreferredencoding in module locale:
getpreferredencoding(do_setlocale=True)
Return the charset that the user is likely using,
according to the system configuration.
"""
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-09-06 21:17 +1000 |
| Message-ID | <540aed58$0$29985$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #77644 |
Peter Otten wrote:
> Steven D'Aprano wrote:
>
>>> >>> import sys
>>> >>> sys.getdefaultencoding()
>>> 'ascii'
>>
>> That's technically known as a "lie", since if it were *really* ASCII it
>> would refuse to deal with characters with the high-bit set. But it
>> doesn't, it treats them in an unpredictable and implementation-dependent
>> manner.
>
> It's not a lie, it just doesn't control the unicode-to-bytes conversion
> when printing:
That's not what I'm referring to. I'm referring to this:
py> s
'\xff'
There is no such ASCII character (or code point, to steal terminology from
Unicode). ASCII is a 7-bit encoding, and includes 128 characters, with
ordinal values 0 through 127. Once you accept arbitrary bytes 128 through
255, it's no longer ASCII, it's ASCII plus undefined stuff.
(Historical note: the committee that designed ASCII *explicitly* rejected
making it an 8-bit code. They also considered, but rejected, using a 6-bit
code with a "shift" function.)
--
Steven
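In Python 3 the strict reading of ASCII is enforced when bytes are decoded, which can be sketched like this (the variable name `data` is just for illustration):

```python
data = b"\xff"

# Every ASCII byte value is below 128; this one is not.
print(all(b < 128 for b in data))  # False

# The ascii codec refuses the byte instead of passing it through.
try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    print("not ASCII:", exc.reason)

# Latin-1 defines all 256 byte values, so the same byte decodes there.
print(ord(data.decode("latin-1")))  # 255
```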
| From | Kurt Mueller <kurt.alfred.mueller@gmail.com> |
|---|---|
| Date | 2014-09-06 14:15 +0200 |
| Message-ID | <mailman.13833.1410005730.18130.python-list@python.org> |
| In reply to | #77636 |
On 06.09.2014 at 07:47, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> Kurt Mueller wrote:
>> Could someone please explain the following behavior to me:
>> Python 2.7.7, MacOS 10.9 Mavericks
[snip]
Thanks for the detailed explanation. I think I understand a bit better now.
Now the part of the two Python builds is still somewhat unclear to me.
> If you could peer under the hood, and see what implementation Python uses to
> store that string, you would see something version dependent. In Python
> 2.7, you would see an object more or less something vaguely like this:
>
> [object header containing various fields]
> [length = 2]
> [array of bytes = 0x0041 0x00C4]
>
>
> That's for a so-called "narrow build" of Python. If you have a "wide build",
> it will something like this:
>
> [object header containing various fields]
> [length = 2]
> [array of bytes = 0x00000041 0x000000C4]
>
> In Python 3.3, "narrow builds" and "wide builds" are gone, and you'll have
> something conceptually like this:
>
> [object header containing various fields]
> [length = 2]
> [tag = one byte per character]
> [array of bytes = 0x41 0xC4]
>
> Some other implementations of Python could use UTF-8 internally:
>
> [object header containing various fields]
> [length = 2]
> [array of bytes = 0x41 0xC3 0x84]
>
>
> or even something more complex. But the important thing is, regardless of
> the internal implementation, Python guarantees that a Unicode string is
> treated as a fixed array of code points. Each code point has a value
> between 0 and, not 127, not 255, not 65535, but 1114111.
In Python 2.7:
As I learned from the ord() manual:
If a unicode argument is given and Python was built with UCS2 Unicode,
(I suppose this is the narrow build in your terms),
then the character’s code point must be in the range [0..65535] inclusive;
I understand: In a UCS2 build each character of a Unicode string uses
16 Bits and can represent code points from U-0000..U-FFFF.
From the unichr(i) manual I learn:
The valid range for the argument depends how Python was configured
– it may be either UCS2 [0..0xFFFF] or UCS4 [0..0x10FFFF].
I understand: narrow build is UCS2, wide build is UCS4
- In a UCS2 build each character of a Unicode string uses 16 Bits and has
code points from U-0000..U-FFFF (0..65535)
- In a UCS4 build each character of a Unicode string uses 32 Bits and has
code points from U-00000000..U-0010FFFF (0..1114111)
Am I right?
--
Kurt Mueller, kurt.alfred.mueller@gmail.com
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-09-07 04:19 +1000 |
| Message-ID | <540b504a$0$29974$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #77650 |
Kurt Mueller wrote:
[...]
> Now the part of the two Python builds is still somewhat unclear to me.
[...]
> In Python 2.7:
>
> As I learned from the ord() manual:
> If a unicode argument is given and Python was built with UCS2 Unicode,
Where does the manual mention UCS-2? As far as I know, no version of Python
uses that.
> (I suppose this is the narrow build in your terms),
Mostly right, but not quite. "Narrow build" means that Python uses UTF-16,
not UCS-2, although the two are very similar. See below for further
details. But to make it more confusing, *parts* of Python (like the unichr
function) assume UCS-2, and refuse to accept values over 0xFFFF.
> then the character’s code point must be in the range [0..65535] inclusive;
Half-right. Unicode code points are always in the range U+0000 to U+10FFFF,
or in decimal, [0...1114111]. But, Python "narrow builds" don't quite
handle that correctly, and only half-support code points from
[65536...1114111]. The reasons are complicated, but see below.
UCS-2 is an implementation of an early, obsolete version of Unicode which is
limited to just 65536 characters (technically: "code points") instead of
the full range of 1114112 characters supported by Unicode.
UCS-2 is very similar to UTF-16. Both use a 16-bit "code unit" to represent
characters. In UCS-2, each character is represented by precisely 1 code
unit, numbered between 0 and 65535 (0x0000 and 0xFFFF in hex). In UTF-16,
the most common characters (the Basic Multilingual Plane) are likewise
represented by 1 code unit, between 0 and 65535, but there are a range
of "characters" (actually code points) which are reserved for use as
so-called "surrogate pairs". Using hex:
Code points U+0000 to U+D7FF:
- represent the same character in UCS-2 and UTF-16;
Code points U+D800 to U+DFFF:
- represent reserved but undefined characters in UCS-2;
- represent surrogates in UTF-16 (see below);
Code points U+E000 to U+FFFF:
- represent the same character in UCS-2 and UTF-16;
Code points U+010000 to U+10FFFF:
- impossible to represent in UCS-2;
- represented by TWO surrogates in UTF-16.
For example, the Unicode code point U+1D11E (MUSICAL SYMBOL G CLEF) cannot
be represented at all in UCS-2, because it is past U+FFFF. In UTF-16, it
cannot be represented as a single 16-bit code unit, instead it is
represented as two code-units, 0xD834 0xDD1E. That is called a "surrogate
pair".
The problem with Python's narrow builds is that, although characters are
variable width (the most common are 1 code unit, 16 bits, the rest are 2
code units), the Python implementation assumes that all characters are a
fixed 16 bits. So if your string is a single character like U+1D11E,
instead of treating it as a string of length one with ordinal value
0x1D11E, Python will treat it as a string of length *two* with ordinal
values 0xD834 and 0xDD1E.
(In other words, Python narrow builds fail to deal with surrogate pairs
correctly.)
Although you cannot create that string using unichr, you can create it using
the \U notation:
py> unichr(0x1D11E)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
py> u'\U0001D11E'
u'\U0001d11e'
> I understand: In a UCS2 build each character of a Unicode string uses
> 16 Bits and can represent code points from U-0000..U-FFFF.
That is correct. So UCS-2 can only represent a small subset of Unicode.
> From the unichr(i) manual I learn:
> The valid range for the argument depends how Python was configured
> – it may be either UCS2 [0..0xFFFF] or UCS4 [0..0x10FFFF].
> I understand: narrow build is UCS2, wide build is UCS4
UCS-4 is exactly the same as UTF-32, and wide builds use a fixed 32 bits for
every code point, so that's correct.
> - In a UCS2 build each character of an Unicode string uses 16 Bits and has
> code points from U-0000..U-FFFF (0..65535)
As I said, it's not strictly correct, Python is actually using UTF-16, but
it's a buggy or incomplete UTF-16, with parts of the system assuming UCS-2.
> - In a UCS4 build each character of an Unicode string uses 32 Bits and has
> code points from U-00000000..U-0010FFFF (0..1114111)
Correct. Remember that UCS-4 and UTF-32 are exactly the same: every code
point from U+0000 to U+10FFFF is represented by a single 32-bit value. So
our earlier example, U+1D11E (MUSICAL SYMBOL G CLEF) would be represented
as 0x0001D11E in UTF-32 and UCS-4.
Remember, though, these internal representations are (nearly) irrelevant to
Python code. In Python code, you just consider that a Unicode string is an
array of ordinal values from 0x0 to 0x10FFFF, each representing a single
code point U+0000 to U+10FFFF. The only reason I say "nearly" is that
narrow builds don't *quite* work right if the string contains surrogate
pairs.
--
Steven
| From | Kurt Mueller <kurt.alfred.mueller@gmail.com> |
|---|---|
| Date | 2014-09-06 21:28 +0200 |
| Message-ID | <mailman.13842.1410031704.18130.python-list@python.org> |
| In reply to | #77661 |
On 06.09.2014 at 20:19, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> Kurt Mueller wrote:
> [...]
>> Now the part of the two Python builds is still somewhat unclear to me.
> [...]
>> In Python 2.7:
>> As I learned from the ord() manual:
>> If a unicode argument is given and Python was built with UCS2 Unicode,
> Where does the manual mention UCS-2? As far as I know, no version of Python
> uses that.
https://docs.python.org/2/library/functions.html?highlight=ord#ord
[snip] very detailed explanation of narrow/wide build, UCS-2/UCS-4, UTF-16/UTF-32
> Remember, though, these internal representations are (nearly) irrelevant to
> Python code. In Python code, you just consider that a Unicode string is an
> array of ordinal values from 0x0 to 0x10FFFF, each representing a single
> code point U+0000 to U+10FFFF. The only reason I say "nearly" is that
> narrow builds don't *quite* work right if the string contains surrogate
> pairs.
So I can interpret your last section:
Processing any Unicode string will work with small and wide
python 2.7 builds and also with python >3.3?
( parts of small build python will not work with values over 0xFFFF )
( strings with surrogate pairs will not work correctly on small build python )
Many thanks for your detailed answer!
--
Kurt Mueller, kurt.alfred.mueller@gmail.com
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-09-07 11:47 +1000 |
| Message-ID | <540bb91c$0$29969$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #77664 |
Kurt Mueller wrote:
> Processing any Unicode string will work with small and wide
> python 2.7 builds and also with python >3.3?
> ( parts of small build python will not work with values over 0xFFFF )
> ( strings with surrogate pairs will not work correctly on small build
> python )
If you limit yourself to code points in the Basic Multilingual Plane, U+0000
to U+FFFF, then Python's Unicode handling works fine no matter what version
or implementation is used. Since most people use only the BMP, you may not
notice any problems.
(Of course, there are performance and memory-usage differences from one
version to the next, but the functionality works correctly.)
If you use characters from the supplementary planes ("astral characters"),
then:
* wide builds will behave correctly;
* narrow builds will wrongly treat astral characters as two
independent characters, which means functions like len()
and string slicing will do the wrong thing;
* Python 3.3 doesn't use narrow and wide builds any more,
and also behaves correctly with astral characters.
So there are three strategies for correct Unicode support in Python:
* avoid astral characters (and trust your users will also avoid them);
* use a wide build;
* use Python 3.3 or higher.
In case you are wondering what Python 3.3 does differently, when it builds a
string, it works out the largest code point in the string. If the largest
code point is no greater than U+00FF, it stores the string in Latin 1 using
8 bits per character; if the largest code point is no greater than U+FFFF,
then it uses UTF-16 (or UCS-2, since with the BMP they are functionally the
same); if the string contains any astral characters, then it uses UTF-32.
So regardless of the string, each character uses a single code unit. Only
the size of the code unit varies.
--
Steven
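Both points can be observed from Python code, with a caveat. The sketch below assumes CPython 3.3 or later: `sys.getsizeof` reports an implementation detail, so the per-character widths (expected to come out as 1, 2 and 4 bytes) are not guaranteed by the language. `bytes_per_char` is a hypothetical helper.

```python
import sys

clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, an astral character

# One character, one code point, regardless of how it is stored.
print(len(clef))       # 1
print(hex(ord(clef)))  # 0x1d11e


def bytes_per_char(ch):
    """Estimate storage per character by comparing two string lengths."""
    return (sys.getsizeof(ch * 1001) - sys.getsizeof(ch)) // 1000


# ASCII, a BMP character above U+00FF, and an astral character.
for ch in ("A", "\u03be", clef):
    print(hex(ord(ch)), bytes_per_char(ch))
```

Subtracting the size of the one-character string cancels the object header, which is why the difference isolates the per-character cost.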
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2014-09-07 15:52 +0100 |
| Message-ID | <mailman.13849.1410101559.18130.python-list@python.org> |
| In reply to | #77666 |
On 2014-09-07 02:47, Steven D'Aprano wrote:
> Kurt Mueller wrote:
>
>> Processing any Unicode string will work with small and wide
>> python 2.7 builds and also with python >3.3?
>> ( parts of small build python will not work with values over 0xFFFF )
>> ( strings with surrogate pairs will not work correctly on small build
>> python )
>
>
> If you limit yourself to code points in the Basic Multilingual Plane, U+0000
> to U+FFFF, then Python's Unicode handling works fine no matter what version
> or implementation is used. Since most people use only the BMP, you may not
> notice any problems.
>
> (Of course, there are performance and memory-usage differences from one
> version to the next, but the functionality works correctly.)
>
> If you use characters from the supplementary planes ("astral characters"),
> then:
>
> * wide builds will behave correctly;
> * narrow builds will wrongly treat astral characters as two
> independent characters, which means functions like len()
> and string slicing will do the wrong thing;
> * Python 3.3 doesn't use narrow and wide builds any more,
> and also behaves correctly with astral characters.
>
>
> So there are three strategies for correct Unicode support in Python:
>
> * avoid astral characters (and trust your users will also avoid them);
>
> * use a wide build;
>
> * use Python 3.3 or higher.
>
>
> In case you are wondering what Python 3.3 does differently, when it builds a
> string, it works out the largest code point in the string. If the largest
> code point is no greater than U+00FF, it stores the string in Latin 1 using
> 8 bits per character; if the largest code point is no greater than U+FFFF,
> then it uses UTF-16 (or UCS-2, since with the BMP they are functionally the
> same); if the string contains any astral characters, then it uses UTF-32.
> So regardless of the string, each character uses a single code unit. Only
> the size of the code unit varies.
>
I don't think you should be saying that it stores the string in Latin-1
or UTF-16 because that might suggest that they are encoded. They aren't.
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-09-08 03:02 +1000 |
| Message-ID | <540c8fc4$0$29973$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #77672 |
MRAB wrote:
> I don't think you should be saying that it stores the string in Latin-1
> or UTF-16 because that might suggest that they are encoded. They aren't.
Of course they are encoded. Memory consists of bytes, not Unicode code
points, which are abstract numbers representing characters (and other
things). You can't store "ξ" (U+03BE) in memory, you can only store a
particular representation of that "ξ" in bytes, and that representation is
called an encoding. Of course you can create whatever representation you
like, or you can use an established encoding rather than re-invent the
wheel. Here are four established encodings which support that code point,
and the bytes that are used:
py> u'ξ'.encode('iso-8859-7')
'\xee'
py> u'ξ'.encode('utf-8')
'\xce\xbe'
py> u'ξ'.encode('utf-16be')
'\x03\xbe'
py> u'ξ'.encode('utf-32be')
'\x00\x00\x03\xbe'
--
Steven
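The session above is from Python 2, where `encode` returns a `str` of bytes. A rough Python 3 equivalent, offered as a sketch rather than as part of the original post (the variable name `xi` is just for illustration), also checks that each encoding round-trips:

```python
xi = "\u03be"  # GREEK SMALL LETTER XI

for codec in ("iso-8859-7", "utf-8", "utf-16be", "utf-32be"):
    data = xi.encode(codec)
    # Decoding with the same codec recovers the original code point.
    assert data.decode(codec) == xi
    print(codec, data)
```

In Python 3 the results are `bytes` objects, so they print with a `b` prefix.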
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-09-07 10:53 -0700 |
| Message-ID | <8b80fe39-4aea-4a17-a1a6-a44f0b42fb7b@googlegroups.com> |
| In reply to | #77675 |
On Sunday, September 7, 2014 10:33:26 PM UTC+5:30, Steven D'Aprano wrote:
> MRAB wrote:
> > I don't think you should be saying that it stores the string in Latin-1
> > or UTF-16 because that might suggest that they are encoded. They aren't.
> Of course they are encoded. Memory consists of bytes, not Unicode code
> points, which are abstract numbers representing characters (and other
> things). You can't store "ξ" (U+03BE) in memory, you can only store a
> particular representation of that "ξ" in bytes, and that representation is
> called an encoding. Of course you can create whatever representation you
> like, or you can use an established encoding rather than re-invent the
> wheel. Here are four established encodings which support that code point,
> and the bytes that are used:
> py> u'ξ'.encode('iso-8859-7')
> '\xee'
> py> u'ξ'.encode('utf-8')
> '\xce\xbe'
> py> u'ξ'.encode('utf-16be')
> '\x03\xbe'
> py> u'ξ'.encode('utf-32be')
> '\x00\x00\x03\xbe'
Dunno about philosophical questions -- especially unicode :-)
What I can see (python 3) which is I guess what MRAB was pointing out:
>>> "".encode
<built-in method encode of str object at 0x7f3955da3848>
>>> "".decode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
>>> b"".decode
<built-in method decode of bytes object at 0x7f39549fda08>
>>> b"".encode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'encode'
>>>
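The same asymmetry can be checked without triggering tracebacks. This is a small sketch, assuming Python 3; the names `text` and `raw` are illustrative only:

```python
text = "\u03be"            # a str: a sequence of code points
raw = text.encode("utf-8")  # a bytes object: a sequence of 8-bit values

# str can be encoded but not decoded; bytes is the other way round.
print(hasattr(text, "encode"), hasattr(text, "decode"))
print(hasattr(raw, "decode"), hasattr(raw, "encode"))

# The two methods are inverses when the same codec is used.
assert raw.decode("utf-8") == text
```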
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-09-08 04:08 +1000 |
| Message-ID | <540c9f19$0$29999$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #77676 |
Rustom Mody wrote:
> On Sunday, September 7, 2014 10:33:26 PM UTC+5:30, Steven D'Aprano wrote:
>> MRAB wrote:
>
>> > I don't think you should be saying that it stores the string in Latin-1
>> > or UTF-16 because that might suggest that they are encoded. They
>> > aren't.
>
>> Of course they are encoded. Memory consists of bytes, not Unicode code
>> points, [...]
> Dunno about philosophical questions -- especially unicode :-)
> What I can see (python 3) which is I guess what MRAB was pointing out:
>
>>>> "".encode
> <built-in method encode of str object at 0x7f3955da3848>
>
>>>> "".decode
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> AttributeError: 'str' object has no attribute 'decode'
What's your point? I'm talking about the implementation of how strings are
stored in memory, not what methods the str class provides.
--
Steven
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-09-07 11:34 -0700 |
| Message-ID | <c6fd8f40-06aa-4332-8f96-8801b8792f49@googlegroups.com> |
| In reply to | #77678 |
On Sunday, September 7, 2014 11:38:41 PM UTC+5:30, Steven D'Aprano wrote:
> Rustom Mody wrote:
> > On Sunday, September 7, 2014 10:33:26 PM UTC+5:30, Steven D'Aprano wrote:
> >> MRAB wrote:
> >> > I don't think you should be saying that it stores the string in Latin-1
> >> > or UTF-16 because that might suggest that they are encoded. They
> >> > aren't.
> >> Of course they are encoded. Memory consists of bytes, not Unicode code
> >> points, [...]
> > Dunno about philosophical questions -- especially unicode :-)
> > What I can see (python 3) which is I guess what MRAB was pointing out:
> >>>> "".encode
> >>>> "".decode
> > Traceback (most recent call last):
> > AttributeError: 'str' object has no attribute 'decode'
> What's your point? I'm talking about the implementation of how strings are
> stored in memory, not what methods the str class provides.
The methods (un)available reflect what're the (in)valid operations on the type:
Strings
    The items of a string object are Unicode code units. Conversion from and
    to other encodings are possible through the string method encode().
Bytes
    A bytes object is an immutable array. The items are 8-bit bytes,
    represented by integers in the range 0 <= x < 256. Bytes literals
    (like b'abc') and the built-in function bytes() can be used to construct
    bytes objects. Also, bytes objects can be decoded to strings via the
    decode() method.
From https://docs.python.org/3.1/reference/datamodel.html#the-standard-type-hierarchy
IOW I interpret MRAB's statement that strings should not be thought
of as encoded because they consist of abstract code-points, seems to me (a unicode-ignoramus!) a reasonable outlook
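The distinction drawn in that documentation excerpt ties back to the thread's subject line, turning a string into a list of integers. A minimal sketch, assuming Python 3 (the variable names are illustrative, not from the thread): the integers you get depend on whether you enumerate the code points of a `str` or the items of an encoded `bytes` object.

```python
text = "a\u03be"  # LATIN SMALL LETTER A followed by GREEK SMALL LETTER XI

# Items of a str are one-character strings; ord() gives each code point.
code_points = [ord(ch) for ch in text]

# Items of a bytes object are already integers in range(256).
byte_values = list(text.encode("utf-8"))

print(code_points)  # [97, 958]
print(byte_values)  # [97, 206, 190]
```

The two lists have different lengths because UTF-8 uses two bytes for U+03BE, while the `str` holds it as a single item.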
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-09-08 10:14 +1000 |
| Message-ID | <mailman.13859.1410135272.18130.python-list@python.org> |
| In reply to | #77679 |
On Mon, Sep 8, 2014 at 4:34 AM, Rustom Mody <rustompmody@gmail.com> wrote:
> IOW I interpret MRAB's statement that strings should not be thought
> of as encoded because they consist of abstract code-points, seems to me (a unicode-ignoramus!) a reasonable outlook
The original question was regarding storage - how PEP 393 says that
strings will be encoded in memory in any of three ways (Latin-1,
UCS-2/UTF-16, or UCS-4/UTF-32). But even in our world, that is not
what a string *is*, but only what it is made of.
ChrisA
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2014-09-08 08:44 +0300 |
| Message-ID | <8738c2ekex.fsf@elektro.pacujo.net> |
| In reply to | #77691 |
Chris Angelico <rosuav@gmail.com>:
> The original question was regarding storage - how PEP 393 says that
> strings will be encoded in memory in any of three ways (Latin-1,
> UCS-2/UTF-16, or UCS-4/UTF-32). But even in our world, that is not
> what a string *is*, but only what it is made of.
I'm a bit surprised that kind of CPython implementation detail would go
into a PEP. I had thought PEPs codified Python independently of CPython.
But maybe CPython is to Python what England is to the UK: even the
government is having a hard time making a distinction.
Marko