

Groups > comp.lang.python > #77479 > unrolled thread

How to turn a string into a list of integers?

Started by cl@isbd.net
First post 2014-09-03 13:27 +0100
Last post 2014-09-03 07:30 -0700
Articles 15 on this page of 35 — 14 participants



Contents

  How to turn a string into a list of integers? cl@isbd.net - 2014-09-03 13:27 +0100
    Re: How to turn a string into a list of integers? Peter Otten <__peter__@web.de> - 2014-09-03 14:52 +0200
      Re: How to turn a string into a list of integers? cl@isbd.net - 2014-09-03 15:48 +0100
        Re: How to turn a string into a list of integers? Joshua Landau <joshua@landau.ws> - 2014-09-04 22:06 +0100
          Re: How to turn a string into a list of integers? cl@isbd.net - 2014-09-05 09:42 +0100
            Re: How to turn a string into a list of integers? Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2014-09-05 19:56 +0200
              Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-06 15:47 +1000
                Re: How to turn a string into a list of integers? Peter Otten <__peter__@web.de> - 2014-09-06 10:22 +0200
                  Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-06 21:17 +1000
                Re: How to turn a string into a list of integers? Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2014-09-06 14:15 +0200
                  Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-07 04:19 +1000
                    Re: How to turn a string into a list of integers? Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2014-09-06 21:28 +0200
                      Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-07 11:47 +1000
                        Re: How to turn a string into a list of integers? MRAB <python@mrabarnett.plus.com> - 2014-09-07 15:52 +0100
                          Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-08 03:02 +1000
                            Re: How to turn a string into a list of integers? Rustom Mody <rustompmody@gmail.com> - 2014-09-07 10:53 -0700
                              Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-08 04:08 +1000
                                Re: How to turn a string into a list of integers? Rustom Mody <rustompmody@gmail.com> - 2014-09-07 11:34 -0700
                                  Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-08 10:14 +1000
                                    Re: How to turn a string into a list of integers? Marko Rauhamaa <marko@pacujo.net> - 2014-09-08 08:44 +0300
                                      Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-08 15:53 +1000
                                      Re: How to turn a string into a list of integers? Terry Reedy <tjreedy@udel.edu> - 2014-09-08 03:41 -0400
                        Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-08 01:04 +1000
                          Re: How to turn a string into a list of integers? Roy Smith <roy@panix.com> - 2014-09-07 11:40 -0400
                            Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-08 04:00 +1000
                            Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-08 10:12 +1000
                Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-06 22:23 +1000
            Re: How to turn a string into a list of integers? Chris “Kwpolska” Warrick <kwpolska@gmail.com> - 2014-09-05 20:25 +0200
            Re: How to turn a string into a list of integers? Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2014-09-05 21:16 +0200
            Re: How to turn a string into a list of integers? Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2014-09-05 22:41 +0200
        Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-05 10:12 +1000
        Re: How to turn a string into a list of integers? Ian Kelly <ian.g.kelly@gmail.com> - 2014-09-04 20:09 -0600
        Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-05 12:15 +1000
          Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-06 14:27 +1000
    Re: How to turn a string into a list of integers? obedrios@gmail.com - 2014-09-03 07:30 -0700



#77694

From Chris Angelico <rosuav@gmail.com>
Date 2014-09-08 15:53 +1000
Message-ID <mailman.13861.1410155589.18130.python-list@python.org>
In reply to #77693

On Mon, Sep 8, 2014 at 3:44 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Chris Angelico <rosuav@gmail.com>:
>
>> The original question was regarding storage - how PEP 393 says that
>> strings will be encoded in memory in any of three ways (Latin-1,
>> UCS-2/UTF-16, or UCS-4/UTF-32). But even in our world, that is not
>> what a string *is*, but only what it is made of.
>
> I'm a bit surprised that kind of CPython implementation detail would go
> into a PEP. I had thought PEPs codified Python independently of CPython.
>
> But maybe CPython is to Python what England is to the UK: even the
> government is having a hard time making a distinction.

It is a bit of a tricky one. The PEP governs things about the API that
CPython offers to extensions, so it's part of the public face of the
language - it's not "purely an implementation detail" like, say, the
exact algorithm for expanding a list's capacity as elements get
appended to it.

ChrisA



#77696

From Terry Reedy <tjreedy@udel.edu>
Date 2014-09-08 03:41 -0400
Message-ID <mailman.13863.1410162116.18130.python-list@python.org>
In reply to #77693

On 9/8/2014 1:44 AM, Marko Rauhamaa wrote:
> Chris Angelico <rosuav@gmail.com>:
>
>> The original question was regarding storage - how PEP 393 says that
>> strings will be encoded in memory in any of three ways (Latin-1,
>> UCS-2/UTF-16, or UCS-4/UTF-32). But even in our world, that is not
>> what a string *is*, but only what it is made of.
>
> I'm a bit surprised that kind of CPython implementation detail would go
> into a PEP. I had thought PEPs codified Python independently of CPython.

There are multiple PEPs that are not strictly about Python the language:
meta-PEPs, informational PEPs, and others about the CPython API,
implementation, and distribution issues. 393 is followed by
397  Python launcher for Windows  Hammond, v. Löwis
http://legacy.python.org/dev/peps/

The PEP process is a tool for the core development team, and the product
is documentation of decisions and actions thereby.

> But maybe CPython is to Python what England is to the UK: even the
> government is having a hard time making a distinction.

We don't.

-- 
Terry Jan Reedy



#77673

From Chris Angelico <rosuav@gmail.com>
Date 2014-09-08 01:04 +1000
Message-ID <mailman.13850.1410102277.18130.python-list@python.org>
In reply to #77666

On Mon, Sep 8, 2014 at 12:52 AM, MRAB <python@mrabarnett.plus.com> wrote:
> I don't think you should be saying that it stores the string in Latin-1
> or UTF-16 because that might suggest that they are encoded. They aren't.

Except that they are. RAM stores bytes [1], so by definition
everything that's in memory is encoded. You can't store a list in
memory; what you store is a set of bits which represent some metadata
and a bunch of pointers. You can't store a non-integer in memory, so
you use some kind of efficient packed system like IEEE 754. You can't
even store an integer without using some kind of encoding, most likely
by packing it into some number of bytes and laying those bytes out
either smallest first or largest first. So yes, CPython 3.3 stores
strings encoded Latin-1, UCS-2 [2], or UCS-4. The Python string *is* a
sequence of characters, but it's *stored* as a sequence of bytes in
one of those encodings. (And other Pythons may not use the same
encodings. MicroPython uses UTF-8 internally, which gives it *very*
different indexing performance.)

ChrisA

[1] On modern systems it stores larger units, probably 64-bit or
128-bit hunks, but whatever. Same difference.
[2] As Steven says, UTF-16 or UCS-2. I prefer the latter name here, as
it (like Latin-1) is restricted in character set rather than variable
in length. But same thing.
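
The claim that every value in memory is an encoding can be sketched in Python 3. This is only an illustration, not code from the thread: the helper name `bytes_per_char` is hypothetical, and the per-character sizes it reports depend on how CPython 3.3 or newer implements `sys.getsizeof` for strings, so other implementations may give different figures.

```python
import struct
import sys

# An integer packed into 4 bytes, smallest byte first and largest byte first.
n = 1000  # 0x03E8
little = n.to_bytes(4, "little")
big = n.to_bytes(4, "big")

# A non-integer packed as an IEEE 754 double (8 bytes, big-endian).
half = struct.pack(">d", 0.5)


def bytes_per_char(ch, count=1000):
    """Estimate storage per character by comparing two string sizes.

    Hypothetical helper; only meaningful on CPython 3.3+ (PEP 393).
    """
    return sys.getsizeof(ch * (count + 1)) - sys.getsizeof(ch * count)


print(little.hex(), big.hex())   # e8030000 000003e8
print(half.hex())                # 3fe0000000000000
for ch in ("a", "\u0394", "\U0001F600"):
    print("U+{:04X}".format(ord(ch)), bytes_per_char(ch))
```

On CPython 3.3 or newer the loop would be expected to report 1, 2, and 4 bytes per character respectively, matching the Latin-1, UCS-2, and UCS-4 storage forms mentioned above.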



#77674

From Roy Smith <roy@panix.com>
Date 2014-09-07 11:40 -0400
Message-ID <roy-6B2B20.11402707092014@news.panix.com>
In reply to #77673

In article <mailman.13850.1410102277.18130.python-list@python.org>,
 Chris Angelico <rosuav@gmail.com> wrote:

> You can't store a list in memory; what you store is a set of bits 
> which represent some metadata and a bunch of pointers.


Well, technically, what you store is something which has the right 
behavior.  If I wrote:

my_huffman_coded_list = [0] * 1000000

I don't know of anything which requires Python to actually generate a 
million 0's and store them somewhere (even ignoring interning of 
integers).  As long as it generated an object (perhaps a subclass of 
list) which responded to all of list's methods the same way a real list 
would, it could certainly build a more compact representation.



#77677

From Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date 2014-09-08 04:00 +1000
Message-ID <540c9d41$0$24963$c3e8da3$5496439d@news.astraweb.com>
In reply to #77674

Roy Smith wrote:

> In article <mailman.13850.1410102277.18130.python-list@python.org>,
>  Chris Angelico <rosuav@gmail.com> wrote:
> 
>> You can't store a list in memory; what you store is a set of bits
>> which represent some metadata and a bunch of pointers.
> 
> 
> Well, technically, what you store is something which has the right
> behavior.  If I wrote:

No. Chris is right: what you store is bits, because that is how memory in
modern hardware works. Some old Soviet era mainframes used trits, trinary
digits, instead of bits, but that's just a footnote in the history of
computing. We're talking implementation here, not interface. The
implementation of a list in memory is bits, not trits, nor holograms,
fragments of DNA, or nanometre-sized carbon rods. Some day there may be
computers with implementations not based on bits, but this is not that day.

[Aside: technically, very few if any modern memory chips support direct
access to individual bits, only to words, which these days are usually
8-bit bytes. At the hardware level, the bits may not even be implemented as
individual on/off two-state switches. So technically we should be talking
about bytes rather than bits, since that's the interface which the memory
chip provides. What it does internally is up to the chip designer.]


> my_huffman_coded_list = [0] * 1000000
> 
> I don't know of anything which requires Python to actually generate a
> million 0's and store them somewhere (even ignoring interning of
> integers).  

That's a tricky question. There's not a lot of wiggle-room for what a
compliant implementation of Python can do here, but there is some. The
language requires that:

- the expression must generate a genuine built-in list of length 
  exactly one million;

- even if you monkey-patch builtins and replace list with something
  else, you still get a genuine list, not the monkey-patched version;

- all million items must refer to the same instance;

- regardless of whether that instance is cached (like 0) or not;

- reading and writing to any index must be O(1);

- although I guess it would be acceptable if that was O(1) amortised.

(Not all of these requirements are documented per se; some are taken from
the reference implementation, CPython, and some from comments made by
Guido, e.g. he has stated that he would consider it unacceptable for a
Python implementation to use a linked list instead of an array for lists,
because of the performance implications.)

Beyond that, an implementation could do anything it likes. If the
implementer can come up with some clever way to have [0]*1000000 use less
memory while not compromising the above, they can do so.


> As long as it generated an object (perhaps a subclass of 
> list) which responded to all of list's methods the same way a real list
> would, it could certainly build a more compact representation.

However a subclass of list would not be acceptable. 


-- 
Steven
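
A few of the requirements listed above can be checked directly. The following is a minimal sketch rather than a conformance test: `FakeList`, `marker`, and the other names are hypothetical, and the monkey-patch of `builtins.list` is undone in a `finally` block so that it does not leak into other code.

```python
import builtins


class FakeList(list):
    """Stand-in used to monkey-patch the built-in name `list`."""


original_list = builtins.list
builtins.list = FakeList
try:
    # List displays and list multiplication do not look up the name `list`.
    zeros = [0] * 1000000
    marker = object()
    shared = [marker] * 1000
finally:
    builtins.list = original_list

print(type(zeros) is original_list)               # True
print(len(zeros))                                 # 1000000
print(all(item is marker for item in shared))     # True
```

The check covers the "genuine list", "exact length", and "same instance" points only; the O(1) indexing requirement is a performance property and is not something a snippet like this can demonstrate.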



#77690

From Chris Angelico <rosuav@gmail.com>
Date 2014-09-08 10:12 +1000
Message-ID <mailman.13858.1410135142.18130.python-list@python.org>
In reply to #77674

On Mon, Sep 8, 2014 at 1:40 AM, Roy Smith <roy@panix.com> wrote:
> Well, technically, what you store is something which has the right
> behavior.  If I wrote:
>
> my_huffman_coded_list = [0] * 1000000
>
> I don't know of anything which requires Python to actually generate a
> million 0's and store them somewhere (even ignoring interning of
> integers).  As long as it generated an object (perhaps a subclass of
> list) which responded to all of list's methods the same way a real list
> would, it could certainly build a more compact representation.

Steven hinted at it, but I'll say one thing more explicitly here:
There's actually something that requires Python to *not* generate a
million 0 integers. What you get is a million references to the *same*
zero.

>>> another_list = [object()] * 1000000
>>> sum(id(x) for x in another_list)
140287290433648000000
>>> id(another_list[0]) * len(another_list)
140287290433648000000

The two figures are guaranteed to be the same, because these are all the same object.

But what you're talking about here is an alternative encoding. And
it's definitely possible for different Pythons to encode strings
differently; uPy uses UTF-8 internally, which gives different
performance metrics, but guarantees the same semantics; I could
imagine someone implementing a Python interpreter in Pike, and using
the Pike string type to store Python strings (the semantics will all
be correct, as it's a Unicode string; the most notable difference is
that Pike strings are guaranteed to be interned, so all equality
comparisons are identity checks); if you wanted to, I'm sure you could
build a Python that uses a dictionary of words (added to every time
you create a string, of course), and actually represents entire words
as short integers, which would mean individual characters aren't
necessarily represented directly. But somehow, you have to turn the
concept of "sequence of Unicode characters" into some well-defined
sequence of bytes, and that's an encoding.

ChrisA
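
The shared-reference behaviour of list multiplication has a visible consequence once the repeated object is mutable. A small sketch, using the hypothetical names `rows` and `independent`:

```python
rows = [[]] * 3                         # three references to one inner list
rows[0].append("x")

independent = [[] for _ in range(3)]    # three distinct inner lists
independent[0].append("x")

print(rows)          # [['x'], ['x'], ['x']]
print(independent)   # [['x'], [], []]

# Count the distinct objects in each outer list.
print(len({id(r) for r in rows}), len({id(r) for r in independent}))  # 1 3
```

With an immutable object such as `0` the sharing is invisible in ordinary use, which is why the `id()` comparison above is needed to observe it.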



#77651

From Chris Angelico <rosuav@gmail.com>
Date 2014-09-06 22:23 +1000
Message-ID <mailman.13834.1410006208.18130.python-list@python.org>
In reply to #77636

On Sat, Sep 6, 2014 at 10:15 PM, Kurt Mueller
<kurt.alfred.mueller@gmail.com> wrote:
> I understand: narrow build is UCS2, wide build is UCS4
> - In a UCS2 build each character of an Unicode string uses 16 Bits and has
>   code points from U-0000..U-FFFF (0..65535)
> - In a UCS4 build each character of an Unicode string uses 32 Bits and has
>   code points from U-00000000..U-0010FFFF (0..1114111)

Pretty much. Narrow builds are buggy, so as much as possible, you want
to avoid using them. Ideally, use Python 3.3 or newer, where the
distinction doesn't exist - all builds are functionally like wide
builds, with memory usage even better than narrow builds (they'll use
8 bits per character if it's possible).

As a general rule, precompiled Python for Windows is usually a narrow
build, and Python distributions for Linux are usually wide builds. (I
don't know about Mac OS builds.) You can test any Python by checking
out sys.maxunicode - it'll be 65535 on a narrow build, or 1114111 on
wide builds (because that's the maximum codepoint defined by Unicode -
U+10FFFF - as it's the highest number that can be represented in
UTF-16).

ChrisA
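
The `sys.maxunicode` test described above could be wrapped up as follows. The function name `unicode_build` is hypothetical; the two constants are the values given in the post.

```python
import sys


def unicode_build(maxunicode=sys.maxunicode):
    """Classify a Python build from its sys.maxunicode value."""
    if maxunicode == 0xFFFF:        # 65535
        return "narrow"
    if maxunicode == 0x10FFFF:      # 1114111
        return "wide"
    return "unknown"


print(sys.maxunicode, unicode_build())
```

On Python 3.3 or newer this should always report the wide value, since the narrow/wide distinction no longer exists there.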



#77606

From Chris “Kwpolska” Warrick <kwpolska@gmail.com>
Date 2014-09-05 20:25 +0200
Message-ID <mailman.13803.1409941524.18130.python-list@python.org>
In reply to #77582


On Sep 5, 2014 7:57 PM, "Kurt Mueller" <kurt.alfred.mueller@gmail.com>
wrote:
> Could someone please explain the following behavior to me:
> Python 2.7.7, MacOS 10.9 Mavericks
>
> >>> import sys
> >>> sys.getdefaultencoding()
> 'ascii'
> >>> [ord(c) for c in 'AÄ']
> [65, 195, 132]
> >>> [ord(c) for c in u'AÄ']
> [65, 196]
>
> My obviously wrong understanding:
> 'AÄ' in 'ascii' are two characters
>      one with ord A=65 and
>      one with ord Ä=196 ISO8859-1 <depends on code table>
>      --> why [65, 195, 132]
> u'AÄ' is a Unicode string
>      --> why [65, 196]
>
> It is just the other way round as I would expect.

Basically, the first string is just a bunch of bytes, as provided by your
terminal — which sounds like UTF-8 (perfectly logical in 2014).  The second
one is converted into a real Unicode representation. The codepoint for Ä is
U+00C4 (196 decimal). It's just a coincidence that it also matches latin1
aka ISO 8859-1 as Unicode starts with all 256 latin1 codepoints. Please
kindly forget encodings other than UTF-8.

BTW: ASCII covers only the first 128 bytes.

--
Chris “Kwpolska” Warrick <http://chriswarrick.com/>
Sent from my Galaxy S3.
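
The same distinction can be reproduced in Python 3, where text and bytes are separate types. This is a sketch under the assumption, made in the reply above, that the terminal supplied UTF-8; the variable names are illustrative only.

```python
text = "A\u00c4"   # 'A' followed by U+00C4 (A with diaeresis)

utf8_bytes = list(text.encode("utf-8"))       # what the Python 2 byte string held
code_points = [ord(c) for c in text]          # what the u'' string held
latin1_bytes = list(text.encode("latin-1"))   # same numbers as the code points

print(utf8_bytes)     # [65, 195, 132]
print(code_points)    # [65, 196]
print(latin1_bytes)   # [65, 196]
```

The Latin-1 bytes equal the code points only because the first 256 Unicode code points were taken from Latin-1; for characters outside that range `encode("latin-1")` raises `UnicodeEncodeError`.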



#77610

From Kurt Mueller <kurt.alfred.mueller@gmail.com>
Date 2014-09-05 21:16 +0200
Message-ID <mailman.13806.1409944624.18130.python-list@python.org>
In reply to #77582

On 05.09.2014 at 20:25, Chris “Kwpolska” Warrick <kwpolska@gmail.com> wrote:
> On Sep 5, 2014 7:57 PM, "Kurt Mueller" <kurt.alfred.mueller@gmail.com> wrote:
> > Could someone please explain the following behavior to me:
> > Python 2.7.7, MacOS 10.9 Mavericks
> >
> > >>> import sys
> > >>> sys.getdefaultencoding()
> > 'ascii'
> > >>> [ord(c) for c in 'AÄ']
> > [65, 195, 132]
> > >>> [ord(c) for c in u'AÄ']
> > [65, 196]
> >
> > My obviously wrong understanding:
> > 'AÄ' in 'ascii' are two characters
> >      one with ord A=65 and
> >      one with ord Ä=196 ISO8859-1 <depends on code table>
> >      --> why [65, 195, 132]
> > u'AÄ' is a Unicode string
> >      --> why [65, 196]
> >
> > It is just the other way round as I would expect.
> 
> Basically, the first string is just a bunch of bytes, as provided by your terminal — which sounds like UTF-8 (perfectly logical in 2014).  The second one is converted into a real Unicode representation. The codepoint for Ä is U+00C4 (196 decimal). It's just a coincidence that it also matches latin1 aka ISO 8859-1 as Unicode starts with all 256 latin1 codepoints. Please kindly forget encodings other than UTF-8.

So:
'AÄ' is a UTF-8 string represented by 3 bytes:
A -> 41   -> 65  first byte decimal
Ä -> c384 -> 195 and 132 second and third byte decimal

u'AÄ' is a Unicode string represented by 2 bytes?:
A -> U+0041 -> 65 first byte decimal, 00 is omitted or not yielded by ord()?
Ä -> U+00C4 -> 196 second byte decimal, 00 is omitted or not yielded by ord()?


> BTW: ASCII covers only the first 128 bytes.

ACK
-- 
Kurt Mueller, kurt.alfred.mueller@gmail.com



#77615

From Kurt Mueller <kurt.alfred.mueller@gmail.com>
Date 2014-09-05 22:41 +0200
Message-ID <mailman.13809.1409949686.18130.python-list@python.org>
In reply to #77582

On 05.09.2014 at 21:16, Kurt Mueller <kurt.alfred.mueller@gmail.com> wrote:
> On 05.09.2014 at 20:25, Chris “Kwpolska” Warrick <kwpolska@gmail.com> wrote:
>> On Sep 5, 2014 7:57 PM, "Kurt Mueller" <kurt.alfred.mueller@gmail.com> wrote:
>>> Could someone please explain the following behavior to me:
>>> Python 2.7.7, MacOS 10.9 Mavericks
>>> 
>>>>>> import sys
>>>>>> sys.getdefaultencoding()
>>> 'ascii'
>>>>>> [ord(c) for c in 'AÄ']
>>> [65, 195, 132]
>>>>>> [ord(c) for c in u'AÄ']
>>> [65, 196]
>>> 
>>> My obviously wrong understanding:
>>> 'AÄ' in 'ascii' are two characters
>>>     one with ord A=65 and
>>>     one with ord Ä=196 ISO8859-1 <depends on code table>
>>>     --> why [65, 195, 132]
>>> u'AÄ' is a Unicode string
>>>     --> why [65, 196]
>>> 
>>> It is just the other way round as I would expect.
>> 
>> Basically, the first string is just a bunch of bytes, as provided by your terminal — which sounds like UTF-8 (perfectly logical in 2014).  The second one is converted into a real Unicode representation. The codepoint for Ä is U+00C4 (196 decimal). It's just a coincidence that it also matches latin1 aka ISO 8859-1 as Unicode starts with all 256 latin1 codepoints. Please kindly forget encodings other than UTF-8.
> 
> So:
> 'AÄ' is a UTF-8 string represented by 3 bytes:
> A -> 41   -> 65  first byte decimal
> Ä -> c384 -> 195 and 132 second and third byte decimal
> 
> u'AÄ' is a Unicode string represented by 2 bytes?:
> A -> U+0041 -> 65 first byte decimal, 00 is omitted or not yielded by ord()?
> Ä -> U+00C4 -> 196 second byte decimal, 00 is omitted or not yielded by ord()?

After reading the ord() manual:
The second case should read:
u'AÄ' is a Unicode string represented by 2 Unicode characters:
If Python was built with UCS2 Unicode, then the character's code point must
be in the range [0..65535, 16 bits, U+0000..U+FFFF]
A -> U+0041 ->  65 first  character decimal (code point)
Ä -> U+00C4 -> 196 second character decimal (code point)


Am I right now?

-- 
Kurt Mueller, kurt.alfred.mueller@gmail.com
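
The corrected reading, that `ord()` yields one code point per character rather than bytes, can be illustrated in Python 3. The helper `describe` is a hypothetical name used only for this sketch.

```python
def describe(text):
    """Return (character, 'U+XXXX' label, decimal code point) per character."""
    return [(ch, "U+{:04X}".format(ord(ch)), ord(ch)) for ch in text]


for row in describe("A\u00c4"):
    print(row)
# ('A', 'U+0041', 65)
# ('Ä', 'U+00C4', 196)

# A code point above 65535 is still a single character on Python 3.3+.
print(describe("\U0001F600")[0][1:])   # ('U+1F600', 128512)
```

No bytes are "omitted" by `ord()`: the number of bytes used to store each code point is a separate question, decided by whichever encoding is applied.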



#77564

From Chris Angelico <rosuav@gmail.com>
Date 2014-09-05 10:12 +1000
Message-ID <mailman.13777.1409875971.18130.python-list@python.org>
In reply to #77483

On Fri, Sep 5, 2014 at 7:06 AM, Joshua Landau <joshua@landau.ws> wrote:
> On 3 September 2014 15:48,  <cl@isbd.net> wrote:
>> Peter Otten <__peter__@web.de> wrote:
>>> >>> [ord(c) for c in "This is a string"]
>>> [84, 104, 105, 115, 32, 105, 115, 32, 97, 32, 115, 116, 114, 105, 110, 103]
>>>
>>> There are other ways, but you have to describe the use case and your Python
>>> version for us to recommend the most appropriate.
>>>
>> That looks OK to me.  It's just for outputting a string to the block
>> write command in python-smbus which expects an integer array.
>
> Just be careful about Unicode characters.

If it's a Unicode string (which is the default in Python 3), all
Unicode characters will work correctly. If it's a byte string (the
default in Python 2), then you can't actually have any Unicode
characters in it at all, you have bytes; Py2 lets you be a bit sloppy
with the ASCII range, but technically, you still have bytes, not
characters.

ChrisA



#77566

From Ian Kelly <ian.g.kelly@gmail.com>
Date 2014-09-04 20:09 -0600
Message-ID <mailman.13778.1409882991.18130.python-list@python.org>
In reply to #77483

On Thu, Sep 4, 2014 at 6:12 PM, Chris Angelico <rosuav@gmail.com> wrote:
> If it's a Unicode string (which is the default in Python 3), all
> Unicode characters will work correctly.

Assuming the library that needs this is expecting codepoints and will
accept integers greater than 255.

> If it's a byte string (the
> default in Python 2), then you can't actually have any Unicode
> characters in it at all, you have bytes; Py2 lets you be a bit sloppy
> with the ASCII range, but technically, you still have bytes, not
> characters.

In that case the library will almost certainly accept it, but could be
expecting a different encoding.
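
For the original use case, a block write that expects a list of integers, one way to avoid both problems is to encode explicitly, so that every value is a byte and the encoding is a deliberate choice. The sketch below makes several assumptions: the helper `to_block_data` and the constant `SMBUS_BLOCK_MAX` are hypothetical names, the 32-byte limit is the usual SMBus block size rather than something stated in the thread, and no python-smbus call is made.

```python
SMBUS_BLOCK_MAX = 32  # assumption: usual SMBus block-size limit


def to_block_data(text, encoding="ascii"):
    """Encode text explicitly and return a list of ints in range(256)."""
    data = text.encode(encoding)   # raises UnicodeEncodeError if unrepresentable
    if len(data) > SMBUS_BLOCK_MAX:
        raise ValueError("block too long: {} bytes".format(len(data)))
    return list(data)


print(to_block_data("This is a string"))
# [84, 104, 105, 115, 32, 105, 115, 32, 97, 32, 115, 116, 114, 105, 110, 103]
print(to_block_data("A\u00c4", encoding="utf-8"))   # [65, 195, 132]
```

Defaulting to ASCII means a non-ASCII character fails loudly instead of silently producing values the receiving device may not understand; which encoding the device actually expects has to come from its documentation.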



#77567

From Chris Angelico <rosuav@gmail.com>
Date 2014-09-05 12:15 +1000
Message-ID <mailman.13779.1409883321.18130.python-list@python.org>
In reply to #77483

On Fri, Sep 5, 2014 at 12:09 PM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> On Thu, Sep 4, 2014 at 6:12 PM, Chris Angelico <rosuav@gmail.com> wrote:
>> If it's a Unicode string (which is the default in Python 3), all
>> Unicode characters will work correctly.
>
> Assuming the library that needs this is expecting codepoints and will
> accept integers greater than 255.

They're still valid integers. It's just that someone might not know
how to work with them. Everyone has limits - I don't think repr()
would like to be fed Graham's Number, for instance, but we still say
that it accepts integers :)

>> If it's a byte string (the
>> default in Python 2), then you can't actually have any Unicode
>> characters in it at all, you have bytes; Py2 lets you be a bit sloppy
>> with the ASCII range, but technically, you still have bytes, not
>> characters.
>
> In that case the library will almost certainly accept it, but could be
> expecting a different encoding.

Yeah. Either way, the problem isn't "be careful about Unicode
characters". One option has Unicode characters, the other doesn't, and
you need to know which one it is.

I just don't like people talking about "Unicode characters" being
somehow different from "normal text" or something, and being something
that you need to be careful of. It's not that there are some
characters that behave nicely, and then other ones ("Unicode" ones)
that don't.

ChrisA



#77635

From Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date 2014-09-06 14:27 +1000
Message-ID <540a8d1a$0$29972$c3e8da3$5496439d@news.astraweb.com>
In reply to #77567

Chris Angelico wrote:

> On Fri, Sep 5, 2014 at 12:09 PM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
>> On Thu, Sep 4, 2014 at 6:12 PM, Chris Angelico <rosuav@gmail.com> wrote:
>>> If it's a Unicode string (which is the default in Python 3), all
>>> Unicode characters will work correctly.
>>
>> Assuming the library that needs this is expecting codepoints and will
>> accept integers greater than 255.
> 
> They're still valid integers. It's just that someone might not know
> how to work with them. Everyone has limits - I don't think repr()
> would like to be fed Graham's Number, for instance, but we still say
> that it accepts integers :)

If you can fit Graham's Number into memory, repr() will happily deal with
it. Although, it might take a while to print out...

[...]
> I just don't like people talking about "Unicode characters" being
> somehow different from "normal text" or something, and being something
> that you need to be careful of. It's not that there are some
> characters that behave nicely, and then other ones ("Unicode" ones)
> that don't.

"Behave nicely" depends on what behaviour you're expecting.

There is a sense in which Unicode is different from ASCII text. ASCII is a 7
bit character set. In principle, you could have different implementations
of ASCII but in practice it's been so long since any machine you're likely
to come across uses anything but exactly a single 8-bit byte for each ASCII
character that we might as well say that ASCII has a single implementation:

* 1 byte code units, fixed width characters 

That is, every character takes exactly one 8-bit byte.

(Reminder: "byte" does not necessarily mean 8 bits.)

Unicode, on the other hand, has *at least* nine different implementations
which you are *likely* to come across:

* UTF-8 has 1-byte code units, variable width characters: every character
takes between 1 and 4 bytes;

* UTF-8 with a so-called "Byte Order Mark" at the beginning of the file;

* UTF-16-BE has 2-byte code units, variable width characters: every
character takes either 2 or 4 bytes;

* UTF-16-LE is the same, but the bytes are in opposite order;

* UTF-16 with a Byte Order Mark at the beginning of the file;

* UTF-32-BE has 4-byte code units, fixed width characters; every character
takes exactly 4 bytes;

* UTF-32-LE is the same, but the bytes are in opposite order;

* UTF-32 with a Byte Order Mark at the beginning of the file;

* UCS-2 is a subset of Unicode with 2-byte code units, fixed width
characters; every character takes exactly 2 bytes (UCS-2 is effectively
UTF-16-BE for characters in the Basic Multilingual Plane).

Plus various more obscure or exotic encodings.

So, while it is not *strictly* correct to say that ASCII character 'A' is
always the eight bits 01000001, the exceptions are so rare that there might
as well not be any. But the Unicode character 'A' could be:

01000001
01000001 00000000
00000000 01000001
01000001 00000000 00000000 00000000
00000000 00000000 00000000 01000001


and possibly more.


-- 
Steven
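
The five bit patterns listed for 'A' can be reproduced with Python 3's standard codecs. The helper name `bits` is hypothetical; the codec names are the standard ones.

```python
def bits(text, encoding):
    """Render the encoded bytes of text as space-separated 8-bit groups."""
    return " ".join("{:08b}".format(b) for b in text.encode(encoding))


for codec in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    print("{:<10} {}".format(codec, bits("A", codec)))
# utf-8      01000001
# utf-16-le  01000001 00000000
# utf-16-be  00000000 01000001
# utf-32-le  01000001 00000000 00000000 00000000
# utf-32-be  00000000 00000000 00000000 01000001

# The "with Byte Order Mark" UTF-8 variant prepends three extra bytes.
print("A".encode("utf-8-sig"))   # b'\xef\xbb\xbfA'
```

The plain `utf-16` and `utf-32` codecs are left out of the loop because they add a Byte Order Mark and pick the byte order of the machine they run on, so their output is not the same everywhere.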



#77481

From obedrios@gmail.com
Date 2014-09-03 07:30 -0700
Message-ID <aea5e00d-1490-4e8b-b04c-6be052fea213@googlegroups.com>
In reply to #77479

On Wednesday, 3 September 2014 05:27:29 UTC-7, c...@isbd.net wrote:
> I know I can get a list of the characters in a string by simply doing:-
>
>     listOfCharacters = list("This is a string")
>
> ... but how do I get a list of integers?
>
> --
> Chris Green

You can apply either a map function or a list comprehension, as follows:

Using Map:
>>> list(map(ord, listOfCharacters))
[84, 104, 105, 115, 32, 105, 115, 32, 97, 32, 115, 116, 114, 105, 110, 103]

Using List Comprehension:

>>> [ord(n) for n in listOfCharacters]
[84, 104, 105, 115, 32, 105, 115, 32, 97, 32, 115, 116, 114, 105, 110, 103]

Very Best Regards
