Groups > comp.lang.python > #77479 > unrolled thread
| Started by | cl@isbd.net |
|---|---|
| First post | 2014-09-03 13:27 +0100 |
| Last post | 2014-09-03 07:30 -0700 |
| Articles | 15 on this page of 35 — 14 participants |
How to turn a string into a list of integers? cl@isbd.net - 2014-09-03 13:27 +0100
Re: How to turn a string into a list of integers? Peter Otten <__peter__@web.de> - 2014-09-03 14:52 +0200
Re: How to turn a string into a list of integers? cl@isbd.net - 2014-09-03 15:48 +0100
Re: How to turn a string into a list of integers? Joshua Landau <joshua@landau.ws> - 2014-09-04 22:06 +0100
Re: How to turn a string into a list of integers? cl@isbd.net - 2014-09-05 09:42 +0100
Re: How to turn a string into a list of integers? Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2014-09-05 19:56 +0200
Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-06 15:47 +1000
Re: How to turn a string into a list of integers? Peter Otten <__peter__@web.de> - 2014-09-06 10:22 +0200
Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-06 21:17 +1000
Re: How to turn a string into a list of integers? Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2014-09-06 14:15 +0200
Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-07 04:19 +1000
Re: How to turn a string into a list of integers? Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2014-09-06 21:28 +0200
Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-07 11:47 +1000
Re: How to turn a string into a list of integers? MRAB <python@mrabarnett.plus.com> - 2014-09-07 15:52 +0100
Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-08 03:02 +1000
Re: How to turn a string into a list of integers? Rustom Mody <rustompmody@gmail.com> - 2014-09-07 10:53 -0700
Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-08 04:08 +1000
Re: How to turn a string into a list of integers? Rustom Mody <rustompmody@gmail.com> - 2014-09-07 11:34 -0700
Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-08 10:14 +1000
Re: How to turn a string into a list of integers? Marko Rauhamaa <marko@pacujo.net> - 2014-09-08 08:44 +0300
Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-08 15:53 +1000
Re: How to turn a string into a list of integers? Terry Reedy <tjreedy@udel.edu> - 2014-09-08 03:41 -0400
Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-08 01:04 +1000
Re: How to turn a string into a list of integers? Roy Smith <roy@panix.com> - 2014-09-07 11:40 -0400
Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-08 04:00 +1000
Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-08 10:12 +1000
Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-06 22:23 +1000
Re: How to turn a string into a list of integers? Chris “Kwpolska” Warrick <kwpolska@gmail.com> - 2014-09-05 20:25 +0200
Re: How to turn a string into a list of integers? Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2014-09-05 21:16 +0200
Re: How to turn a string into a list of integers? Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2014-09-05 22:41 +0200
Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-05 10:12 +1000
Re: How to turn a string into a list of integers? Ian Kelly <ian.g.kelly@gmail.com> - 2014-09-04 20:09 -0600
Re: How to turn a string into a list of integers? Chris Angelico <rosuav@gmail.com> - 2014-09-05 12:15 +1000
Re: How to turn a string into a list of integers? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-09-06 14:27 +1000
Re: How to turn a string into a list of integers? obedrios@gmail.com - 2014-09-03 07:30 -0700
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-09-08 15:53 +1000 |
| Message-ID | <mailman.13861.1410155589.18130.python-list@python.org> |
| In reply to | #77693 |
On Mon, Sep 8, 2014 at 3:44 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Chris Angelico <rosuav@gmail.com>:
>
>> The original question was regarding storage - how PEP 393 says that
>> strings will be encoded in memory in any of three ways (Latin-1,
>> UCS-2/UTF-16, or UCS-4/UTF-32). But even in our world, that is not
>> what a string *is*, but only what it is made of.
>
> I'm a bit surprised that kind of CPython implementation detail would go
> into a PEP. I had thought PEPs codified Python independently of CPython.
>
> But maybe CPython is to Python what England is to the UK: even the
> government is having a hard time making a distinction.

It is a bit of a tricky one. The PEP governs things about the API that
CPython offers to extensions, so it's part of the public face of the
language - it's not "purely an implementation detail" like, say, the
exact algorithm for expanding a list's capacity as elements get
appended to it.

ChrisA
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2014-09-08 03:41 -0400 |
| Message-ID | <mailman.13863.1410162116.18130.python-list@python.org> |
| In reply to | #77693 |
On 9/8/2014 1:44 AM, Marko Rauhamaa wrote:
> Chris Angelico <rosuav@gmail.com>:
>
>> The original question was regarding storage - how PEP 393 says that
>> strings will be encoded in memory in any of three ways (Latin-1,
>> UCS-2/UTF-16, or UCS-4/UTF-32). But even in our world, that is not
>> what a string *is*, but only what it is made of.
>
> I'm a bit surprised that kind of CPython implementation detail would go
> into a PEP. I had thought PEPs codified Python independently of CPython.

There are multiple PEPs that are not strictly about Python the language:
meta-PEPs, informational PEPs, and others about the CPython API,
implementation, and distribution issues. 393 is followed by

397 Python launcher for Windows Hammond, v. Löwis

http://legacy.python.org/dev/peps/

The PEP process is a tool for the core development team, and the product
is documentation of decisions and actions thereby.

> But maybe CPython is to Python what England is to the UK: even the
> government is having a hard time making a distinction.

We don't.

--
Terry Jan Reedy
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-09-08 01:04 +1000 |
| Message-ID | <mailman.13850.1410102277.18130.python-list@python.org> |
| In reply to | #77666 |
On Mon, Sep 8, 2014 at 12:52 AM, MRAB <python@mrabarnett.plus.com> wrote:
> I don't think you should be saying that it stores the string in Latin-1
> or UTF-16 because that might suggest that they are encoded. They aren't.

Except that they are. RAM stores bytes [1], so by definition everything
that's in memory is encoded. You can't store a list in memory; what you
store is a set of bits which represent some metadata and a bunch of
pointers. You can't store a non-integer in memory, so you use some kind
of efficient packed system like IEEE 754. You can't even store an
integer without using some kind of encoding, most likely by packing it
into some number of bytes and laying those bytes out either smallest
first or largest first.

So yes, CPython 3.3 stores strings encoded Latin-1, UCS-2 [2], or UCS-4.
The Python string *is* a sequence of characters, but it's *stored* as a
sequence of bytes in one of those encodings. (And other Pythons may not
use the same encodings. MicroPython uses UTF-8 internally, which gives
it *very* different indexing performance.)

ChrisA

[1] On modern systems it stores larger units, probably 64-bit or 128-bit
hunks, but whatever. Same difference.
[2] As Steven says, UTF-16 or UCS-2. I prefer the latter name here, as
it (like Latin-1) is restricted in character set rather than variable
in length. But same thing.
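The three storage widths described above can be observed indirectly. The following is a minimal sketch, assuming CPython 3.3 or newer; `sys.getsizeof` reports implementation-specific figures, so only the ordering of the sizes is meaningful, and the sample strings and dictionary names are chosen here purely for illustration.

```python
import sys

# Three 1000-character strings whose widest code point needs
# 1, 2 and 4 bytes respectively under the flexible string storage.
samples = {
    "latin1": "a" * 1000,           # fits in 1 byte per character
    "ucs2": "\u0100" * 1000,        # needs 2 bytes per character
    "ucs4": "\U00010000" * 1000,    # needs 4 bytes per character
}

# Object size in bytes as reported by the interpreter (CPython-specific).
sizes = {name: sys.getsizeof(text) for name, text in samples.items()}

for name, text in samples.items():
    print(name, "length:", len(text), "bytes in memory:", sizes[name])
```

All three strings have the same length in characters; on CPython only the memory footprint differs. Other implementations may report different sizes or none of this pattern at all.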
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2014-09-07 11:40 -0400 |
| Message-ID | <roy-6B2B20.11402707092014@news.panix.com> |
| In reply to | #77673 |
In article <mailman.13850.1410102277.18130.python-list@python.org>,
Chris Angelico <rosuav@gmail.com> wrote:

> You can't store a list in memory; what you store is a set of bits
> which represent some metadata and a bunch of pointers.

Well, technically, what you store is something which has the right
behavior. If I wrote:

my_huffman_coded_list = [0] * 1000000

I don't know of anything which requires Python to actually generate a
million 0's and store them somewhere (even ignoring interning of
integers). As long as it generated an object (perhaps a subclass of
list) which responded to all of list's methods the same way a real list
would, it could certainly build a more compact representation.
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-09-08 04:00 +1000 |
| Message-ID | <540c9d41$0$24963$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #77674 |
Roy Smith wrote:

> In article <mailman.13850.1410102277.18130.python-list@python.org>,
> Chris Angelico <rosuav@gmail.com> wrote:
>
>> You can't store a list in memory; what you store is a set of bits
>> which represent some metadata and a bunch of pointers.
>
> Well, technically, what you store is something which has the right
> behavior. If I wrote:

No. Chris is right: what you store is bits, because that is how memory
in modern hardware works. Some old Soviet era mainframes used trits,
trinary digits, instead of bits, but that's just a footnote in the
history of computing.

We're talking implementation here, not interface. The implementation of
a list in memory is bits, not trits, nor holograms, fragments of DNA, or
nanometre-sized carbon rods. Some day there may be computers with
implementations not based on bits, but this is not that day.

[Aside: technically, very few if any modern memory chips support direct
access to individual bits, only to words, which these days are usually
8-bit bytes. At the hardware level, the bits may not even be implemented
as individual on/off two-state switches. So technically we should be
talking about bytes rather than bits, since that's the interface which
the memory chip provides. What it does internally is up to the chip
designer.]

> my_huffman_coded_list = [0] * 1000000
>
> I don't know of anything which requires Python to actually generate a
> million 0's and store them somewhere (even ignoring interning of
> integers).

That's a tricky question. There's not a lot of wiggle-room for what a
compliant implementation of Python can do here, but there is some.

The language requires that:

- the expression must generate a genuine built-in list of length exactly
  one million;
- even if you monkey-patch builtins and replace list with something
  else, you still get a genuine list, not the monkey-patched version;
- all million items must refer to the same instance;
- regardless of whether that instance is cached (like 0) or not;
- reading and writing to any index must be O(1);
- although I guess it would be acceptable if that was O(1) amortised.

(Not all of these requirements are documented per se, some are taken
from the reference implementation, CPython, and some from comments made
by Guido, e.g. he has stated that he would consider it unacceptable for
a Python implementation to use a linked list instead of an array for
lists, because of the performance implications.)

Beyond that, an implementation could do anything it likes. If the
implementer can come up with some clever way to have [0]*1000000 use
less memory while not compromising the above, they can do so.

> As long as it generated an object (perhaps a subclass of
> list) which responded to all of list's methods the same way a real list
> would, it could certainly build a more compact representation.

However a subclass of list would not be acceptable.

--
Steven
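The monkey-patching requirement in the list above can be checked with a small experiment. This is only a sketch, assuming Python 3 (where the module is named `builtins`); `FakeList` is a hypothetical stand-in invented for the demonstration, and the real name is restored immediately so nothing else is affected.

```python
import builtins

real_list = builtins.list


class FakeList:
    """Hypothetical stand-in used only to shadow the built-in name."""


# Temporarily replace the name `list` in builtins, then build a list
# with square-bracket syntax, which does not look the name up at all.
builtins.list = FakeList
try:
    displayed = [0] * 5
    displayed_type = type(displayed)
finally:
    builtins.list = real_list  # always restore the real built-in

print(displayed_type is real_list)
```

The bracket syntax and the `*` operator produce a genuine built-in list regardless of what the name `list` currently refers to.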
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-09-08 10:12 +1000 |
| Message-ID | <mailman.13858.1410135142.18130.python-list@python.org> |
| In reply to | #77674 |
On Mon, Sep 8, 2014 at 1:40 AM, Roy Smith <roy@panix.com> wrote:
> Well, technically, what you store is something which has the right
> behavior. If I wrote:
>
> my_huffman_coded_list = [0] * 1000000
>
> I don't know of anything which requires Python to actually generate a
> million 0's and store them somewhere (even ignoring interning of
> integers). As long as it generated an object (perhaps a subclass of
> list) which responded to all of list's methods the same way a real list
> would, it could certainly build a more compact representation.

Steven hinted at it, but I'll say one thing more explicitly here:
There's actually something that requires Python to *not* generate a
million 0 integers. What you get is a million references to the *same*
zero.

>>> another_list = [object()] * 1000000
>>> sum(id(x) for x in another_list)
140287290433648000000
>>> id(another_list[0]) * len(another_list)
140287290433648000000

The two figures are guaranteed to be the same, because these are all
the same object.

But what you're talking about here is an alternative encoding. And it's
definitely possible for different Pythons to encode strings differently.
uPy uses UTF-8 internally, which gives different performance metrics but
guarantees the same semantics. I could imagine someone implementing a
Python interpreter in Pike, and using the Pike string type to store
Python strings (the semantics will all be correct, as it's a Unicode
string; the most notable difference is that Pike strings are guaranteed
to be interned, so all equality comparisons are identity checks). If you
wanted to, I'm sure you could build a Python that uses a dictionary of
words (added to every time you create a string, of course), and actually
represents entire words as short integers, which would mean individual
characters aren't necessarily represented directly. But somehow, you
have to turn the concept of "sequence of Unicode characters" into some
well-defined sequence of bytes, and that's an encoding.

ChrisA
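The "same object" guarantee can also be checked without summing ids. A minimal sketch follows; the variable names are illustrative only, and the last part shows a well-known consequence of the shared reference when the repeated item is mutable.

```python
# Sequence repetition copies the reference, not the object.
shared = [object()] * 1000
distinct = [object() for _ in range(1000)]

shared_ids = {id(x) for x in shared}      # one id, repeated 1000 times
distinct_ids = {id(x) for x in distinct}  # 1000 live objects, 1000 ids

print(len(shared_ids), len(distinct_ids))

# The same rule explains a common surprise with mutable items:
rows = [[]] * 3
rows[0].append(1)   # mutates the single inner list all three slots share
print(rows)
```

A list comprehension is the usual way to get independent objects in each slot.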
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-09-06 22:23 +1000 |
| Message-ID | <mailman.13834.1410006208.18130.python-list@python.org> |
| In reply to | #77636 |
On Sat, Sep 6, 2014 at 10:15 PM, Kurt Mueller <kurt.alfred.mueller@gmail.com> wrote:
> I understand: narrow build is UCS2, wide build is UCS4
> - In a UCS2 build each character of an Unicode string uses 16 Bits and has
> code points from U-0000..U-FFFF (0..65535)
> - In a UCS4 build each character of an Unicode string uses 32 Bits and has
> code points from U-00000000..U-0010FFFF (0..1114111)

Pretty much. Narrow builds are buggy, so as much as possible, you want
to avoid using them. Ideally, use Python 3.3 or newer, where the
distinction doesn't exist - all builds are functionally like wide
builds, with memory usage even better than narrow builds (they'll use 8
bits per character if it's possible).

As a general rule, precompiled Python for Windows is usually a narrow
build, and Python distributions for Linux are usually wide builds. (I
don't know about Mac OS builds.) You can test any Python by checking out
sys.maxunicode - it'll be 65535 on a narrow build, or 1114111 on wide
builds (because that's the maximum codepoint defined by Unicode -
U+10FFFF - as it's the highest number that can be represented in
UTF-16).

ChrisA
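The `sys.maxunicode` check mentioned above can be wrapped in a few lines. This is a sketch only; `describe_build` is a hypothetical helper name, and on Python 3.3 or newer it is expected to report "wide" on every platform.

```python
import sys


def describe_build(maxunicode=sys.maxunicode):
    """Classify an interpreter by its largest supported code point."""
    if maxunicode == 0xFFFF:        # 65535: narrow (UCS-2) build
        return "narrow"
    if maxunicode == 0x10FFFF:      # 1114111: wide build, or Python 3.3+
        return "wide"
    return "unknown"


print(sys.maxunicode, describe_build())

# On a wide build (and on Python 3.3+), a character outside the Basic
# Multilingual Plane counts as one character; a narrow build would
# report 2 here because it stores a surrogate pair.
astral_length = len("\U0001F600")
print(astral_length)
```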
| From | Chris “Kwpolska” Warrick <kwpolska@gmail.com> |
|---|---|
| Date | 2014-09-05 20:25 +0200 |
| Message-ID | <mailman.13803.1409941524.18130.python-list@python.org> |
| In reply to | #77582 |
On Sep 5, 2014 7:57 PM, "Kurt Mueller" <kurt.alfred.mueller@gmail.com> wrote:
> Could someone please explain the following behavior to me:
> Python 2.7.7, MacOS 10.9 Mavericks
>
> >>> import sys
> >>> sys.getdefaultencoding()
> 'ascii'
> >>> [ord(c) for c in 'AÄ']
> [65, 195, 132]
> >>> [ord(c) for c in u'AÄ']
> [65, 196]
>
> My obviously wrong understanding:
> 'AÄ' in 'ascii' are two characters
> one with ord A=65 and
> one with ord Ä=196 ISO8859-1 <depends on code table>
> --> why [65, 195, 132]
> u'AÄ' is an Unicode string
> --> why [65, 196]
>
> It is just the other way round as I would expect.

Basically, the first string is just a bunch of bytes, as provided by
your terminal - which sounds like UTF-8 (perfectly logical in 2014). The
second one is converted into a real Unicode representation. The
codepoint for Ä is U+00C4 (196 decimal). It's just a coincidence that it
also matches latin1 aka ISO 8859-1 as Unicode starts with all 256 latin1
codepoints. Please kindly forget encodings other than UTF-8.

BTW: ASCII covers only the first 128 bytes.

--
Chris “Kwpolska” Warrick <http://chriswarrick.com/>
Sent from my Galaxy S3.
| From | Kurt Mueller <kurt.alfred.mueller@gmail.com> |
|---|---|
| Date | 2014-09-05 21:16 +0200 |
| Message-ID | <mailman.13806.1409944624.18130.python-list@python.org> |
| In reply to | #77582 |
On 05.09.2014 at 20:25, Chris “Kwpolska” Warrick <kwpolska@gmail.com> wrote:
> On Sep 5, 2014 7:57 PM, "Kurt Mueller" <kurt.alfred.mueller@gmail.com> wrote:
> > Could someone please explain the following behavior to me:
> > Python 2.7.7, MacOS 10.9 Mavericks
> >
> > >>> import sys
> > >>> sys.getdefaultencoding()
> > 'ascii'
> > >>> [ord(c) for c in 'AÄ']
> > [65, 195, 132]
> > >>> [ord(c) for c in u'AÄ']
> > [65, 196]
> >
> > My obviously wrong understanding:
> > 'AÄ' in 'ascii' are two characters
> > one with ord A=65 and
> > one with ord Ä=196 ISO8859-1 <depends on code table>
> > --> why [65, 195, 132]
> > u'AÄ' is an Unicode string
> > --> why [65, 196]
> >
> > It is just the other way round as I would expect.
>
> Basically, the first string is just a bunch of bytes, as provided by
> your terminal - which sounds like UTF-8 (perfectly logical in 2014). The
> second one is converted into a real Unicode representation. The
> codepoint for Ä is U+00C4 (196 decimal). It's just a coincidence that it
> also matches latin1 aka ISO 8859-1 as Unicode starts with all 256 latin1
> codepoints. Please kindly forget encodings other than UTF-8.

So:
'AÄ' is an UTF-8 string represented by 3 bytes:
A -> 41 -> 65 first byte decimal
Ä -> c384 -> 195 and 132 second and third byte decimal

u'AÄ' is an Unicode string represented by 2 bytes?:
A -> U+0041 -> 65 first byte decimal, 00 is omitted or not yielded by ord()?
Ä -> U+00C4 -> 196 second byte decimal, 00 is omitted or not yielded by ord()?

> BTW: ASCII covers only the first 128 bytes.

ACK

--
Kurt Mueller, kurt.alfred.mueller@gmail.com
| From | Kurt Mueller <kurt.alfred.mueller@gmail.com> |
|---|---|
| Date | 2014-09-05 22:41 +0200 |
| Message-ID | <mailman.13809.1409949686.18130.python-list@python.org> |
| In reply to | #77582 |
On 05.09.2014 at 21:16, Kurt Mueller <kurt.alfred.mueller@gmail.com> wrote:
> On 05.09.2014 at 20:25, Chris “Kwpolska” Warrick <kwpolska@gmail.com> wrote:
>> On Sep 5, 2014 7:57 PM, "Kurt Mueller" <kurt.alfred.mueller@gmail.com> wrote:
>>> Could someone please explain the following behavior to me:
>>> Python 2.7.7, MacOS 10.9 Mavericks
>>>
>>> >>> import sys
>>> >>> sys.getdefaultencoding()
>>> 'ascii'
>>> >>> [ord(c) for c in 'AÄ']
>>> [65, 195, 132]
>>> >>> [ord(c) for c in u'AÄ']
>>> [65, 196]
>>>
>>> My obviously wrong understanding:
>>> 'AÄ' in 'ascii' are two characters
>>> one with ord A=65 and
>>> one with ord Ä=196 ISO8859-1 <depends on code table>
>>> --> why [65, 195, 132]
>>> u'AÄ' is an Unicode string
>>> --> why [65, 196]
>>>
>>> It is just the other way round as I would expect.
>>
>> Basically, the first string is just a bunch of bytes, as provided by
>> your terminal - which sounds like UTF-8 (perfectly logical in 2014). The
>> second one is converted into a real Unicode representation. The
>> codepoint for Ä is U+00C4 (196 decimal). It's just a coincidence that it
>> also matches latin1 aka ISO 8859-1 as Unicode starts with all 256 latin1
>> codepoints. Please kindly forget encodings other than UTF-8.
>
> So:
> 'AÄ' is an UTF-8 string represented by 3 bytes:
> A -> 41 -> 65 first byte decimal
> Ä -> c384 -> 195 and 132 second and third byte decimal
>
> u'AÄ' is an Unicode string represented by 2 bytes?:
> A -> U+0041 -> 65 first byte decimal, 00 is omitted or not yielded by ord()?
> Ä -> U+00C4 -> 196 second byte decimal, 00 is omitted or not yielded by ord()?

After reading the ord() manual, the second case should read:

u'AÄ' is an Unicode string represented by 2 unicode characters:
If Python was built with UCS2 Unicode, then the character's code point
must be in the range [0..65535, 16 bits, U-0000..U-FFFF]
A -> U+0041 -> 65 first character decimal (code point)
Ä -> U+00C4 -> 196 second character decimal (code point)

Am I right now?

-- Kurt Mueller, kurt.alfred.mueller@gmail.com
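The difference between the two results discussed in this sub-thread can be reproduced in Python 3, where the two kinds of string are separate types. A minimal sketch, assuming Python 3 (the original session was Python 2.7, where a plain `'AÄ'` literal holds whatever bytes the terminal sent); the variable names are illustrative.

```python
text = "A\u00c4"  # the two-character string 'AÄ'

# Code points: one integer per character, independent of any encoding.
code_points = [ord(c) for c in text]

# Bytes: depend entirely on the encoding chosen.
utf8_bytes = list(text.encode("utf-8"))      # Ä becomes two bytes
latin1_bytes = list(text.encode("latin-1"))  # Ä becomes one byte

print(code_points)
print(utf8_bytes)
print(latin1_bytes)

# Decoding with the matching codec restores the original text.
round_trip = bytes(utf8_bytes).decode("utf-8")
```

The Latin-1 bytes match the code points only because the first 256 Unicode code points were taken over from Latin-1.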
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-09-05 10:12 +1000 |
| Message-ID | <mailman.13777.1409875971.18130.python-list@python.org> |
| In reply to | #77483 |
On Fri, Sep 5, 2014 at 7:06 AM, Joshua Landau <joshua@landau.ws> wrote:
> On 3 September 2014 15:48, <cl@isbd.net> wrote:
>> Peter Otten <__peter__@web.de> wrote:
>>>
>>> >>> [ord(c) for c in "This is a string"]
>>> [84, 104, 105, 115, 32, 105, 115, 32, 97, 32, 115, 116, 114, 105, 110, 103]
>>>
>>> There are other ways, but you have to describe the use case and your Python
>>> version for us to recommend the most appropriate.
>>>
>> That looks OK to me. It's just for outputting a string to the block
>> write command in python-smbus which expects an integer array.
>
> Just be careful about Unicode characters.

If it's a Unicode string (which is the default in Python 3), all
Unicode characters will work correctly. If it's a byte string (the
default in Python 2), then you can't actually have any Unicode
characters in it at all, you have bytes; Py2 lets you be a bit sloppy
with the ASCII range, but technically, you still have bytes, not
characters..

ChrisA
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2014-09-04 20:09 -0600 |
| Message-ID | <mailman.13778.1409882991.18130.python-list@python.org> |
| In reply to | #77483 |
On Thu, Sep 4, 2014 at 6:12 PM, Chris Angelico <rosuav@gmail.com> wrote:
> If it's a Unicode string (which is the default in Python 3), all
> Unicode characters will work correctly.

Assuming the library that needs this is expecting codepoints and will
accept integers greater than 255.

> If it's a byte string (the
> default in Python 2), then you can't actually have any Unicode
> characters in it at all, you have bytes; Py2 lets you be a bit sloppy
> with the ASCII range, but technically, you still have bytes, not
> characters..

In that case the library will almost certainly accept it, but could be
expecting a different encoding.
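One way to sidestep both concerns when the consumer wants byte values is to encode explicitly before building the integer list. This is a sketch, assuming Python 3; `to_byte_values` is a hypothetical helper, the default of ASCII is an assumption about what the receiving device expects, and no python-smbus call is shown.

```python
def to_byte_values(text, encoding="ascii"):
    """Encode text explicitly and return integers in range(256)."""
    return list(text.encode(encoding))


ascii_values = to_byte_values("This is a string")
print(ascii_values)

# A non-ASCII character fails loudly under the default instead of
# silently producing a value above 255 or bytes in the wrong encoding.
try:
    to_byte_values("A\u00c4")
    rejected = False
except UnicodeEncodeError:
    rejected = True

# Choosing an encoding on purpose makes the byte layout explicit.
utf8_values = to_byte_values("A\u00c4", "utf-8")
```

Every value in the result is guaranteed to fit in one byte, whichever encoding is chosen.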
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-09-05 12:15 +1000 |
| Message-ID | <mailman.13779.1409883321.18130.python-list@python.org> |
| In reply to | #77483 |
On Fri, Sep 5, 2014 at 12:09 PM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> On Thu, Sep 4, 2014 at 6:12 PM, Chris Angelico <rosuav@gmail.com> wrote:
>> If it's a Unicode string (which is the default in Python 3), all
>> Unicode characters will work correctly.
>
> Assuming the library that needs this is expecting codepoints and will
> accept integers greater than 255.
They're still valid integers. It's just that someone might not know
how to work with them. Everyone has limits - I don't think repr()
would like to be fed Graham's Number, for instance, but we still say
that it accepts integers :)
>> If it's a byte string (the
>> default in Python 2), then you can't actually have any Unicode
>> characters in it at all, you have bytes; Py2 lets you be a bit sloppy
>> with the ASCII range, but technically, you still have bytes, not
>> characters..
>
> In that case the library will almost certainly accept it, but could be
> expecting a different encoding.
Yeah. Either way, the problem isn't "be careful about Unicode
characters". One option has Unicode characters, the other doesn't, and
you need to know which one it is.
I just don't like people talking about "Unicode characters" being
somehow different from "normal text" or something, and being something
that you need to be careful of. It's not that there are some
characters that behave nicely, and then other ones ("Unicode" ones)
that don't.
ChrisA
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-09-06 14:27 +1000 |
| Message-ID | <540a8d1a$0$29972$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #77567 |
Chris Angelico wrote:
> On Fri, Sep 5, 2014 at 12:09 PM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
>> On Thu, Sep 4, 2014 at 6:12 PM, Chris Angelico <rosuav@gmail.com> wrote:
>>> If it's a Unicode string (which is the default in Python 3), all
>>> Unicode characters will work correctly.
>>
>> Assuming the library that needs this is expecting codepoints and will
>> accept integers greater than 255.
>
> They're still valid integers. It's just that someone might not know
> how to work with them. Everyone has limits - I don't think repr()
> would like to be fed Graham's Number, for instance, but we still say
> that it accepts integers :)
If you can fit Graham's Number into memory, repr() will happily deal with
it. Although, it might take a while to print out...
[...]
> I just don't like people talking about "Unicode characters" being
> somehow different from "normal text" or something, and being something
> that you need to be careful of. It's not that there are some
> characters that behave nicely, and then other ones ("Unicode" ones)
> that don't.
"Behave nicely" depends on what behaviour you're expecting.
There is a sense in which Unicode is different from ASCII text. ASCII is a 7
bit character set. In principle, you could have different implementations
of ASCII but in practice it's been so long since any machine you're likely
to come across uses anything but exactly a single 8-bit byte for each ASCII
character that we might as well say that ASCII has a single implementation:
* 1 byte code units, fixed width characters
That is, every character takes exactly one 8-bit byte.
(Reminder: "byte" does not necessarily mean 8 bits.)
Unicode, on the other hand, has *at least* nine different implementations
which you are *likely* to come across:
* UTF-8 has 1-byte code units, variable width characters: every character
takes between 1 and 4 bytes;
* UTF-8 with a so-called "Byte Order Mark" at the beginning of the file;
* UTF-16-BE has 2-byte code units, variable width characters: every
character takes either 2 or 4 bytes;
* UTF-16-LE is the same, but the bytes are in opposite order;
* UTF-16 with a Byte Order Mark at the beginning of the file;
* UTF-32-BE has 4-byte code units, fixed width characters; every character
takes exactly 4 bytes;
* UTF-32-LE is the same, but the bytes are in opposite order;
* UTF-32 with a Byte Order Mark at the beginning of the file;
* UCS-2 is a subset of Unicode with 2-byte code units, fixed width
characters; every character takes exactly 2 bytes (UCS-2 is effectively
UTF-16-BE for characters in the Basic Multilingual Plane).
Plus various more obscure or exotic encodings.
So, while it is not *strictly* correct to say that ASCII character 'A' is
always the eight bits 01000001, the exceptions are so rare that there might
as well not be any. But the Unicode character 'A' could be:
01000001
01000001 00000000
00000000 01000001
01000001 00000000 00000000 00000000
00000000 00000000 00000000 01000001
and possibly more.
--
Steven
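The five bit patterns listed above for 'A' can be reproduced with the codecs that ship with Python. A minimal sketch, assuming Python 3; `bits` is a hypothetical helper, and the endian-specific codec names are used so that no Byte Order Mark is emitted.

```python
def bits(data):
    """Render a bytes object as space-separated 8-bit groups."""
    return " ".join(format(byte, "08b") for byte in data)


# Endian-specific codecs write no Byte Order Mark.
encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]
table = {name: bits("A".encode(name)) for name in encodings}

for name in encodings:
    print(name.ljust(10), table[name])
```

Using the plain "utf-16" or "utf-32" codec names instead would prepend a Byte Order Mark in the platform's native byte order, adding further variants to the list.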
| From | obedrios@gmail.com |
|---|---|
| Date | 2014-09-03 07:30 -0700 |
| Message-ID | <aea5e00d-1490-4e8b-b04c-6be052fea213@googlegroups.com> |
| In reply to | #77479 |
On Wednesday, 3 September 2014 05:27:29 UTC-7, c...@isbd.net wrote:
> I know I can get a list of the characters in a string by simply doing:-
>
> listOfCharacters = list("This is a string")
>
> ... but how do I get a list of integers?
>
> --
> Chris Green

You can apply either a map function or a list comprehension, as follows:
Using Map:
>>> list(map(ord, listOfCharacters))
[84, 104, 105, 115, 32, 105, 115, 32, 97, 32, 115, 116, 114, 105, 110, 103]
Using List Comprehension:
>>> [ord(n) for n in listOfCharacters]
[84, 104, 105, 115, 32, 105, 115, 32, 97, 32, 115, 116, 114, 105, 110, 103]
Very Best Regards