| Started by | Paul Sokolovsky <pmiscml@gmail.com> |
|---|---|
| First post | 2014-06-04 00:41 +0300 |
| Last post | 2014-06-04 17:10 +1000 |
| Articles | 20 on this page of 35 — 15 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: Micro Python -- a lean and efficient implementation of Python 3 Paul Sokolovsky <pmiscml@gmail.com> - 2014-06-04 00:41 +0300
Re: Micro Python -- a lean and efficient implementation of Python 3 Rustom Mody <rustompmody@gmail.com> - 2014-06-03 20:37 -0700
Re: Micro Python -- a lean and efficient implementation of Python 3 Chris Angelico <rosuav@gmail.com> - 2014-06-04 13:52 +1000
Re: Micro Python -- a lean and efficient implementation of Python 3 Rustom Mody <rustompmody@gmail.com> - 2014-06-03 21:40 -0700
Re: Micro Python -- a lean and efficient implementation of Python 3 Ian Kelly <ian.g.kelly@gmail.com> - 2014-06-03 23:02 -0600
Re: Micro Python -- a lean and efficient implementation of Python 3 Chris Angelico <rosuav@gmail.com> - 2014-06-04 17:16 +1000
Re: Micro Python -- a lean and efficient implementation of Python 3 Steven D'Aprano <steve@pearwood.info> - 2014-06-04 07:42 +0000
Re: Micro Python -- a lean and efficient implementation of Python 3 Paul Rubin <no.email@nospam.invalid> - 2014-06-04 00:58 -0700
Re: Micro Python -- a lean and efficient implementation of Python 3 Robin Becker <robin@reportlab.com> - 2014-06-04 11:06 +0100
Re: Micro Python -- a lean and efficient implementation of Python 3 Tim Chase <python.list@tim.thechases.com> - 2014-06-04 06:01 -0500
Re: Micro Python -- a lean and efficient implementation of Python 3 Marko Rauhamaa <marko@pacujo.net> - 2014-06-04 14:57 +0300
Re: Micro Python -- a lean and efficient implementation of Python 3 Tim Chase <python.list@tim.thechases.com> - 2014-06-04 07:25 -0500
Re: Micro Python -- a lean and efficient implementation of Python 3 Paul Rubin <no.email@nospam.invalid> - 2014-06-04 11:25 -0700
Re: Micro Python -- a lean and efficient implementation of Python 3 Robin Becker <robin@reportlab.com> - 2014-06-04 12:53 +0100
Re: Micro Python -- a lean and efficient implementation of Python 3 Marko Rauhamaa <marko@pacujo.net> - 2014-06-04 15:17 +0300
Re: Micro Python -- a lean and efficient implementation of Python 3 Robin Becker <robin@reportlab.com> - 2014-06-04 13:31 +0100
Re: Micro Python -- a lean and efficient implementation of Python 3 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-04 13:51 +0000
Re: Micro Python -- a lean and efficient implementation of Python 3 wxjmfauth@gmail.com - 2014-06-10 00:32 -0700
Re: Micro Python -- a lean and efficient implementation of Python 3 wxjmfauth@gmail.com - 2014-06-10 02:13 -0700
Re: Micro Python -- a lean and efficient implementation of Python 3 Tim Chase <python.list@tim.thechases.com> - 2014-06-04 07:21 -0500
Re: Micro Python -- a lean and efficient implementation of Python 3 Travis Griggs <travisgriggs@gmail.com> - 2014-06-06 09:59 -0700
Re: Micro Python -- a lean and efficient implementation of Python 3 Roy Smith <roy@panix.com> - 2014-06-06 13:29 -0400
Re: Micro Python -- a lean and efficient implementation of Python 3 Tim Chase <python.list@tim.thechases.com> - 2014-06-06 21:20 -0500
Re: Micro Python -- a lean and efficient implementation of Python 3 wxjmfauth@gmail.com - 2014-06-10 12:27 -0700
Re: Micro Python -- a lean and efficient implementation of Python 3 Chris Angelico <rosuav@gmail.com> - 2014-06-04 17:20 +1000
Re: Micro Python -- a lean and efficient implementation of Python 3 Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> - 2014-06-04 10:00 +0200
Re: Micro Python -- a lean and efficient implementation of Python 3 Roy Smith <roy@panix.com> - 2014-06-04 14:42 -0400
Re: Micro Python -- a lean and efficient implementation of Python 3 Rustom Mody <rustompmody@gmail.com> - 2014-06-04 19:06 -0700
Re: Micro Python -- a lean and efficient implementation of Python 3 Roy Smith <roy@panix.com> - 2014-06-05 09:59 -0400
Re: Micro Python -- a lean and efficient implementation of Python 3 Chris Angelico <rosuav@gmail.com> - 2014-06-06 01:33 +1000
Re: Micro Python -- a lean and efficient implementation of Python 3 Steven D'Aprano <steve@pearwood.info> - 2014-06-04 05:20 +0000
Re: Micro Python -- a lean and efficient implementation of Python 3 Rustom Mody <rustompmody@gmail.com> - 2014-06-03 22:36 -0700
Re: Micro Python -- a lean and efficient implementation of Python 3 Ian Kelly <ian.g.kelly@gmail.com> - 2014-06-03 23:55 -0600
Re: Micro Python -- a lean and efficient implementation of Python 3 Terry Reedy <tjreedy@udel.edu> - 2014-06-04 03:00 -0400
Re: Micro Python -- a lean and efficient implementation of Python 3 Chris Angelico <rosuav@gmail.com> - 2014-06-04 17:10 +1000
| From | Paul Sokolovsky <pmiscml@gmail.com> |
|---|---|
| Date | 2014-06-04 00:41 +0300 |
| Subject | Re: Micro Python -- a lean and efficient implementation of Python 3 |
| Message-ID | <mailman.10646.1401831682.18130.python-list@python.org> |
Hello,
On Wed, 4 Jun 2014 03:08:57 +1000
Chris Angelico <rosuav@gmail.com> wrote:
[]
> With that encouragement, I just cloned your repo and built it on amd64
> Debian Wheezy. Works just fine! Except... I've just found one fairly
> major problem with your support of Python 3.x syntax. Your str type is
> documented as not supporting Unicode. Is that a current flaw that
> you're planning to remove, or a design limitation? Either way, I'm a
> bit dubious about a purported version 1 that doesn't do one of the
> things that Py3 is especially good at - matched by very few languages
> in its encouragement of best practice with Unicode support.
I should start by saying that it's MicroPython that made me look at
Python3. So for me, it has already done a lot of good by getting me out
from under the rock, so now instead of "at my job, we use python 2.x" I
may report "at my job, we don't wait until our distro kicks us in the
ass, and add 'from __future__ import print_function' whenever we touch
some code".
With that in mind, I, like many others, think that forcing Unicode bloat
upon people by default is the most controversial feature of Python3.
The reason is that you can go a very long way dealing with the languages
of the people of the world by just treating strings as consisting of
8-bit data. I'd say that's enough for 90% of applications. Unicode is
needed only if one needs to deal with multiple languages *at the same
time*, which is fairly rare (the remaining 10% of apps).
And please keep in mind that MicroPython was originally intended for (and
should remain scalable down to) an MCU. Unicode is needed there even
less, and there are even fewer resources to support Unicode just because.
>
> What is your str type actually able to support? It seems to store
> non-ASCII bytes in it, which I presume are supposed to represent the
> rest of Latin-1, but I wasn't able to print them out:
There's a work in progress on documenting the differences between
CPython and MicroPython at
https://github.com/micropython/micropython/wiki/Differences
It gives the following account of this:
"No unicode support is actually implemented. Python3 calls for strict
difference between str and bytes data types (unlike Python2, which has
neutral unified data type for strings and binary data, and separates
out unicode data type). MicroPython faithfully implements str/bytes
separation, but currently, underlying str implementation is the same as
bytes. This means strings in MicroPython are not unicode, but 8-bit
characters (fully binary-clean)."
>
> Micro Python v1.0.1-144-gb294a7e on 2014-06-04; UNIX version
> >>> print("asdf\xfdqwer")
>
> Python 3.5.0a0 (default:6a0def54c63d, Mar 26 2014, 01:11:09)
> [GCC 4.7.2] on linux
> >>> print("asdf\xfdqwer")
> asdfýqwer
>
> In fact, printing seems to work with bytes:
>
> >>> print("asdf\xc3\xbdqwer")
> asdfýqwer
>
> (my terminal uses UTF-8, this is the UTF-8 encoding of the above
> string)
>
> I would strongly recommend either implementing all of PEP 393, or at
> least making it very clear that this pretends everything is bytes -
> and possibly disallowing any codepoint >127 in any string, which will
> at least mean you're safe on all ASCII-compatible encodings.
MicroPython is not the first "tiny" Python implementation. What sets
MicroPython apart is that it's neither its aim nor its motto to be a
subset of the language. And yet, it's not a CPython rewrite either. So,
while Unicode support is surely possible, it's unlikely to be done as
"all of PEPxxx". If you ask me, I'd personally envision it being
implemented as UTF-8 (in this regard I agree with, or am influenced by,
http://lucumr.pocoo.org/2014/1/9/ucs-vs-utf8/). But I don't have plans
to work on Unicode any time soon - the applications I envision for
MicroPython so far fit in those 90% that live happily without Unicode.
But generally, there's no strict roadmap for MicroPython features.
While the core of the language (parser, compiler, VM) is developed by
Damien, many other features have already been contributed by the
community (the project went open-source at the beginning of the year).
So, if someone wants to see Unicode support badly enough to provide
patches, they will gladly be accepted. The only thing we have
established is that we want to be able to scale down, and thus almost
all features should be configurable.
>
> ChrisA
> --
> https://mail.python.org/mailman/listinfo/python-list
--
Best regards,
Paul mailto:pmiscml@gmail.com
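The difference between the two `print` calls quoted above can be reproduced in CPython 3. This is a minimal sketch, assuming CPython 3 semantics (str holds code points, bytes holds raw octets); it does not run MicroPython, and the variable names are made up for illustration.

```python
# Assumes CPython 3, where str is a sequence of code points.
one_codepoint = "asdf\xfdqwer"         # U+00FD is a single character
utf8_bytes = one_codepoint.encode("utf-8")

# In UTF-8, U+00FD becomes the two bytes C3 BD.
print(utf8_bytes)                       # b'asdf\xc3\xbdqwer'

# Writing those two bytes as \x escapes inside a str gives two code points
# (U+00C3 and U+00BD), not one.
two_codepoints = "asdf\xc3\xbdqwer"
print(len(one_codepoint), len(two_codepoints))    # 9 10

# An 8-bit implementation that passes each value through as one byte would
# emit exactly the UTF-8 encoding, which a UTF-8 terminal renders as one glyph.
raw_passthrough = two_codepoints.encode("latin-1")
print(raw_passthrough == utf8_bytes)    # True
```

So a byte-transparent str appears to work on a UTF-8 terminal only when the source already spells out the encoded bytes.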
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-06-03 20:37 -0700 |
| Message-ID | <44acd692-5dcd-4e5f-8238-7fbe0de4db2a@googlegroups.com> |
| In reply to | #72553 |
On Wednesday, June 4, 2014 3:11:12 AM UTC+5:30, Paul Sokolovsky wrote:
> With that in mind, I, like many others, think that forcing Unicode bloat
> upon people by default is the most controversial feature of Python3.
> The reason is that you can go a very long way dealing with the languages
> of the people of the world by just treating strings as consisting of
> 8-bit data. I'd say that's enough for 90% of applications. Unicode is
> needed only if one needs to deal with multiple languages *at the same
> time*, which is fairly rare (the remaining 10% of apps).
> And please keep in mind that MicroPython was originally intended for (and
> should remain scalable down to) an MCU. Unicode is needed there even
> less, and there are even fewer resources to support Unicode just because.
At some time (when jmf was making more intelligible noises) I had
suggested that the choice between 1/2/4 byte strings that happens at
runtime in python3's FSR can be made at python-start time with a
command-line switch.
There are many combinations here; here is one in more detail:
Instead of having one (FSR) string engine, you have (up to) 4
- a pure 1 byte (ASCII)
- a pure 2 byte (BMP) with decode-failures for out-of-ranges
- a pure 4 byte -- everything UTF-32
- FSR dynamic switching at runtime (with massive moping from the world's jmfs)
The point is that only one of these engines would be brought into memory
based on command-line/config options.
Some more personal thoughts (that may be quite ill-informed!):
1. I regard myself as a unicode ignoramus+enthusiast. The world will be
a better place if unicode is more pervasive.
See http://blog.languager.org/2014/04/unicoded-python.html
As it happens I am also a computer scientist -- I understand that in
contexts where anything other than 8-bit chars is unacceptably
inefficient, unicode-bloat may be a real thing.
2. My casual/cursory reading of the contents of the SMP-planes
suggests that the stuff there is things like
- Egyptian hieroglyphics
- mahjong characters
- ancient Greek musical symbols
- alchemical symbols etc etc.
IOW from pov of a universally acceptable character set this is mostly
rubbish
And so a pure BMP-supporting implementation may be a reasonable
compromise. [As long as no surrogate-pairs are there]
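For reference, the 1/2/4-byte choice that the FSR makes per string can be observed from Python itself. This is a rough sketch, assuming CPython 3.3 or later (PEP 393). `per_char_bytes` is a helper name invented here, and `sys.getsizeof` figures are an implementation detail that other Python implementations need not share.

```python
import sys

def per_char_bytes(ch: str, n: int = 1000) -> int:
    """Estimate storage per character by comparing two strings of different length."""
    # Both strings have the same fixed header, so the difference is payload only.
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

samples = {
    "ASCII 'a'": "a",
    "Latin-1 U+00FD": "\xfd",
    "BMP U+20AC": "\u20ac",
    "astral U+1F600": "\U0001F600",
}
widths = {label: per_char_bytes(ch) for label, ch in samples.items()}
for label, width in widths.items():
    print(f"{label}: {width} byte(s) per character")
```

On CPython this is expected to report 1, 1, 2 and 4 bytes respectively, which are the widths a start-time switch would fix in advance.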
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-06-04 13:52 +1000 |
| Message-ID | <mailman.10673.1401853976.18130.python-list@python.org> |
| In reply to | #72582 |
On Wed, Jun 4, 2014 at 1:37 PM, Rustom Mody <rustompmody@gmail.com> wrote:
> 2. My casual/cursory reading of the contents of the SMP-planes
> suggests that the stuff there is things like
> - Egyptian hieroglyphics
> - mahjong characters
> - ancient Greek musical symbols
> - alchemical symbols etc etc.
>
> IOW from pov of a universally acceptable character set this is mostly
> rubbish
>
> And so a pure BMP-supporting implementation may be a reasonable
> compromise. [As long as no surrogate-pairs are there]
Not if you're working on the internet. There are several critical
groups of characters that aren't in the BMP, such as:
1) Most or all Chinese and Japanese characters
2) Heaps of emoticons and fancy letters
3) Mathematical symbols
You can't ignore those. You might be able to say "Well, my program
will run slower if you throw these at it", but if you're going down
that route, you probably want the full FSR and the advantages it
confers on ASCII and Latin-1 strings. Binding your program to BMP-only
is nearly as dangerous as binding it to ASCII-only; potentially worse,
because you can run an awful lot of artificial tests without
remembering to stick in some astral characters.
It's not rubbish. It's important stuff that you need to deal with.
ChrisA
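Whether a given character sits in the BMP can be checked directly. This is a small sketch, assuming CPython 3.3+; the `plane` helper and the sample characters are chosen here for illustration and are not taken from the thread.

```python
import unicodedata

def plane(ch: str) -> int:
    """Return the Unicode plane number of a single character (0 is the BMP)."""
    return ord(ch) >> 16

# One CJK ideograph from the BMP, one from CJK Extension B, an emoticon,
# a mathematical alphanumeric symbol, and a BMP mathematical operator.
samples = ["\u4e2d", "\U00020000", "\U0001F600", "\U0001D400", "\u2211"]

for ch in samples:
    name = unicodedata.name(ch, "<no name in this Unicode database>")
    utf16_units = len(ch.encode("utf-16-le")) // 2   # 2 means a surrogate pair
    print(f"U+{ord(ch):04X} plane={plane(ch)} utf16_units={utf16_units} {name}")
```

Characters with `plane` greater than 0 are the ones a pure 2-byte engine would have to reject or split into surrogate pairs.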
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-06-03 21:40 -0700 |
| Message-ID | <c04434ce-cbc4-49ab-b312-24f1631dd894@googlegroups.com> |
| In reply to | #72583 |
On Wednesday, June 4, 2014 9:22:54 AM UTC+5:30, Chris Angelico wrote:
> On Wed, Jun 4, 2014 at 1:37 PM, Rustom Mody wrote:
> > And so a pure BMP-supporting implementation may be a reasonable
> > compromise. [As long as no surrogate-pairs are there]
> Not if you're working on the internet. There are several critical
> groups of characters that aren't in the BMP, such as:
Of course. But what has the internet to do with micropython?
This is their stated goal:
| Micro Python is a lean and fast implementation of the Python
| programming language (python.org) that is optimised to run on a
| microcontroller.
> 1) Most or all Chinese and Japanese characters
Don't know how you count 'most'
| One possible rationale is the desire to limit the size of the full
| Unicode character set, where CJK characters as represented by discrete
| ideograms may approach or exceed 100,000 (while those required for
| ordinary literacy in any language are probably under 3,000). Version 1
| of Unicode was designed to fit into 16 bits and only 20,940 characters
| (32%) out of the possible 65,536 were reserved for these CJK Unified
| Ideographs. Later Unicode has been extended to 21 bits allowing many
| more CJK characters (75,960 are assigned, with room for more).
| From http://en.wikipedia.org/wiki/Han_unification
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2014-06-03 23:02 -0600 |
| Message-ID | <mailman.10677.1401858199.18130.python-list@python.org> |
| In reply to | #72588 |
On Tue, Jun 3, 2014 at 10:40 PM, Rustom Mody <rustompmody@gmail.com> wrote:
>> 1) Most or all Chinese and Japanese characters
>
> Don't know how you count 'most'
>
> | One possible rationale is the desire to limit the size of the full
> | Unicode character set, where CJK characters as represented by discrete
> | ideograms may approach or exceed 100,000 (while those required for
> | ordinary literacy in any language are probably under 3,000). Version 1
> | of Unicode was designed to fit into 16 bits and only 20,940 characters
> | (32%) out of the possible 65,536 were reserved for these CJK Unified
> | Ideographs. Later Unicode has been extended to 21 bits allowing many
> | more CJK characters (75,960 are assigned, with room for more).
>
> | From http://en.wikipedia.org/wiki/Han_unification
So there are 20,940 CJK characters in the BMP, and approximately 55,000
more in the SIP. I'd count 55,000 out of 75,960 as "most".
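The arithmetic behind "most" can be spelled out as below, using only the two figures from the quoted Wikipedia text. Note that this counts assigned code points, which says nothing about how often each character is used in practice.

```python
total_cjk = 75_960   # assigned CJK ideographs, per the quoted text
bmp_cjk = 20_940     # those reserved within the 16-bit BMP, per the quoted text

outside_bmp = total_cjk - bmp_cjk
share_outside = outside_bmp / total_cjk

print(outside_bmp)               # 55020
print(f"{share_outside:.1%}")    # 72.4%
```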
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-06-04 17:16 +1000 |
| Message-ID | <mailman.10684.1401866176.18130.python-list@python.org> |
| In reply to | #72588 |
On Wed, Jun 4, 2014 at 2:40 PM, Rustom Mody <rustompmody@gmail.com> wrote:
> On Wednesday, June 4, 2014 9:22:54 AM UTC+5:30, Chris Angelico wrote:
>> On Wed, Jun 4, 2014 at 1:37 PM, Rustom Mody wrote:
>> > And so a pure BMP-supporting implementation may be a reasonable
>> > compromise. [As long as no surrogate-pairs are there]
>
>> Not if you're working on the internet. There are several critical
>> groups of characters that aren't in the BMP, such as:
>
> Of course. But what has the internet to do with micropython?
Earlier you said:
> IOW from pov of a universally acceptable character set this is mostly
> rubbish
"Universally acceptable character set" and microcontrollers may well
not meet, but if you're talking about universality, you need Unicode.
It's that simple.
Maybe there's a use-case for a microcontroller that works in
ISO-8859-5 natively, thus using only eight bits per character, but
even if there is, I would expect a Python implementation on it to
expose Unicode codepoints in its strings. (Most of the time you won't
even be aware of the exact codepoint values. It's only when you put
\xNN or \uNNNN or \U000NNNNN escapes into your strings, or explicitly
use ord/chr or equivalent, that it'd make a difference.)
The point is not that you might be able to get away with sticking your
head in the sand and wishing Unicode would just go away. Even if you
can, it's not something Python 3 can ever do. And I don't think
anybody can, anyway. If your device is big enough to hold Python, it
should be big enough to handle Unicode; and then you don't have to say
"Oh, sorry rest-of-the-world, this only works in English... and only a
subset of English... and stuff".
ChrisA
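The places where code point values become visible, as listed in the parenthesis above, can be shown in a few lines. This sketch assumes CPython 3.3+ behaviour; the variable names are illustrative only.

```python
# Three escape spellings and chr() all name the same code point, U+00E9.
e_acute = "\xe9"
same = [e_acute == "\u00e9", e_acute == "\U000000e9", e_acute == chr(0xE9)]
print(same)                     # [True, True, True]
print(hex(ord(e_acute)))        # 0xe9

# A code point above U+FFFF needs the eight-digit \U form and is still
# one character as far as len() and ord() are concerned.
astral = "\U0001F600"
print(len(astral), hex(ord(astral)))   # 1 0x1f600
```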
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2014-06-04 07:42 +0000 |
| Message-ID | <538ecdef$0$11109$c3e8da3@news.astraweb.com> |
| In reply to | #72606 |
On Wed, 04 Jun 2014 17:16:13 +1000, Chris Angelico wrote:
> On Wed, Jun 4, 2014 at 2:40 PM, Rustom Mody <rustompmody@gmail.com>
> wrote:
>> On Wednesday, June 4, 2014 9:22:54 AM UTC+5:30, Chris Angelico wrote:
>>> On Wed, Jun 4, 2014 at 1:37 PM, Rustom Mody wrote:
>>> > And so a pure BMP-supporting implementation may be a reasonable
>>> > compromise. [As long as no surrogate-pairs are there]
>>
>>> Not if you're working on the internet. There are several critical
>>> groups of characters that aren't in the BMP, such as:
>>
>> Of course. But what has the internet to do with micropython?
When I download a script from the Internet to run on my
microcontroller, written by somebody in Greece, and it calls print on a
Greek string, I should see Greek text even if I'm in Sweden or New
Zealand or Japan. A fully localised application would be better, of
course, but failing that I shouldn't see moji-bake.
> Earlier you said:
>
>> IOW from pov of a universally acceptable character set this is mostly
>> rubbish
>
> "Universally acceptable character set" and microcontrollers may well not
> meet, but if you're talking about universality, you need Unicode. It's
> that simple.
> Maybe there's a use-case for a microcontroller that works in ISO-8859-5
> natively, thus using only eight bits per character,
That won't even make the Russians happy, since in Russia there are
multiple incompatible legacy encodings.
--
Steven
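The legacy-encoding problem can be sketched with two of the Cyrillic 8-bit encodings that ship with CPython. The sample word and the choice of KOI8-R and Windows-1251 are assumptions made for illustration; the same effect applies to any pair of incompatible 8-bit encodings.

```python
text = "Привет"                      # sample string, chosen for illustration

raw = text.encode("koi8_r")          # bytes as a KOI8-R system would store them
misread = raw.decode("cp1251")       # same bytes interpreted as Windows-1251

print(len(raw))                      # 6: one byte per character
print(misread == text)               # False: the reader sees mojibake
print(raw.decode("koi8_r") == text)  # True, but only if the encoding is known

# With code points in the string and one agreed wire encoding, the round trip
# does not depend on the reader's locale.
print(text.encode("utf-8").decode("utf-8") == text)   # True
```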
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2014-06-04 00:58 -0700 |
| Message-ID | <7xoay9w1h0.fsf@ruckus.brouhaha.com> |
| In reply to | #72610 |
Steven D'Aprano <steve@pearwood.info> writes:
>> Maybe there's a use-case for a microcontroller that works in ISO-8859-5
>> natively, thus using only eight bits per character,
> That won't even make the Russians happy, since in Russia there are
> multiple incompatible legacy encodings.
I've never understood why not use UTF-8 for everything.
| From | Robin Becker <robin@reportlab.com> |
|---|---|
| Date | 2014-06-04 11:06 +0100 |
| Message-ID | <mailman.10694.1401876430.18130.python-list@python.org> |
| In reply to | #72614 |
On 04/06/2014 08:58, Paul Rubin wrote:
> Steven D'Aprano <steve@pearwood.info> writes:
>>> Maybe there's a use-case for a microcontroller that works in ISO-8859-5
>>> natively, thus using only eight bits per character,
>> That won't even make the Russians happy, since in Russia there are
>> multiple incompatible legacy encodings.
>
> I've never understood why not use UTF-8 for everything.
>
me too
-mojibaked-ly yrs-
Robin Becker
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2014-06-04 06:01 -0500 |
| Message-ID | <mailman.10697.1401879750.18130.python-list@python.org> |
| In reply to | #72614 |
On 2014-06-04 00:58, Paul Rubin wrote:
> Steven D'Aprano <steve@pearwood.info> writes:
> >> Maybe there's a use-case for a microcontroller that works in
> >> ISO-8859-5 natively, thus using only eight bits per character,
> > That won't even make the Russians happy, since in Russia there
> > are multiple incompatible legacy encodings.
>
> I've never understood why not use UTF-8 for everything.
If you use UTF-8 for everything, then you end up in a world where
string-indexing (see ChrisA's other side thread on this topic) is no
longer an O(1) operation, but an O(N) operation. Some of us slice
strings for a living. ;-) I understand that using UTF-32 would allow
us to maintain O(1) indexing at the cost of every string occupying 4
bytes per character. The FSR (again, as I understand it) allows
strings that fit in one-byte-per-character to use that, scaling up to
use wider characters internally as they're actually needed/used.
At the cost of complexity and non-constant memory space, an O(N)
algorithm could be tweaked down to O(log N) by using an internal
balanced tree of offsets-to-chunks (where the chunk-size was the size
of a block where it was faster to scan linearly than to navigate the
tree). One might even endow the algorithm with FSR smarts, so each
chunk/fragment could be a different encoding in memory, and linearly
iterating over the string would walk the tree, returning each decoded
piece.
</random_ramblings>
-tkc
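To make the O(N) cost concrete, here is a rough sketch of indexing by character into a UTF-8 buffer. `utf8_char_at` is a hypothetical helper written for this illustration, not an API of CPython or MicroPython, and it assumes the buffer holds valid UTF-8.

```python
def utf8_char_at(buf: bytes, index: int) -> str:
    """Return the index-th code point of a valid UTF-8 buffer by linear scan."""
    if index < 0:
        raise IndexError("negative indices are not handled in this sketch")
    seen = -1       # index of the most recent character start encountered
    start = None    # byte offset where the wanted character begins
    for pos, byte in enumerate(buf):
        if byte & 0xC0 != 0x80:          # not a continuation byte: a new character
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:      # next character begins: wanted one is complete
                return buf[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("string index out of range")
    return buf[start:].decode("utf-8")   # wanted character was the last one

text = "a\xfd\u4e2d\U0001F600z"          # 1-, 2-, 3-, 4- and 1-byte characters
buf = text.encode("utf-8")
print([utf8_char_at(buf, i) == text[i] for i in range(len(text))])
```

Every lookup walks the buffer from the start, which is the cost a fixed-width representation such as UTF-32 or the FSR avoids.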
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2014-06-04 14:57 +0300 |
| Message-ID | <8761kgvqdr.fsf@elektro.pacujo.net> |
| In reply to | #72626 |
Tim Chase <python.list@tim.thechases.com>:
> On 2014-06-04 00:58, Paul Rubin wrote:
>> I've never understood why not use UTF-8 for everything.
>
> If you use UTF-8 for everything, then you end up in a world where
> string-indexing (see ChrisA's other side thread on this topic) is no
> longer an O(1) operation, but an O(N) operation.
Most string operations are O(N) anyway. Besides, you could try and be
smart and keep a recent index cached so simple for loops would be O(N)
instead of O(N**2).
So the idea of keeping strings internally in UTF-8 might not be all that
bad.
Marko
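One way to read the "recent index cached" idea is sketched below. `CachedUtf8` is a hypothetical class written for this illustration; it is not how CPython or MicroPython store strings, and it only speeds up lookups that move forward.

```python
class CachedUtf8:
    """Hypothetical UTF-8 backed string that remembers the last indexed position."""

    def __init__(self, text: str) -> None:
        self._buf = text.encode("utf-8")
        self._last = (0, 0)   # (character index, byte offset) of the last lookup

    @staticmethod
    def _width(lead: int) -> int:
        """Length in bytes of the UTF-8 sequence that starts with this lead byte."""
        if lead < 0x80:
            return 1
        if lead >> 5 == 0b110:
            return 2
        if lead >> 4 == 0b1110:
            return 3
        return 4

    def __getitem__(self, index: int) -> str:
        if index < 0:
            raise IndexError("negative indices are not handled in this sketch")
        char_i, byte_i = self._last
        if index < char_i:            # cache is past the target: rescan from the start
            char_i, byte_i = 0, 0
        buf = self._buf
        while char_i < index and byte_i < len(buf):
            byte_i += self._width(buf[byte_i])
            char_i += 1
        if byte_i >= len(buf):
            raise IndexError("string index out of range")
        self._last = (char_i, byte_i)
        return buf[byte_i:byte_i + self._width(buf[byte_i])].decode("utf-8")

s = CachedUtf8("a\xfd\u4e2d\U0001F600z")
print([s[i] for i in range(5)])   # each step advances one character from the cache
```

A sequential `for i in range(n)` loop then costs O(N) overall, while random or backward access still degrades to a rescan.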
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2014-06-04 07:25 -0500 |
| Message-ID | <mailman.10701.1401884774.18130.python-list@python.org> |
| In reply to | #72629 |
On 2014-06-04 14:57, Marko Rauhamaa wrote:
> > If you use UTF-8 for everything, then you end up in a world where
> > string-indexing (see ChrisA's other side thread on this topic) is
> > no longer an O(1) operation, but an O(N) operation.
>
> Most string operations are O(N) anyway. Besides, you could try and
> be smart and keep a recent index cached so simple for loops would
> be O(N) instead of O(N**2). So the idea of keeping strings
> internally in UTF-8 might not be all that bad.
As mentioned elsewhere, I've got a LOT of code that expects that
string indexing is O(1), and rarely are those strings/offsets reused.
I'm streaming through customer/provider data files, so caching
wouldn't do much good other than waste space and the time to maintain
them.
If I knew that string indexing was O(something non constant), I'd have
retooled my algorithms to take that into consideration, but that would
be a lot of code I'd need to touch.
-tkc
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2014-06-04 11:25 -0700 |
| Message-ID | <7x1tv4v8et.fsf@ruckus.brouhaha.com> |
| In reply to | #72632 |
Tim Chase <python.list@tim.thechases.com> writes:
> As mentioned elsewhere, I've got a LOT of code that expects that
> string indexing is O(1), and rarely are those strings/offsets reused.
> I'm streaming through customer/provider data files, so caching
> wouldn't do much good other than waste space and the time to maintain
> them.
I'm having trouble understanding -- if they're only used once then
what's the problem? You're reading some enormous file into a string and
then randomly accessing it by character offset? What size are these
strings?
I can think of a number of workarounds including language extensions,
but mostly I'd be interested in seeing some actual benchmarks of your
unmodified program under both representations.
| From | Robin Becker <robin@reportlab.com> |
|---|---|
| Date | 2014-06-04 12:53 +0100 |
| Message-ID | <mailman.10699.1401882811.18130.python-list@python.org> |
| In reply to | #72614 |
On 04/06/2014 12:01, Tim Chase wrote:
> On 2014-06-04 00:58, Paul Rubin wrote:
>> Steven D'Aprano <steve@pearwood.info> writes:
>>>> Maybe there's a use-case for a microcontroller that works in
>>>> ISO-8859-5 natively, thus using only eight bits per character,
>>> That won't even make the Russians happy, since in Russia there
>>> are multiple incompatible legacy encodings.
>>
>> I've never understood why not use UTF-8 for everything.
>
> If you use UTF-8 for everything, then you end up in a world where
> string-indexing (see ChrisA's other side thread on this topic) is no
> longer an O(1) operation, but an O(N) operation. Some of us slice
> strings for a living. ;-) I understand that using UTF-32 would allow
> us to maintain O(1) indexing at the cost of every string occupying 4
> bytes per character. The FSR (again, as I understand it) allows
> strings that fit in one-byte-per-character to use that, scaling up to
> use wider characters internally as they're actually needed/used.
>
........
I believe that we should distinguish between glyph/character indexing and string
indexing. Even in unicode it may be hard to decide where a visual glyph starts
and ends. I assume most people would like to assign one glyph to one unicode,
but that's not always possible with composed glyphs.
>>> for a in (u'\xc5',u'A\u030a'):
...     for o in (u'\xf6',u'o\u0308'):
...         u=a+u'ngstr'+o+u'm'
...         print("%s %s" % (repr(u),u))
...
u'\xc5ngstr\xf6m' Ångström
u'\xc5ngstro\u0308m' Ångström
u'A\u030angstr\xf6m' Ångström
u'A\u030angstro\u0308m' Ångström
>>> u'\xc5ngstr\xf6m'==u'\xc5ngstro\u0308m'
False
so even unicode doesn't always allow for O(1) glyph indexing. I know this is
artificial, but this is the same situation as utf8 faces just the frequency of
occurrence is different. A very large amount of computing is still western
centric so searching a byte string for latin characters is still efficient;
searching for an n with a tilde on top might not be so easy.
--
Robin Becker
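The comparison that returns False above can be made to succeed by normalizing both sides first. This is a minimal sketch using the standard `unicodedata` module, written for Python 3 (the session above is Python 2); the variable names are illustrative.

```python
import unicodedata

composed = "\xc5ngstr\xf6m"              # precomposed characters
decomposed = "A\u030angstro\u0308m"      # base letters plus combining marks

print(composed == decomposed)            # False: different code point sequences
print(len(composed), len(decomposed))    # 8 10

# NFC composes where a precomposed form exists; NFD decomposes.
print(unicodedata.normalize("NFC", decomposed) == composed)    # True
print(unicodedata.normalize("NFD", composed) == decomposed)    # True
```

Normalizing makes the comparison work, but indexing still counts code points, so position 1 of the decomposed string is the combining ring rather than the letter n.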
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2014-06-04 15:17 +0300 |
| Message-ID | <871tv4vpgk.fsf@elektro.pacujo.net> |
| In reply to | #72628 |
Robin Becker <robin@reportlab.com>:
> >>> u'\xc5ngstr\xf6m'==u'\xc5ngstro\u0308m'
> False
Now *that* would be a valid reason for our resident Unicode expert to
complain! Py3 in no way solves text representation issues definitively.
> I know this is artificial
Not at all. It probably is out of scope for Python, but it is a real
cause for human suffering. What's Unicode for "résumé"?
Note, for example, that Google manages to sort out issues like these. It
sees past diacritics and even case ending.
Marko
| From | Robin Becker <robin@reportlab.com> |
|---|---|
| Date | 2014-06-04 13:31 +0100 |
| Message-ID | <mailman.10702.1401885085.18130.python-list@python.org> |
| In reply to | #72630 |
On 04/06/2014 13:17, Marko Rauhamaa wrote:
.........
>
> Note, for example, that Google manages to sort out issues like these. It
> sees past diacritics and even case ending.
.....
I guess they must normalize all inputs to some standard form and then
search / eigenvectorize on those. There are quite a few diacritics and a
fair few glyphs they could be applied to. I don't think it likely they
could map all possible combinations to a private range.
--
Robin Becker
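A guess at what "normalize to some standard form" could look like for search purposes is sketched below. `search_key` is a hypothetical helper; nothing here is known about what Google actually does, and dropping diacritics is lossy in languages where they distinguish words.

```python
import unicodedata

def search_key(text: str) -> str:
    """Build a diacritic-insensitive, case-insensitive key for matching."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks (accents, rings, diaereses) left after decomposition.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

variants = ["R\xe9sum\xe9", "re\u0301sume\u0301", "RESUME", "resume"]
print({search_key(v) for v in variants})   # {'resume'}
```

Queries and indexed documents would both pass through the same function, so no private code point range is needed for the combinations.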
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-06-04 13:51 +0000 |
| Message-ID | <538f246d$0$29978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #72628 |
On Wed, 04 Jun 2014 12:53:19 +0100, Robin Becker wrote:
> I believe that we should distinguish between glyph/character indexing
> and string indexing. Even in unicode it may be hard to decide where a
> visual glyph starts and ends. I assume most people would like to assign
> one glyph to one unicode, but that's not always possible with composed
> glyphs.
>
> >>> for a in (u'\xc5',u'A\u030a'):
> ...     for o in (u'\xf6',u'o\u0308'):
> ...         u=a+u'ngstr'+o+u'm'
> ...         print("%s %s" % (repr(u),u))
> ...
> u'\xc5ngstr\xf6m' Ångström
> u'\xc5ngstro\u0308m' Ångström
> u'A\u030angstr\xf6m' Ångström
> u'A\u030angstro\u0308m' Ångström
> >>> u'\xc5ngstr\xf6m'==u'\xc5ngstro\u0308m'
> False
>
> so even unicode doesn't always allow for O(1) glyph indexing.
What you're talking about here is "graphemes", not glyphs. Glyphs are the
little pictures that represent the characters when written down.
Graphemes (technically, "grapheme clusters") are the things which native
speakers of a language believe ought to be considered a single unit.
Think of them as similar to letters. That can be quite tricky to
determine, and is dependent on the language you are speaking. The letters
"ch" are considered two letters in English, but only a single letter in
Czech and Slovak.
I believe that *grapheme-aware* text processing is *far* too complicated
for a programming language to promise. If you think that len() needs to
count graphemes, then what should len("ch") return, 1 or 2? Grapheme
processing is a complex, complicated task best left up to powerful
libraries built on top of a sturdy Unicode base.
> I know this is artificial,
But it isn't artificial in the least. Unicode isn't complicated because
it's badly designed, or complicated for the sake of complexity. It's
complicated because human language is complicated. That, and because of
legacy encodings.
> but this is the same situation as utf8 faces just
> the frequency of occurrence is different. A very large amount of
> computing is still western centric so searching a byte string for latin
> characters is still efficient; searching for an n with a tilde on top
> might not be so easy.
This is a good point, but on balance I disagree. A grapheme-aware library
is likely to need to be based on more complex data structures than simple
strings (arrays of code points). But for the underlying relatively simple
string library, graphemes are too hard. Code points are simple, and the
language can deal with code points without caring about their semantics.
For instance, in English, I might not want to insert letters between the
q and u of "queen", since in English u (nearly) always follows q. It
would be inappropriate for the programming language string library to
care about that, and similarly it would be inappropriate for it to care
that u'A\u030a' represents a single grapheme Å.
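As a hedged illustration of leaving this to a library: in Python 3 the standard unicodedata module can normalize both spellings to one form before comparing. The variable names below are made up for this sketch.

```python
import unicodedata

precomposed = "\u00c5ngstr\u00f6m"     # Å and ö as single code points
combining = "A\u030angstro\u0308m"     # base letters plus combining marks

# The string type compares code point by code point, so these differ.
print(precomposed == combining)

# Normalizing both sides to NFC makes the comparison succeed.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", combining))
print(nfc_equal)
```

The string type itself stays ignorant of the equivalence; the caller opts in to normalization when it matters.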
--
Steven D'Aprano
http://import-that.dreamwidth.org/
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2014-06-10 00:32 -0700 |
| Message-ID | <0a6ebce7-aa3f-4374-a0a1-004e421a2e15@googlegroups.com> |
| In reply to | #72628 |
On Wednesday, 4 June 2014 13:53:19 UTC+2, Robin Becker wrote:
> On 04/06/2014 12:01, Tim Chase wrote:
> > On 2014-06-04 00:58, Paul Rubin wrote:
> >> Steven D'Aprano <steve@pearwood.info> writes:
> >>>> Maybe there's a use-case for a microcontroller that works in
> >>>> ISO-8859-5 natively, thus using only eight bits per character,
> >>> That won't even make the Russians happy, since in Russia there
> >>> are multiple incompatible legacy encodings.
> >>
> >> I've never understood why not use UTF-8 for everything.
> >
> > If you use UTF-8 for everything, then you end up in a world where
> > string-indexing (see ChrisA's other side thread on this topic) is no
> > longer an O(1) operation, but an O(N) operation. Some of us slice
> > strings for a living. ;-) I understand that using UTF-32 would allow
> > us to maintain O(1) indexing at the cost of every string occupying 4
> > bytes per character. The FSR (again, as I understand it) allows
> > strings that fit in one-byte-per-character to use that, scaling up to
> > use wider characters internally as they're actually needed/used.
> >
> ........
> I believe that we should distinguish between glyph/character indexing and string
> indexing. Even in unicode it may be hard to decide where a visual glyph starts
> and ends. I assume most people would like to assign one glyph to one unicode,
> but that's not always possible with composed glyphs.
>
> >>> for a in (u'\xc5',u'A\u030a'):
> ...     for o in (u'\xf6',u'o\u0308'):
> ...         u=a+u'ngstr'+o+u'm'
> ...         print("%s %s" % (repr(u),u))
> ...
> u'\xc5ngstr\xf6m' Ångström
> u'\xc5ngstro\u0308m' Ångström
> u'A\u030angstr\xf6m' Ångström
> u'A\u030angstro\u0308m' Ångström
> >>> u'\xc5ngstr\xf6m'==u'\xc5ngstro\u0308m'
> False
>
> so even unicode doesn't always allow for O(1) glyph indexing. I know this is
> artificial, but this is the same situation as utf8 faces just the frequency of
> occurrence is different. A very large amount of computing is still western
> centric so searching a byte string for latin characters is still efficient;
> searching for an n with a tilde on top might not be so easy.
> --
> Robin Becker
=========
Python succeeded to become an anti-unicode product!
jmf
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2014-06-10 02:13 -0700 |
| Message-ID | <0f0a2fbe-48df-46e0-a9a0-65896f02e22c@googlegroups.com> |
| In reply to | #73076 |
On Tuesday, 10 June 2014 09:32:34 UTC+2, wxjm...@gmail.com wrote:
> On Wednesday, 4 June 2014 13:53:19 UTC+2, Robin Becker wrote:
> > On 04/06/2014 12:01, Tim Chase wrote:
> > > On 2014-06-04 00:58, Paul Rubin wrote:
> > >> Steven D'Aprano <steve@pearwood.info> writes:
> > >>>> Maybe there's a use-case for a microcontroller that works in
> > >>>> ISO-8859-5 natively, thus using only eight bits per character,
> > >>> That won't even make the Russians happy, since in Russia there
> > >>> are multiple incompatible legacy encodings.
> > >>
> > >> I've never understood why not use UTF-8 for everything.
> > >
> > > If you use UTF-8 for everything, then you end up in a world where
> > > string-indexing (see ChrisA's other side thread on this topic) is no
> > > longer an O(1) operation, but an O(N) operation. Some of us slice
> > > strings for a living. ;-) I understand that using UTF-32 would allow
> > > us to maintain O(1) indexing at the cost of every string occupying 4
> > > bytes per character. The FSR (again, as I understand it) allows
> > > strings that fit in one-byte-per-character to use that, scaling up to
> > > use wider characters internally as they're actually needed/used.
> > >
> > ........
> > I believe that we should distinguish between glyph/character indexing and string
> > indexing. Even in unicode it may be hard to decide where a visual glyph starts
> > and ends. I assume most people would like to assign one glyph to one unicode,
> > but that's not always possible with composed glyphs.
> >
> > >>> for a in (u'\xc5',u'A\u030a'):
> > ...     for o in (u'\xf6',u'o\u0308'):
> > ...         u=a+u'ngstr'+o+u'm'
> > ...         print("%s %s" % (repr(u),u))
> > ...
> > u'\xc5ngstr\xf6m' Ångström
> > u'\xc5ngstro\u0308m' Ångström
> > u'A\u030angstr\xf6m' Ångström
> > u'A\u030angstro\u0308m' Ångström
> > >>> u'\xc5ngstr\xf6m'==u'\xc5ngstro\u0308m'
> > False
> >
> > so even unicode doesn't always allow for O(1) glyph indexing. I know this is
> > artificial, but this is the same situation as utf8 faces just the frequency of
> > occurrence is different. A very large amount of computing is still western
> > centric so searching a byte string for latin characters is still efficient;
> > searching for an n with a tilde on top might not be so easy.
> > --
> > Robin Becker
> =========
> Python succeeded to become an anti-unicode product!
> jmf
-----
And deeply buggy!
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2014-06-04 07:21 -0500 |
| Message-ID | <mailman.10700.1401884522.18130.python-list@python.org> |
| In reply to | #72614 |
On 2014-06-04 12:53, Robin Becker wrote:
> > If you use UTF-8 for everything, then you end up in a world where
> > string-indexing (see ChrisA's other side thread on this topic) is
> > no longer an O(1) operation, but an O(N) operation. Some of us
> > slice strings for a living. ;-)
> ........
> I believe that we should distinguish between glyph/character
> indexing and string indexing.
I'm only talking about string indexing using my_string[some_slice]
which is traditionally O(1) and breaking that [cw]ould cause
unexpected performance degradation.
-tkc
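A rough Python 3 sketch of why code point indexing over raw UTF-8 needs a scan. The function utf8_index is a hypothetical helper written for this illustration, not anything from CPython or Micro Python, and it assumes its input is valid UTF-8.

```python
def utf8_index(data, index):
    """Return the code point at position `index` in UTF-8 bytes `data`.

    Walks from the start of the buffer, so the cost grows with `index`.
    """
    if index < 0:
        raise IndexError("negative index not supported in this sketch")
    seen = -1       # index of the most recently started code point
    start = None    # byte offset where the wanted code point begins
    for pos, byte in enumerate(data):
        # Bytes of the form 0b10xxxxxx are continuation bytes;
        # any other byte starts a new code point.
        if (byte & 0xC0) != 0x80:
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("index out of range")
    return data[start:].decode("utf-8")


text = "\u00c5ngstr\u00f6m"
encoded = text.encode("utf-8")

# Same answer as native indexing, but found by scanning the bytes.
print(utf8_index(encoded, 6) == text[6])
```

With a fixed-width representation, the byte offset of item i can be computed directly, which is what keeps my_string[i] constant-time.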