

Groups > comp.lang.python > #72553 > unrolled thread

Re: Micro Python -- a lean and efficient implementation of Python 3

Started by: Paul Sokolovsky <pmiscml@gmail.com>
First post: 2014-06-04 00:41 +0300
Last post: 2014-06-04 17:10 +1000
Articles: 20 on this page of 35; 15 participants


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.


Contents

  Re: Micro Python -- a lean and efficient implementation of Python 3 Paul Sokolovsky <pmiscml@gmail.com> - 2014-06-04 00:41 +0300
    Re: Micro Python -- a lean and efficient implementation of Python 3 Rustom Mody <rustompmody@gmail.com> - 2014-06-03 20:37 -0700
      Re: Micro Python -- a lean and efficient implementation of Python 3 Chris Angelico <rosuav@gmail.com> - 2014-06-04 13:52 +1000
        Re: Micro Python -- a lean and efficient implementation of Python 3 Rustom Mody <rustompmody@gmail.com> - 2014-06-03 21:40 -0700
          Re: Micro Python -- a lean and efficient implementation of Python 3 Ian Kelly <ian.g.kelly@gmail.com> - 2014-06-03 23:02 -0600
          Re: Micro Python -- a lean and efficient implementation of Python 3 Chris Angelico <rosuav@gmail.com> - 2014-06-04 17:16 +1000
            Re: Micro Python -- a lean and efficient implementation of Python 3 Steven D'Aprano <steve@pearwood.info> - 2014-06-04 07:42 +0000
              Re: Micro Python -- a lean and efficient implementation of Python 3 Paul Rubin <no.email@nospam.invalid> - 2014-06-04 00:58 -0700
                Re: Micro Python -- a lean and efficient implementation of Python 3 Robin Becker <robin@reportlab.com> - 2014-06-04 11:06 +0100
                Re: Micro Python -- a lean and efficient implementation of Python 3 Tim Chase <python.list@tim.thechases.com> - 2014-06-04 06:01 -0500
                  Re: Micro Python -- a lean and efficient implementation of Python 3 Marko Rauhamaa <marko@pacujo.net> - 2014-06-04 14:57 +0300
                    Re: Micro Python -- a lean and efficient implementation of Python 3 Tim Chase <python.list@tim.thechases.com> - 2014-06-04 07:25 -0500
                      Re: Micro Python -- a lean and efficient implementation of Python 3 Paul Rubin <no.email@nospam.invalid> - 2014-06-04 11:25 -0700
                Re: Micro Python -- a lean and efficient implementation of Python 3 Robin Becker <robin@reportlab.com> - 2014-06-04 12:53 +0100
                  Re: Micro Python -- a lean and efficient implementation of Python 3 Marko Rauhamaa <marko@pacujo.net> - 2014-06-04 15:17 +0300
                    Re: Micro Python -- a lean and efficient implementation of Python 3 Robin Becker <robin@reportlab.com> - 2014-06-04 13:31 +0100
                  Re: Micro Python -- a lean and efficient implementation of Python 3 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-04 13:51 +0000
                  Re: Micro Python -- a lean and efficient implementation of Python 3 wxjmfauth@gmail.com - 2014-06-10 00:32 -0700
                    Re: Micro Python -- a lean and efficient implementation of Python 3 wxjmfauth@gmail.com - 2014-06-10 02:13 -0700
                Re: Micro Python -- a lean and efficient implementation of Python 3 Tim Chase <python.list@tim.thechases.com> - 2014-06-04 07:21 -0500
                Re: Micro Python -- a lean and efficient implementation of Python 3 Travis Griggs <travisgriggs@gmail.com> - 2014-06-06 09:59 -0700
                  Re: Micro Python -- a lean and efficient implementation of Python 3 Roy Smith <roy@panix.com> - 2014-06-06 13:29 -0400
                Re: Micro Python -- a lean and efficient implementation of Python 3 Tim Chase <python.list@tim.thechases.com> - 2014-06-06 21:20 -0500
                  Re: Micro Python -- a lean and efficient implementation of Python 3 wxjmfauth@gmail.com - 2014-06-10 12:27 -0700
          Re: Micro Python -- a lean and efficient implementation of Python 3 Chris Angelico <rosuav@gmail.com> - 2014-06-04 17:20 +1000
          Re: Micro Python -- a lean and efficient implementation of Python 3 Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> - 2014-06-04 10:00 +0200
        Re: Micro Python -- a lean and efficient implementation of Python 3 Roy Smith <roy@panix.com> - 2014-06-04 14:42 -0400
          Re: Micro Python -- a lean and efficient implementation of Python 3 Rustom Mody <rustompmody@gmail.com> - 2014-06-04 19:06 -0700
            Re: Micro Python -- a lean and efficient implementation of Python 3 Roy Smith <roy@panix.com> - 2014-06-05 09:59 -0400
              Re: Micro Python -- a lean and efficient implementation of Python 3 Chris Angelico <rosuav@gmail.com> - 2014-06-06 01:33 +1000
      Re: Micro Python -- a lean and efficient implementation of Python 3 Steven D'Aprano <steve@pearwood.info> - 2014-06-04 05:20 +0000
        Re: Micro Python -- a lean and efficient implementation of Python 3 Rustom Mody <rustompmody@gmail.com> - 2014-06-03 22:36 -0700
        Re: Micro Python -- a lean and efficient implementation of Python 3 Ian Kelly <ian.g.kelly@gmail.com> - 2014-06-03 23:55 -0600
        Re: Micro Python -- a lean and efficient implementation of Python 3 Terry Reedy <tjreedy@udel.edu> - 2014-06-04 03:00 -0400
        Re: Micro Python -- a lean and efficient implementation of Python 3 Chris Angelico <rosuav@gmail.com> - 2014-06-04 17:10 +1000



#72553 — Re: Micro Python -- a lean and efficient implementation of Python 3

From: Paul Sokolovsky <pmiscml@gmail.com>
Date: 2014-06-04 00:41 +0300
Subject: Re: Micro Python -- a lean and efficient implementation of Python 3
Message-ID: <mailman.10646.1401831682.18130.python-list@python.org>

Hello,

On Wed, 4 Jun 2014 03:08:57 +1000
Chris Angelico <rosuav@gmail.com> wrote:

[]

> With that encouragement, I just cloned your repo and built it on amd64
> Debian Wheezy. Works just fine! Except... I've just found one fairly
> major problem with your support of Python 3.x syntax. Your str type is
> documented as not supporting Unicode. Is that a current flaw that
> you're planning to remove, or a design limitation? Either way, I'm a
> bit dubious about a purported version 1 that doesn't do one of the
> things that Py3 is especially good at - matched by very few languages
> in its encouragement of best practice with Unicode support.

I should start by saying that it was MicroPython that made me look at
Python3. So for me, it has already done a lot of good by getting me out
from under the rock: instead of "at my job, we use python 2.x" I can now
report "at my job, we don't wait for our distro to kick us in the ass, and
add 'from __future__ import print_function' whenever we touch some
code".

With that in mind, I, as many others, think that forcing Unicode bloat
upon people by default is the most controversial feature of Python3.
The reason is that you can go a very long way in dealing with the
languages of the people of the world by just treating strings as
consisting of 8-bit data. I'd say that's enough for 90% of applications.
Unicode is needed only if one needs to deal with multiple languages *at
the same time*, which is fairly rare (the remaining 10% of apps).

And please keep in mind that MicroPython was originally intended for (and
should remain scalable down to) an MCU. Unicode is needed even less
there, and there are even fewer resources to support Unicode just because.

> 
> What is your str type actually able to support? It seems to store
> non-ASCII bytes in it, which I presume are supposed to represent the
> rest of Latin-1, but I wasn't able to print them out:

There's work in progress on documenting the differences between CPython
and MicroPython at
https://github.com/micropython/micropython/wiki/Differences; it gives
the following account of this:

"No unicode support is actually implemented. Python3 calls for strict
difference between str and bytes data types (unlike Python2, which has
neutral unified data type for strings and binary data, and separates
out unicode data type). MicroPython faithfully implements str/bytes
separation, but currently, underlying str implementation is the same as
bytes. This means strings in MicroPython are not unicode, but 8-bit
characters (fully binary-clean)."
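
To make the str/bytes distinction concrete, here is a small sketch of how
a standard CPython 3 interpreter (not MicroPython) treats a string
containing \xfd. It only illustrates the code point versus byte question
discussed in this thread; the variable names are made up for the example.

```python
# Illustrative only: behaviour of CPython 3, where str holds code points.
text = "asdf\xfdqwer"                # \xfd is the code point U+00FD

as_latin1 = text.encode("latin-1")   # one byte per character
as_utf8 = text.encode("utf-8")       # U+00FD becomes the two bytes C3 BD

print(len(text))       # number of code points in the str
print(as_latin1)       # raw Latin-1 bytes
print(as_utf8)         # raw UTF-8 bytes
```

An implementation whose str is really a byte string has to pick one of
these byte forms implicitly, which is why the same literal can print
differently depending on the terminal encoding.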

> 
> Micro Python v1.0.1-144-gb294a7e on 2014-06-04; UNIX version
> >>> print("asdf\xfdqwer")
> 
> Python 3.5.0a0 (default:6a0def54c63d, Mar 26 2014, 01:11:09)
> [GCC 4.7.2] on linux
> >>> print("asdf\xfdqwer")
> asdfýqwer
> 
> In fact, printing seems to work with bytes:
> 
> >>> print("asdf\xc3\xbdqwer")
> asdfýqwer
> 
> (my terminal uses UTF-8, this is the UTF-8 encoding of the above
> string)
> 
> I would strongly recommend either implementing all of PEP 393, or at
> least making it very clear that this pretends everything is bytes -
> and possibly disallowing any codepoint >127 in any string, which will
> at least mean you're safe on all ASCII-compatible encodings.

MicroPython is not the first "tiny" Python implementation. What differs
MicroPython is that it's neither aim or motto to be a subset of
language. And yet, it's not CPython rewrite either. So, while Unicode
support is surely possible, it's unlikely to be done as "all of
PEPxxx". If you ask me, I'd personally envision it to be implemented as
UTF-8 (in this regard I agree with (or take an influence from) 
http://lucumr.pocoo.org/2014/1/9/ucs-vs-utf8/). But I don't have plans
to work on Unicode any time soon - applications I envision for
MicroPython so far fit in those 90% that live happily without Unicode.

But generally, there's no strict roadmap for MicroPython features.
While core of the language (parser, compiler, VM) is developed by
Damien, many other features were already contributed by the community
(project went open-source at the beginning of the year). So, if someone
will want to see Unicode support up to the level of providing patches,
it gladly will be accepted. The only thing we established is that we
want to be able to scale down, and thus almost all features should be
configurable.


> 
> ChrisA
> -- 
> https://mail.python.org/mailman/listinfo/python-list



-- 
Best regards,
 Paul                          mailto:pmiscml@gmail.com



#72582

From: Rustom Mody <rustompmody@gmail.com>
Date: 2014-06-03 20:37 -0700
Message-ID: <44acd692-5dcd-4e5f-8238-7fbe0de4db2a@googlegroups.com>
In reply to: #72553

On Wednesday, June 4, 2014 3:11:12 AM UTC+5:30, Paul Sokolovsky wrote:

> With that in mind, I, as many others, think that forcing Unicode bloat
> upon people by default is the most controversial feature of Python3.
> The reason is that you go very long way dealing with languages of the
> people of the world by just treating strings as consisting of 8-bit
> data. I'd say, that's enough for 90% of applications. Unicode is needed
> only if one needs to deal with multiple languages *at the same time*,
> which is fairly rare (remaining 10% of apps).

> And please keep in mind that MicroPython was originally intended (and
> should be remain scalable down to) an MCU. Unicode needed there is even
> less, and even less resources to support Unicode just because.

At some time (when jmf was making more intelligible noises) I had
suggested that the choice between 1/2/4 byte strings that happens at
runtime in python3's FSR can be made at python-start time with a
command-line switch.  There are many combinations here; here is one in
more detail:

Instead of having one (FSR) string engine, you have (up to) 4

- a pure 1 byte (ASCII)
- a pure 2 byte (BMP) with decode-failures for out-of-ranges
- a pure 4 byte -- everything UTF-32
- FSR dynamic switching at runtime (with massive moping from the world's jmfs)

The point is that only one of these engines would be brought into memory
based on command-line/config options.
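
A rough sketch of the width choice being described, in plain Python. The
function names are hypothetical, and the 1-byte tier here follows the
Latin-1 cut-off (code points up to 0xFF) rather than pure ASCII; a pure
ASCII engine would use 0x7F instead.

```python
def narrowest_width(s: str) -> int:
    """Return 1, 2 or 4: bytes per character needed for fixed-width storage."""
    highest = max(map(ord, s), default=0)
    if highest <= 0xFF:
        return 1          # fits in one byte per character
    if highest <= 0xFFFF:
        return 2          # fits in the BMP
    return 4              # needs the full code point range


def check_engine(s: str, engine_width: int) -> str:
    """Hypothetical fixed-width engine chosen at start-up: reject wider text."""
    if narrowest_width(s) > engine_width:
        raise ValueError("string needs wider characters than this engine stores")
    return s
```

The FSR makes the narrowest_width decision per string at runtime; the
proposal above fixes engine_width once per process and fails on anything
wider.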

Some more personal thoughts (that may be quite ill-informed!):

1. I regard myself as a unicode ignoramus+enthusiast. The world will
be a better place if unicode is more pervasive.
See http://blog.languager.org/2014/04/unicoded-python.html

As it happens I am also a computer scientist -- I understand that in
contexts where anything other than 8-bit chars is unacceptably
inefficient, unicode-bloat may be a real thing.

2. My casual/cursory reading of the contents of the SMP-planes
suggests that the stuff there is things like
- Egyptian hieroglyphics
- mahjong characters
- ancient Greek musical symbols
- alchemical symbols etc etc.

IOW from pov of a universally acceptable character set this is mostly
rubbish

And so a pure BMP-supporting implementation may be a reasonable
compromise. [As long as no surrogate-pairs are there]



#72583

From: Chris Angelico <rosuav@gmail.com>
Date: 2014-06-04 13:52 +1000
Message-ID: <mailman.10673.1401853976.18130.python-list@python.org>
In reply to: #72582

On Wed, Jun 4, 2014 at 1:37 PM, Rustom Mody <rustompmody@gmail.com> wrote:
> 2. My casual/cursory reading of the contents of the SMP-planes
> suggests that the stuff there is things like
> - Egyptian hieroglyphics
> - mahjong characters
> - ancient Greek musical symbols
> - alchemical symbols etc etc.
>
> IOW from pov of a universally acceptable character set this is mostly
> rubbish
>
> And so a pure BMP-supporting implementation may be a reasonable
> compromise. [As long as no surrogate-pairs are there]

Not if you're working on the internet. There are several critical
groups of characters that aren't in the BMP, such as:

1) Most or all Chinese and Japanese characters
2) Heaps of emoticons and fancy letters
3) Mathematical symbols

You can't ignore those. You might be able to say "Well, my program
will run slower if you throw these at it", but if you're going down
that route, you probably want the full FSR and the advantages it
confers on ASCII and Latin-1 strings. Binding your program to BMP-only
is nearly as dangerous as binding it to ASCII-only; potentially worse,
because you can run an awful lot of artificial tests without
remembering to stick in some astral characters.

It's not rubbish. It's important stuff that you need to deal with.
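
One way to avoid forgetting astral characters in tests is to check for
them explicitly. The following is a minimal sketch; astral_chars is a
made-up helper name, and the sample string is just an illustration with
one mathematical alphanumeric symbol and one emoticon.

```python
def astral_chars(s: str) -> list:
    """Return the characters of s that lie outside the BMP (above U+FFFF)."""
    return [ch for ch in s if ord(ch) > 0xFFFF]


sample = "x \U0001D54F \U0001F600"

print(astral_chars(sample))                    # the two non-BMP characters
print(len(sample))                             # length in code points
print(len(sample.encode("utf-16-le")) // 2)    # length in UTF-16 code units
```

The code point count and the UTF-16 code unit count differ because each
astral character takes a surrogate pair in UTF-16, which is exactly what
a BMP-only implementation cannot represent as a single character.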

ChrisA



#72588

From: Rustom Mody <rustompmody@gmail.com>
Date: 2014-06-03 21:40 -0700
Message-ID: <c04434ce-cbc4-49ab-b312-24f1631dd894@googlegroups.com>
In reply to: #72583

On Wednesday, June 4, 2014 9:22:54 AM UTC+5:30, Chris Angelico wrote:
> On Wed, Jun 4, 2014 at 1:37 PM, Rustom Mody wrote:
> > And so a pure BMP-supporting implementation may be a reasonable
> > compromise. [As long as no surrogate-pairs are there]

> Not if you're working on the internet. There are several critical
> groups of characters that aren't in the BMP, such as:

Of course. But what has the internet to do with micropython?

This is their stated goal:
| Micro Python is a lean and fast implementation of the Python
| programming language (python.org) that is optimised to run on a
| microcontroller.


> 1) Most or all Chinese and Japanese characters

Don't know how you count 'most'

| One possible rationale is the desire to limit the size of the full
| Unicode character set, where CJK characters as represented by discrete
| ideograms may approach or exceed 100,000 (while those required for
| ordinary literacy in any language are probably under 3,000). Version 1
| of Unicode was designed to fit into 16 bits and only 20,940 characters
| (32%) out of the possible 65,536 were reserved for these CJK Unified
| Ideographs. Later Unicode has been extended to 21 bits allowing many
| more CJK characters (75,960 are assigned, with room for more).

| From http://en.wikipedia.org/wiki/Han_unification



#72593

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2014-06-03 23:02 -0600
Message-ID: <mailman.10677.1401858199.18130.python-list@python.org>
In reply to: #72588

On Tue, Jun 3, 2014 at 10:40 PM, Rustom Mody <rustompmody@gmail.com> wrote:
>> 1) Most or all Chinese and Japanese characters
>
> Don't know how you count 'most'
>
> | One possible rationale is the desire to limit the size of the full
> | Unicode character set, where CJK characters as represented by discrete
> | ideograms may approach or exceed 100,000 (while those required for
> | ordinary literacy in any language are probably under 3,000). Version 1
> | of Unicode was designed to fit into 16 bits and only 20,940 characters
> | (32%) out of the possible 65,536 were reserved for these CJK Unified
> | Ideographs. Later Unicode has been extended to 21 bits allowing many
> | more CJK characters (75,960 are assigned, with room for more).
>
> | From http://en.wikipedia.org/wiki/Han_unification

So there are 20,940 CJK characters in the BMP, and approximately
55,000 more in the SIP.  I'd count 55,000 out of 75,960 as "most".
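
The arithmetic behind that estimate, using only the two figures from the
quoted Wikipedia passage (the variable names are just for illustration):

```python
total_cjk = 75_960      # CJK characters assigned, per the quoted passage
in_bmp = 20_940         # of which were reserved in the 16-bit BMP

outside_bmp = total_cjk - in_bmp
share = outside_bmp / total_cjk

print(outside_bmp, f"{share:.0%}")
```

This counts assigned characters only; it says nothing about how often
each one appears in real text.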



#72606

From: Chris Angelico <rosuav@gmail.com>
Date: 2014-06-04 17:16 +1000
Message-ID: <mailman.10684.1401866176.18130.python-list@python.org>
In reply to: #72588

On Wed, Jun 4, 2014 at 2:40 PM, Rustom Mody <rustompmody@gmail.com> wrote:
> On Wednesday, June 4, 2014 9:22:54 AM UTC+5:30, Chris Angelico wrote:
>> On Wed, Jun 4, 2014 at 1:37 PM, Rustom Mody wrote:
>> > And so a pure BMP-supporting implementation may be a reasonable
>> > compromise. [As long as no surrogate-pairs are there]
>
>> Not if you're working on the internet. There are several critical
>> groups of characters that aren't in the BMP, such as:
>
> Of course. But what has the internet to do with micropython?

Earlier you said:

> IOW from pov of a universally acceptable character set this is mostly
> rubbish

"Universally acceptable character set" and microcontrollers may well
not meet, but if you're talking about universality, you need Unicode.
It's that simple.

Maybe there's a use-case for a microcontroller that works in
ISO-8859-5 natively, thus using only eight bits per character, but
even if there is, I would expect a Python implementation on it to
expose Unicode codepoints in its strings. (Most of the time you won't
even be aware of the exact codepoint values. It's only when you put
\xNN or \uNNNN or \U000NNNNN escapes into your strings, or explicitly
use ord/chr or equivalent, that it'd make a difference.) The point is
not that you might be able to get away with sticking your head in the
sand and wishing Unicode would just go away. Even if you can, it's not
something Python 3 can ever do.

And I don't think anybody can, anyway. If your device is big enough to
hold Python, it should be big enough to handle Unicode; and then you
don't have to say "Oh, sorry rest-of-the-world, this only works in
English... and only a subset of English... and stuff".

ChrisA



#72610

From: Steven D'Aprano <steve@pearwood.info>
Date: 2014-06-04 07:42 +0000
Message-ID: <538ecdef$0$11109$c3e8da3@news.astraweb.com>
In reply to: #72606

On Wed, 04 Jun 2014 17:16:13 +1000, Chris Angelico wrote:

> On Wed, Jun 4, 2014 at 2:40 PM, Rustom Mody <rustompmody@gmail.com>
> wrote:
>> On Wednesday, June 4, 2014 9:22:54 AM UTC+5:30, Chris Angelico wrote:
>>> On Wed, Jun 4, 2014 at 1:37 PM, Rustom Mody wrote:
>>> > And so a pure BMP-supporting implementation may be a reasonable
>>> > compromise. [As long as no surrogate-pairs are there]
>>
>>> Not if you're working on the internet. There are several critical
>>> groups of characters that aren't in the BMP, such as:
>>
>> Of course. But what has the internet to do with micropython?

When I download a script from the Internet to run on my microcontroller, 
written by somebody in Greece, and it calls print on a Greek string, I 
should see Greek text even if I'm in Sweden or New Zealand or Japan.

A fully localised application would be better, of course, but failing 
that I shouldn't see mojibake.


> Earlier you said:
> 
>> IOW from pov of a universally acceptable character set this is mostly
>> rubbish
> 
> "Universally acceptable character set" and microcontrollers may well not
> meet, but if you're talking about universality, you need Unicode. It's
> that simple.

 
> Maybe there's a use-case for a microcontroller that works in ISO-8859-5
> natively, thus using only eight bits per character, 

That won't even make the Russians happy, since in Russia there are 
multiple incompatible legacy encodings.



-- 
Steven



#72614

From: Paul Rubin <no.email@nospam.invalid>
Date: 2014-06-04 00:58 -0700
Message-ID: <7xoay9w1h0.fsf@ruckus.brouhaha.com>
In reply to: #72610

Steven D'Aprano <steve@pearwood.info> writes:
>> Maybe there's a use-case for a microcontroller that works in ISO-8859-5
>> natively, thus using only eight bits per character, 
> That won't even make the Russians happy, since in Russia there are 
> multiple incompatible legacy encodings.

I've never understood why not use UTF-8 for everything.



#72621

From: Robin Becker <robin@reportlab.com>
Date: 2014-06-04 11:06 +0100
Message-ID: <mailman.10694.1401876430.18130.python-list@python.org>
In reply to: #72614

On 04/06/2014 08:58, Paul Rubin wrote:
> Steven D'Aprano <steve@pearwood.info> writes:
>>> Maybe there's a use-case for a microcontroller that works in ISO-8859-5
>>> natively, thus using only eight bits per character,
>> That won't even make the Russians happy, since in Russia there are
>> multiple incompatible legacy encodings.
>
> I've never understood why not use UTF-8 for everything.
>
me too

-mojibaked-ly yrs-
Robin Becker



#72626

From: Tim Chase <python.list@tim.thechases.com>
Date: 2014-06-04 06:01 -0500
Message-ID: <mailman.10697.1401879750.18130.python-list@python.org>
In reply to: #72614

On 2014-06-04 00:58, Paul Rubin wrote:
> Steven D'Aprano <steve@pearwood.info> writes:
> >> Maybe there's a use-case for a microcontroller that works in
> >> ISO-8859-5 natively, thus using only eight bits per character, 
> > That won't even make the Russians happy, since in Russia there
> > are multiple incompatible legacy encodings.
> 
> I've never understood why not use UTF-8 for everything.

If you use UTF-8 for everything, then you end up in a world where
string-indexing (see ChrisA's other side thread on this topic) is no
longer an O(1) operation, but an O(N) operation.  Some of us slice
strings for a living. ;-)  I understand that using UTF-32 would allow
us to maintain O(1) indexing at the cost of every string occupying 4
bytes per character.  The FSR (again, as I understand it) allows
strings that fit in one-byte-per-character to use that, scaling up to
use wider characters internally as they're actually needed/used.
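
To illustrate why indexing into UTF-8 is O(N), here is a minimal sketch
of looking up the character at a given code point position in a UTF-8
byte string. The helper names are hypothetical, and the code assumes the
input is valid UTF-8 and handles only non-negative indexes.

```python
def _seq_len(data: bytes, pos: int) -> int:
    """Length in bytes of the UTF-8 sequence starting at pos."""
    if pos >= len(data):
        raise IndexError("string index out of range")
    first = data[pos]
    if first < 0x80:
        return 1
    if first >> 5 == 0b110:
        return 2
    if first >> 4 == 0b1110:
        return 3
    if first >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def utf8_index(data: bytes, index: int) -> str:
    """Return the character at code point position index by linear scan."""
    if index < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    pos = 0
    for _ in range(index):          # must step over every earlier character
        pos += _seq_len(data, pos)
    end = pos + _seq_len(data, pos)
    return data[pos:end].decode("utf-8")
```

With a fixed-width representation the same lookup is a single
multiplication and a read, which is the O(1) behaviour being defended
here.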

At the cost of complexity and non-constant memory space, an O(N)
algorithm could be tweaked down to O(log N) by using an internal
balanced tree of offsets-to-chunks (where the chunk-size was the size
of a block where it was faster to scan linearly than to navigate the
tree).  One might even endow the algorithm with FSR smarts, so each
chunk/fragment could be a different encoding in memory, and linearly
iterating over the string would walk the tree, returning each decoded
piece. </random_ramblings>
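
A much simpler relative of that idea can be sketched with a flat table
of checkpoints instead of a balanced tree: record the byte offset of
every chunk-th character, jump to the nearest checkpoint, and scan from
there. This is an assumption-laden sketch (the class name and chunk size
are invented, and it does not do the per-chunk encoding trick), with
lookups costing O(chunk) rather than O(log N).

```python
class CheckpointedUtf8:
    """UTF-8 storage plus one byte offset per `chunk` characters."""

    def __init__(self, text: str, chunk: int = 64):
        self.data = text.encode("utf-8")
        self.chunk = chunk
        self.length = len(text)
        self.offsets = []
        pos = 0
        for i, ch in enumerate(text):
            if i % chunk == 0:
                self.offsets.append(pos)      # byte offset of character i
            pos += len(ch.encode("utf-8"))

    def __getitem__(self, index: int) -> str:
        if not 0 <= index < self.length:
            raise IndexError("string index out of range")
        pos = self.offsets[index // self.chunk]   # jump to the checkpoint
        for _ in range(index % self.chunk):       # then scan within the chunk
            pos += 1
            while self.data[pos] & 0xC0 == 0x80:  # skip continuation bytes
                pos += 1
        end = pos + 1
        while end < len(self.data) and self.data[end] & 0xC0 == 0x80:
            end += 1
        return self.data[pos:end].decode("utf-8")
```

The table costs one integer per chunk, so the memory overhead and the
scan length can be traded against each other through the chunk size.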

-tkc





#72629

From: Marko Rauhamaa <marko@pacujo.net>
Date: 2014-06-04 14:57 +0300
Message-ID: <8761kgvqdr.fsf@elektro.pacujo.net>
In reply to: #72626

Tim Chase <python.list@tim.thechases.com>:

> On 2014-06-04 00:58, Paul Rubin wrote:
>> I've never understood why not use UTF-8 for everything.
>
> If you use UTF-8 for everything, then you end up in a world where
> string-indexing (see ChrisA's other side thread on this topic) is no
> longer an O(1) operation, but an O(N) operation.

Most string operations are O(N) anyway. Besides, you could try and be
smart and keep a recent index cached so simple for loops would be O(N)
instead of O(N**2). So the idea of keeping strings internally in UTF-8
might not be all that bad.
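
A minimal sketch of the cached-index idea, assuming valid UTF-8 and
non-negative indexes; the class and attribute names are hypothetical.
It remembers the last (character index, byte offset) pair, so a loop
over increasing indexes only ever scans forward from the previous
position.

```python
class CachedUtf8:
    """UTF-8 storage that remembers where the last lookup ended up."""

    def __init__(self, text: str):
        self.data = text.encode("utf-8")
        self._char = 0      # character index of the cached position
        self._byte = 0      # byte offset matching self._char

    def _next(self, pos: int) -> int:
        """Byte offset of the character following the one at pos."""
        if pos >= len(self.data):
            raise IndexError("string index out of range")
        pos += 1
        while pos < len(self.data) and self.data[pos] & 0xC0 == 0x80:
            pos += 1
        return pos

    def __getitem__(self, index: int) -> str:
        if index < 0:
            raise IndexError("negative indexes are not handled in this sketch")
        if index < self._char:
            self._char = self._byte = 0     # going backwards: restart the scan
        while self._char < index:
            self._byte = self._next(self._byte)
            self._char += 1
        end = self._next(self._byte)
        return self.data[self._byte:end].decode("utf-8")
```

Sequential access is cheap with this scheme, but random or backward
access still pays for a scan, so it does not restore O(1) indexing in
general.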


Marko



#72632

From: Tim Chase <python.list@tim.thechases.com>
Date: 2014-06-04 07:25 -0500
Message-ID: <mailman.10701.1401884774.18130.python-list@python.org>
In reply to: #72629

On 2014-06-04 14:57, Marko Rauhamaa wrote:
> > If you use UTF-8 for everything, then you end up in a world where
> > string-indexing (see ChrisA's other side thread on this topic) is
> > no longer an O(1) operation, but an O(N) operation.  
> 
> Most string operations are O(N) anyway. Besides, you could try and
> be smart and keep a recent index cached so simple for loops would
> be O(N) instead of O(N**2). So the idea of keeping strings
> internally in UTF-8 might not be all that bad.

As mentioned elsewhere, I've got a LOT of code that expects string
indexing to be O(1), and those strings/offsets are rarely reused.
I'm streaming through customer/provider data files, so caching
wouldn't do much good other than waste space and the time to maintain
the caches.

If I knew that string indexing was O(something non-constant), I'd
have retooled my algorithms to take that into consideration, but that
would be a lot of code I'd need to touch.

-tkc




#72651

From: Paul Rubin <no.email@nospam.invalid>
Date: 2014-06-04 11:25 -0700
Message-ID: <7x1tv4v8et.fsf@ruckus.brouhaha.com>
In reply to: #72632

Tim Chase <python.list@tim.thechases.com> writes:
> As mentioned elsewhere, I've got a LOT of code that expects that
> string indexing is O(1) and rarely are those strings/offsets reused.
> I'm streaming through customer/provider data files, so caching
> wouldn't do much good other than waste space and the time to maintain
> them.

I'm having trouble understanding -- if they're only used once then
what's the problem?  You're reading some enormous file into a string and
then randomly accessing it by character offset?  What size are these
strings?  I can think of a number of workarounds including language
extensions, but mostly I'd be interested in seeing some actual
benchmarks of your unmodified program under both representations.



#72628

From: Robin Becker <robin@reportlab.com>
Date: 2014-06-04 12:53 +0100
Message-ID: <mailman.10699.1401882811.18130.python-list@python.org>
In reply to: #72614

On 04/06/2014 12:01, Tim Chase wrote:
> On 2014-06-04 00:58, Paul Rubin wrote:
>> Steven D'Aprano <steve@pearwood.info> writes:
>>>> Maybe there's a use-case for a microcontroller that works in
>>>> ISO-8859-5 natively, thus using only eight bits per character,
>>> That won't even make the Russians happy, since in Russia there
>>> are multiple incompatible legacy encodings.
>>
>> I've never understood why not use UTF-8 for everything.
>
> If you use UTF-8 for everything, then you end up in a world where
> string-indexing (see ChrisA's other side thread on this topic) is no
> longer an O(1) operation, but an O(N) operation.  Some of us slice
> strings for a living. ;-)  I understand that using UTF-32 would allow
> us to maintain O(1) indexing at the cost of every string occupying 4
> bytes per character.  The FSR (again, as I understand it) allows
> strings that fit in one-byte-per-character to use that, scaling up to
> use wider characters internally as they're actually needed/used.
>
........
I believe that we should distinguish between glyph/character indexing and string 
indexing. Even in unicode it may be hard to decide where a visual glyph starts 
and ends. I assume most people would like to assign one glyph to one unicode, 
but that's not always possible with composed glyphs.

 >>> for a in (u'\xc5',u'A\u030a'):
... 	for o in (u'\xf6',u'o\u0308'):
... 		u=a+u'ngstr'+o+u'm'
... 		print("%s %s" % (repr(u),u))
...
u'\xc5ngstr\xf6m' Ångström
u'\xc5ngstro\u0308m' Ångström
u'A\u030angstr\xf6m' Ångström
u'A\u030angstro\u0308m' Ångström
 >>> u'\xc5ngstr\xf6m'==u'\xc5ngstro\u0308m'
False

so even unicode doesn't always allow for O(1) glyph indexing. I know this is 
artificial, but this is the same situation as utf8 faces just the frequency of 
occurrence is different. A very large amount of computing is still western 
centric so searching a byte string for latin characters is still efficient; 
searching for an n with a tilde on top might not be so easy.
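
The mismatch shown in the session above can be removed for this
particular case by normalizing both strings first. A short Python 3
sketch using the standard unicodedata module (the variable names are
illustrative):

```python
import unicodedata

composed = "\xc5ngstr\xf6m"             # precomposed letters
decomposed = "A\u030angstro\u0308m"     # base letters + combining marks

print(composed == decomposed)           # compared code point by code point
print(len(composed), len(decomposed))   # lengths in code points

# NFC recombines base letters with their combining marks where possible.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == composed)
```

Normalization makes the comparison behave, but indexing still works on
code points, so in the decomposed form a combining mark remains a
separate item.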
-- 
Robin Becker



#72630

From: Marko Rauhamaa <marko@pacujo.net>
Date: 2014-06-04 15:17 +0300
Message-ID: <871tv4vpgk.fsf@elektro.pacujo.net>
In reply to: #72628

Robin Becker <robin@reportlab.com>:

>>>> u'\xc5ngstr\xf6m'==u'\xc5ngstro\u0308m'
> False

Now *that* would be a valid reason for our resident Unicode expert to
complain! Py3 in no way solves text representation issues definitively.

> I know this is artificial

Not at all. It probably is out of scope for Python, but it is a real
cause for human suffering. What's Unicode for "résumé"?

Note, for example, that Google manages to sort out issues like these. It
sees past diacritics and even case ending.


Marko



#72633

From: Robin Becker <robin@reportlab.com>
Date: 2014-06-04 13:31 +0100
Message-ID: <mailman.10702.1401885085.18130.python-list@python.org>
In reply to: #72630

On 04/06/2014 13:17, Marko Rauhamaa wrote:
.........
>
> Note, for example, that Google manages to sort out issues like these. It
> sees past diacritics and even case ending.
.....
I guess they must normalize all inputs to some standard form and then search / 
eigenvectorize on those. There are quite a few diacritics and a fair few glyphs 
they could be applied to. I don't think it likely they could map all possible 
combinations to a private range.
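
As a guess at what "normalize to some standard form" could look like in
Python 3, and not a description of what any search engine actually does,
here is a small sketch. search_key is a hypothetical helper that
decomposes the text, drops combining marks, and folds case.

```python
import unicodedata


def search_key(text: str) -> str:
    """Fold diacritics and case so visually similar spellings compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


print(search_key("R\xe9sum\xe9"))
print(search_key("re\u0301sume\u0301") == search_key("RESUME"))
```

Dropping diacritics is lossy and wrong for languages where the marked
letter is a distinct letter, so a key like this suits matching, not
storage or display.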
-- 
Robin Becker



#72637

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2014-06-04 13:51 +0000
Message-ID: <538f246d$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to: #72628

On Wed, 04 Jun 2014 12:53:19 +0100, Robin Becker wrote:

> I believe that we should distinguish between glyph/character indexing
> and string indexing. Even in unicode it may be hard to decide where a
> visual glyph starts and ends. I assume most people would like to assign
> one glyph to one unicode, but that's not always possible with composed
> glyphs.
> 
>  >>> for a in (u'\xc5',u'A\u030a'):
> ... 	for o in (u'\xf6',u'o\u0308'):
> ... 		u=a+u'ngstr'+o+u'm'
> ... 		print("%s %s" % (repr(u),u))
> ...
> u'\xc5ngstr\xf6m' Ångström
> u'\xc5ngstro\u0308m' Ångström
> u'A\u030angstr\xf6m' Ångström
> u'A\u030angstro\u0308m' Ångström
> >>> u'\xc5ngstr\xf6m'==u'\xc5ngstro\u0308m'
> False
> 
> so even unicode doesn't always allow for O(1) glyph indexing.

What you're talking about here is "graphemes", not glyphs. Glyphs are the 
little pictures that represent the characters when written down. 
Graphemes (technically, "grapheme clusters") are the things which native 
speakers of a language believe ought to be considered a single unit. 
Think of them as similar to letters. That can be quite tricky to 
determine, and is dependent on the language you are speaking. The letters 
"ch" are considered two letters in English, but only a single letter in 
Czech and Slovak.

I believe that *grapheme-aware* text processing is *far* too complicated 
for a programming language to promise. If you think that len() needs to 
count graphemes, then what should len("ch") return, 1 or 2? Grapheme 
processing is a complex, complicated task best left up to powerful 
libraries built on top of a sturdy Unicode base.
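
For reference, this is what Python 3's len() does today: it counts code
points and knows nothing about graphemes. A tiny illustration:

```python
single = "\xc5"           # one precomposed code point
combined = "A\u030a"      # base letter followed by a combining ring

print(len("ch"))                         # two code points
print(len(single), len(combined))        # same visible letter, different lengths
print([hex(ord(c)) for c in combined])   # the individual code points
```

Anything smarter than this needs language-specific rules, which is the
argument for leaving it to a library.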

> I know this is artificial, 

But it isn't artificial in the least. Unicode isn't complicated because 
it's badly designed, or complicated for the sake of complexity. It's 
complicated because human language is complicated. That, and because of 
legacy encodings.


> but this is the same situation as utf8 faces just
> the frequency of occurrence is different. A very large amount of
> computing is still western centric so searching a byte string for latin
> characters is still efficient; searching for an n with a tilde on top
> might not be so easy.

This is a good point, but on balance I disagree. A grapheme-aware library 
is likely to need to be based on more complex data structures than simple 
strings (arrays of code points). But for the underlying relatively simple 
string library, graphemes are too hard. Code points are simple, and the 
language can deal with code points without caring about their semantics. 
For instance, in English, I might not want to insert letters between the 
q and u of "queen", since in English u (nearly) always follows q. It 
would be inappropriate for the programming language string library to 
care about that, and similarly it would be inappropriate for it to care 
that u'A\u030a' represents a single grapheme Å.
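
As an illustration of leaving this to a library layer (a sketch under assumptions, not from the original post): the standard unicodedata module can normalize the Ångström strings quoted above so that they compare equal, while str itself keeps comparing plain code points. The variable names are illustrative only.

```python
import unicodedata

composed = "\xc5ngstr\xf6m"        # all precomposed
mixed = "\xc5ngstro\u0308m"        # o followed by a combining diaeresis

# The string type compares code points, so these differ.
print(composed == mixed)                                   # False

# Normalizing to NFC composes o + U+0308 into U+00F6.
print(unicodedata.normalize("NFC", mixed) == composed)     # True

# Normalizing to NFD decomposes both accented letters instead.
nfd = unicodedata.normalize("NFD", composed)
print(len(composed), len(nfd))                             # 8 10
```

Whether to normalize, and to which form, is a decision for the application; the language only supplies the tool.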



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/



#73076

From: wxjmfauth@gmail.com
Date: 2014-06-10 00:32 -0700
Message-ID: <0a6ebce7-aa3f-4374-a0a1-004e421a2e15@googlegroups.com>
In reply to: #72628
On Wednesday, 4 June 2014 13:53:19 UTC+2, Robin Becker wrote:
> On 04/06/2014 12:01, Tim Chase wrote:
> > On 2014-06-04 00:58, Paul Rubin wrote:
> >> Steven D'Aprano <steve@pearwood.info> writes:
> >>>> Maybe there's a use-case for a microcontroller that works in
> >>>> ISO-8859-5 natively, thus using only eight bits per character,
> >>> That won't even make the Russians happy, since in Russia there
> >>> are multiple incompatible legacy encodings.
> >>
> >> I've never understood why not use UTF-8 for everything.
> >
> > If you use UTF-8 for everything, then you end up in a world where
> > string-indexing (see ChrisA's other side thread on this topic) is no
> > longer an O(1) operation, but an O(N) operation.  Some of us slice
> > strings for a living. ;-)  I understand that using UTF-32 would allow
> > us to maintain O(1) indexing at the cost of every string occupying 4
> > bytes per character.  The FSR (again, as I understand it) allows
> > strings that fit in one-byte-per-character to use that, scaling up to
> > use wider characters internally as they're actually needed/used.
> >
> ........
> I believe that we should distinguish between glyph/character indexing and string
> indexing. Even in unicode it may be hard to decide where a visual glyph starts
> and ends. I assume most people would like to assign one glyph to one unicode,
> but that's not always possible with composed glyphs.
>
>  >>> for a in (u'\xc5',u'A\u030a'):
> ... 	for o in (u'\xf6',u'o\u0308'):
> ... 		u=a+u'ngstr'+o+u'm'
> ... 		print("%s %s" % (repr(u),u))
> ...
> u'\xc5ngstr\xf6m' Ångström
> u'\xc5ngstro\u0308m' Ångström
> u'A\u030angstr\xf6m' Ångström
> u'A\u030angstro\u0308m' Ångström
>  >>> u'\xc5ngstr\xf6m'==u'\xc5ngstro\u0308m'
> False
>
> so even unicode doesn't always allow for O(1) glyph indexing. I know this is
> artificial, but this is the same situation as utf8 faces just the frequency of
> occurrence is different. A very large amount of computing is still western
> centric so searching a byte string for latin characters is still efficient;
> searching for an n with a tilde on top might not be so easy.
> -- 
> Robin Becker

=========

Python succeeded to become an anti-unicode product!

jmf



#73080

From: wxjmfauth@gmail.com
Date: 2014-06-10 02:13 -0700
Message-ID: <0f0a2fbe-48df-46e0-a9a0-65896f02e22c@googlegroups.com>
In reply to: #73076
On Tuesday, 10 June 2014 09:32:34 UTC+2, wxjm...@gmail.com wrote:
> On Wednesday, 4 June 2014 13:53:19 UTC+2, Robin Becker wrote:
> > On 04/06/2014 12:01, Tim Chase wrote:
> > > On 2014-06-04 00:58, Paul Rubin wrote:
> > >> Steven D'Aprano <steve@pearwood.info> writes:
> > >>>> Maybe there's a use-case for a microcontroller that works in
> > >>>> ISO-8859-5 natively, thus using only eight bits per character,
> > >>> That won't even make the Russians happy, since in Russia there
> > >>> are multiple incompatible legacy encodings.
> > >>
> > >> I've never understood why not use UTF-8 for everything.
> > >
> > > If you use UTF-8 for everything, then you end up in a world where
> > > string-indexing (see ChrisA's other side thread on this topic) is no
> > > longer an O(1) operation, but an O(N) operation.  Some of us slice
> > > strings for a living. ;-)  I understand that using UTF-32 would allow
> > > us to maintain O(1) indexing at the cost of every string occupying 4
> > > bytes per character.  The FSR (again, as I understand it) allows
> > > strings that fit in one-byte-per-character to use that, scaling up to
> > > use wider characters internally as they're actually needed/used.
> > >
> > ........
> > I believe that we should distinguish between glyph/character indexing and string
> > indexing. Even in unicode it may be hard to decide where a visual glyph starts
> > and ends. I assume most people would like to assign one glyph to one unicode,
> > but that's not always possible with composed glyphs.
> >
> >  >>> for a in (u'\xc5',u'A\u030a'):
> > ... 	for o in (u'\xf6',u'o\u0308'):
> > ... 		u=a+u'ngstr'+o+u'm'
> > ... 		print("%s %s" % (repr(u),u))
> > ...
> > u'\xc5ngstr\xf6m' Ångström
> > u'\xc5ngstro\u0308m' Ångström
> > u'A\u030angstr\xf6m' Ångström
> > u'A\u030angstro\u0308m' Ångström
> >  >>> u'\xc5ngstr\xf6m'==u'\xc5ngstro\u0308m'
> > False
> >
> > so even unicode doesn't always allow for O(1) glyph indexing. I know this is
> > artificial, but this is the same situation as utf8 faces just the frequency of
> > occurrence is different. A very large amount of computing is still western
> > centric so searching a byte string for latin characters is still efficient;
> > searching for an n with a tilde on top might not be so easy.
> > -- 
> > Robin Becker
>
> =========
>
> Python succeeded to become an anti-unicode product!
>
> jmf

-----

And deeply buggy!



#72631

From: Tim Chase <python.list@tim.thechases.com>
Date: 2014-06-04 07:21 -0500
Message-ID: <mailman.10700.1401884522.18130.python-list@python.org>
In reply to: #72614
On 2014-06-04 12:53, Robin Becker wrote:
> > If you use UTF-8 for everything, then you end up in a world where
> > string-indexing (see ChrisA's other side thread on this topic) is
> > no longer an O(1) operation, but an O(N) operation.  Some of us
> > slice strings for a living. ;-)
> ........
> I believe that we should distinguish between glyph/character
> indexing and string indexing. 

I'm only talking about string indexing using my_string[some_slice]
which is traditionally O(1) and breaking that [cw]ould cause
unexpected performance degradation.
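
A rough sketch of why indexing a UTF-8 buffer by code point costs O(N) (an editorial illustration only; this is not how CPython or Micro Python actually store strings, and utf8_index is a hypothetical helper). Because each code point takes between one and four bytes, finding the n-th one means walking the bytes from the start and counting lead bytes.

```python
def utf8_index(data, n):
    """Return the n-th code point (0-based) of UTF-8 encoded bytes.

    The scan starts at byte 0 every time, so the cost grows with n.
    """
    if n < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    count = -1
    start = None
    for pos, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; anything else
        # begins a new code point.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == n:
                start = pos
            elif count == n + 1:
                return data[start:pos].decode("utf-8")
    if start is not None:
        # The requested code point was the last one in the buffer.
        return data[start:].decode("utf-8")
    raise IndexError("code point index out of range")


data = "\xc5ngstr\xf6m".encode("utf-8")   # 10 bytes for 8 code points
print(len(data))              # 10
print(utf8_index(data, 6))    # the o-diaeresis, found only after scanning
```

With a fixed-width representation the same lookup is a single offset calculation, which is the O(1) behaviour being defended here.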

-tkc


