

Groups > comp.lang.python > #72553 > unrolled thread

Re: Micro Python -- a lean and efficient implementation of Python 3

Started by: Paul Sokolovsky <pmiscml@gmail.com>
First post: 2014-06-04 00:41 +0300
Last post: 2014-06-04 17:10 +1000
Articles: 15 on this page of 35; 15 participants


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.


Contents

  Re: Micro Python -- a lean and efficient implementation of Python 3 Paul Sokolovsky <pmiscml@gmail.com> - 2014-06-04 00:41 +0300
    Re: Micro Python -- a lean and efficient implementation of Python 3 Rustom Mody <rustompmody@gmail.com> - 2014-06-03 20:37 -0700
      Re: Micro Python -- a lean and efficient implementation of Python 3 Chris Angelico <rosuav@gmail.com> - 2014-06-04 13:52 +1000
        Re: Micro Python -- a lean and efficient implementation of Python 3 Rustom Mody <rustompmody@gmail.com> - 2014-06-03 21:40 -0700
          Re: Micro Python -- a lean and efficient implementation of Python 3 Ian Kelly <ian.g.kelly@gmail.com> - 2014-06-03 23:02 -0600
          Re: Micro Python -- a lean and efficient implementation of Python 3 Chris Angelico <rosuav@gmail.com> - 2014-06-04 17:16 +1000
            Re: Micro Python -- a lean and efficient implementation of Python 3 Steven D'Aprano <steve@pearwood.info> - 2014-06-04 07:42 +0000
              Re: Micro Python -- a lean and efficient implementation of Python 3 Paul Rubin <no.email@nospam.invalid> - 2014-06-04 00:58 -0700
                Re: Micro Python -- a lean and efficient implementation of Python 3 Robin Becker <robin@reportlab.com> - 2014-06-04 11:06 +0100
                Re: Micro Python -- a lean and efficient implementation of Python 3 Tim Chase <python.list@tim.thechases.com> - 2014-06-04 06:01 -0500
                  Re: Micro Python -- a lean and efficient implementation of Python 3 Marko Rauhamaa <marko@pacujo.net> - 2014-06-04 14:57 +0300
                    Re: Micro Python -- a lean and efficient implementation of Python 3 Tim Chase <python.list@tim.thechases.com> - 2014-06-04 07:25 -0500
                      Re: Micro Python -- a lean and efficient implementation of Python 3 Paul Rubin <no.email@nospam.invalid> - 2014-06-04 11:25 -0700
                Re: Micro Python -- a lean and efficient implementation of Python 3 Robin Becker <robin@reportlab.com> - 2014-06-04 12:53 +0100
                  Re: Micro Python -- a lean and efficient implementation of Python 3 Marko Rauhamaa <marko@pacujo.net> - 2014-06-04 15:17 +0300
                    Re: Micro Python -- a lean and efficient implementation of Python 3 Robin Becker <robin@reportlab.com> - 2014-06-04 13:31 +0100
                  Re: Micro Python -- a lean and efficient implementation of Python 3 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-04 13:51 +0000
                  Re: Micro Python -- a lean and efficient implementation of Python 3 wxjmfauth@gmail.com - 2014-06-10 00:32 -0700
                    Re: Micro Python -- a lean and efficient implementation of Python 3 wxjmfauth@gmail.com - 2014-06-10 02:13 -0700
                Re: Micro Python -- a lean and efficient implementation of Python 3 Tim Chase <python.list@tim.thechases.com> - 2014-06-04 07:21 -0500
                Re: Micro Python -- a lean and efficient implementation of Python 3 Travis Griggs <travisgriggs@gmail.com> - 2014-06-06 09:59 -0700
                  Re: Micro Python -- a lean and efficient implementation of Python 3 Roy Smith <roy@panix.com> - 2014-06-06 13:29 -0400
                Re: Micro Python -- a lean and efficient implementation of Python 3 Tim Chase <python.list@tim.thechases.com> - 2014-06-06 21:20 -0500
                  Re: Micro Python -- a lean and efficient implementation of Python 3 wxjmfauth@gmail.com - 2014-06-10 12:27 -0700
          Re: Micro Python -- a lean and efficient implementation of Python 3 Chris Angelico <rosuav@gmail.com> - 2014-06-04 17:20 +1000
          Re: Micro Python -- a lean and efficient implementation of Python 3 Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> - 2014-06-04 10:00 +0200
        Re: Micro Python -- a lean and efficient implementation of Python 3 Roy Smith <roy@panix.com> - 2014-06-04 14:42 -0400
          Re: Micro Python -- a lean and efficient implementation of Python 3 Rustom Mody <rustompmody@gmail.com> - 2014-06-04 19:06 -0700
            Re: Micro Python -- a lean and efficient implementation of Python 3 Roy Smith <roy@panix.com> - 2014-06-05 09:59 -0400
              Re: Micro Python -- a lean and efficient implementation of Python 3 Chris Angelico <rosuav@gmail.com> - 2014-06-06 01:33 +1000
      Re: Micro Python -- a lean and efficient implementation of Python 3 Steven D'Aprano <steve@pearwood.info> - 2014-06-04 05:20 +0000
        Re: Micro Python -- a lean and efficient implementation of Python 3 Rustom Mody <rustompmody@gmail.com> - 2014-06-03 22:36 -0700
        Re: Micro Python -- a lean and efficient implementation of Python 3 Ian Kelly <ian.g.kelly@gmail.com> - 2014-06-03 23:55 -0600
        Re: Micro Python -- a lean and efficient implementation of Python 3 Terry Reedy <tjreedy@udel.edu> - 2014-06-04 03:00 -0400
        Re: Micro Python -- a lean and efficient implementation of Python 3 Chris Angelico <rosuav@gmail.com> - 2014-06-04 17:10 +1000



#72861

From: Travis Griggs <travisgriggs@gmail.com>
Date: 2014-06-06 09:59 -0700
Message-ID: <mailman.10822.1402073958.18130.python-list@python.org>
In reply to: #72614

On Jun 4, 2014, at 4:01 AM, Tim Chase <python.list@tim.thechases.com> wrote:

> If you use UTF-8 for everything

It seems to me that, increasingly, other libraries (C, etc.) use utf8 as the preferred string interchange format. It’s universal, not prone to endian issues, etc. So one *advantage* you gain from using utf8 internally is that any time you need to hand a string to an external thing, it’s just ready. An app that restricts its internal string processing to streaming-based operations but has to hand strings to external libraries a lot (e.g. cairo) might actually benefit from using utf8 internally, because a) it’s not doing the linear search for the odd character address and b) it no longer needs to decode/encode every time it sends a string to, or receives one from, an external library.



#72871

From: Roy Smith <roy@panix.com>
Date: 2014-06-06 13:29 -0400
Message-ID: <roy-4FAEBF.13291606062014@news.panix.com>
In reply to: #72861

In article <mailman.10822.1402073958.18130.python-list@python.org>,
 Travis Griggs <travisgriggs@gmail.com> wrote:

> On Jun 4, 2014, at 4:01 AM, Tim Chase <python.list@tim.thechases.com> wrote:
> 
> > If you use UTF-8 for everything
> 
> It seems to me, that increasingly other libraries (C, etc), use utf8 as the 
> preferred string interchange format. It's universal, not prone to endian 
> issues, etc.

One of the important etc factors is, "Since it's the most commonly used, 
it's the one that other people are most likely to have implemented 
correctly".  In the real world, these are important considerations.



#72897

From: Tim Chase <python.list@tim.thechases.com>
Date: 2014-06-06 21:20 -0500
Message-ID: <mailman.10843.1402107662.18130.python-list@python.org>
In reply to: #72614

On 2014-06-06 09:59, Travis Griggs wrote:
> On Jun 4, 2014, at 4:01 AM, Tim Chase wrote:
> > If you use UTF-8 for everything
> 
> It seems to me, that increasingly other libraries (C, etc), use
> utf8 as the preferred string interchange format.

I definitely advocate UTF-8 for any streaming scenario, as you're
iterating unidirectionally over the data anyways, so why use/transmit
more bytes than needed.  The only failing of UTF-8 that I've found in
the real world(*) is when you have the requirement of constant-time
indexing into strings.

-tkc
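
To illustrate the constant-time indexing point, here is a minimal sketch of fetching the N-th character from a UTF-8 byte buffer. The helper names `utf8_char_at` and `_utf8_width` are made up for this example and the input is assumed to be valid UTF-8; the point is only that the lookup has to walk the buffer from the start, so it costs O(n) rather than O(1).

```python
def _utf8_width(lead: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with `lead`."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def utf8_char_at(data: bytes, index: int) -> str:
    """Return the code point at position `index` of UTF-8 encoded `data`.

    Hypothetical helper: every lookup skips over `index` characters first.
    """
    if index < 0:
        raise IndexError("negative index not supported in this sketch")
    pos = 0
    for _ in range(index):
        if pos >= len(data):
            raise IndexError("string index out of range")
        pos += _utf8_width(data[pos])
    if pos >= len(data):
        raise IndexError("string index out of range")
    return data[pos:pos + _utf8_width(data[pos])].decode("utf-8")


text = "a\u00e9\u20ac\U0001F600z"      # 1-, 2-, 3-, 4- and 1-byte characters
data = text.encode("utf-8")
print([utf8_char_at(data, i) for i in range(len(text))])
```

A streaming consumer never pays this cost, because it only ever moves forward one character at a time.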





#73117

From: wxjmfauth@gmail.com
Date: 2014-06-10 12:27 -0700
Message-ID: <ac6d6893-d11b-4c8e-bf38-5f4200fcc163@googlegroups.com>
In reply to: #72897

On Saturday, June 7, 2014 04:20:22 UTC+2, Tim Chase wrote:
> On 2014-06-06 09:59, Travis Griggs wrote:
> > On Jun 4, 2014, at 4:01 AM, Tim Chase wrote:
> > > If you use UTF-8 for everything
> >
> > It seems to me, that increasingly other libraries (C, etc), use
> > utf8 as the preferred string interchange format.
>
> I definitely advocate UTF-8 for any streaming scenario, as you're
> iterating unidirectionally over the data anyways, so why use/transmit
> more bytes than needed.  The only failing of UTF-8 that I've found in
> the real world(*) is when you have the requirement of constant-time
> indexing into strings.
>
> -tkc

And once again, just an illustration,

>>> import sys, timeit
>>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'; y = 'z'")
[0.9457552436453511, 0.9190932610143818, 0.9322044912393039]
>>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'; y = '\u0fce'")
[2.5541921791045183, 2.52434366066052, 2.5337417948967413]
>>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'.encode('utf-8'); y = 'z'.encode('utf-8')")
[0.9168235779232532, 0.8989583403075017, 0.8964204541650247]
>>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'.encode('utf-8'); y = '\u0fce'.encode('utf-8')")
[0.9320969737165115, 0.9086006535332558, 0.9051715140790861]
>>> 
>>> 
>>> sys.getsizeof('abc'*1000 + '\u0fce')
6040
>>> sys.getsizeof(('abc'*1000 + '\u0fce').encode('utf-8'))
3020
>>>


But you know, that's not the problem.

When I see a core developer discussing benchmarking
while the same application becomes 1, 2, 5, 10, 20 times
slower, if not more, with non-ASCII characters than with pure
ASCII, I wonder whether there is a serious problem
somewhere.

(And it also becomes slower than Py3.2.)

BTW, very easy to explain.

I do not understand why "free, open, what-you-wish-here, ..."
software so often pushes people toward adopting serious
corporate products.

jmf
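
For context on the getsizeof figures above: since CPython 3.3 the flexible string representation (the FSR mentioned elsewhere in this thread) stores each string with 1, 2, or 4 bytes per character, chosen by the widest code point it contains. The sketch below assumes CPython 3.3 or later; the exact byte counts depend on the interpreter version and platform, so only the ordering is meaningful.

```python
import sys

# Three strings of identical length whose widest character differs.
samples = {
    "ascii  (1 byte/char)": "abc" * 1000 + "z",
    "BMP    (2 bytes/char)": "abc" * 1000 + "\u0fce",
    "astral (4 bytes/char)": "abc" * 1000 + "\U0001F600",
}

sizes = {label: sys.getsizeof(text) for label, text in samples.items()}
for label, size in sizes.items():
    print(f"{label}: {size} bytes for {len(samples[label])} code points")
```

A single wide character widens the whole string, which is why appending '\u0fce' roughly doubles the reported size and also makes the concatenation copy more memory.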



#72608

From: Chris Angelico <rosuav@gmail.com>
Date: 2014-06-04 17:20 +1000
Message-ID: <mailman.10686.1401866454.18130.python-list@python.org>
In reply to: #72588

On Wed, Jun 4, 2014 at 3:02 PM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> On Tue, Jun 3, 2014 at 10:40 PM, Rustom Mody <rustompmody@gmail.com> wrote:
>>> 1) Most or all Chinese and Japanese characters
>>
>> Don't know how you count 'most'
>>
>> | One possible rationale is the desire to limit the size of the full
>> | Unicode character set, where CJK characters as represented by discrete
>> | ideograms may approach or exceed 100,000 (while those required for
>> | ordinary literacy in any language are probably under 3,000). Version 1
>> | of Unicode was designed to fit into 16 bits and only 20,940 characters
>> | (32%) out of the possible 65,536 were reserved for these CJK Unified
>> | Ideographs. Later Unicode has been extended to 21 bits allowing many
>> | more CJK characters (75,960 are assigned, with room for more).
>>
>> | From http://en.wikipedia.org/wiki/Han_unification
>
> So there are 20,940 CJK characters in the BMP, and approximately
> 55,000 more in the SIP.  I'd count 55,000 out of 75,960 as "most".

And I said "or all" because I have this vague notion that either NFC
or NFD pushes stuff out of the BMP, although I may be wrong on that.
But certainly 55K/75K "with room for more" is the "most" that I was
talking about. (Maybe it isn't "most" by usage. After all, hypertext
documents are usually smaller in UTF-8 than in UTF-16, despite "most
characters" (counting purely by 21-bit space in codepoints) being more
compact in UTF-16; most by usage is of ASCII, because hypertext
involves a lot of punctuation and such. But still, there are a lot of
CJK that aren't in the BMP.)

ChrisA
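
The size comparison above can be checked directly. This is a small sketch, not a benchmark: the sample strings are invented for the example, and `encoded_sizes` is a hypothetical helper that measures UTF-8 and UTF-16 without a byte-order mark.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 size, UTF-16 size) in bytes, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


html = '<p class="note">Hello, <a href="/x">world</a></p>'  # ASCII-heavy markup
bmp_cjk = "\u4e2d\u6587"        # two ideographs inside the BMP
sip_cjk = "\U00020000"          # one ideograph outside the BMP (plane 2)

for label, text in [("html", html), ("BMP CJK", bmp_cjk), ("SIP CJK", sip_cjk)]:
    print(label, encoded_sizes(text))
```

ASCII-heavy text is half the size in UTF-8, BMP ideographs take 3 bytes in UTF-8 against 2 in UTF-16, and characters outside the BMP take 4 bytes in both.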



#72615

From: Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de>
Date: 2014-06-04 10:00 +0200
Message-ID: <mailman.10688.1401868854.18130.python-list@python.org>
In reply to: #72588

On 04.06.2014 09:16, Chris Angelico wrote:
> The point is
> not that you might be able to get away with sticking your head in the
> sand and wishing Unicode would just go away. Even if you can, it's not
> something Python 3 can ever do.
>

Exactly. These endless discussions about different encodings start to 
get really boring. I cannot think of any aspect of it that hasn't been 
discussed here on several occasions, but as a fact:

"Strings are immutable sequences of Unicode code points" in Python3 
(https://docs.python.org/3/library/stdtypes.html?highlight=str#textseq) 
and this is not an implementation detail. So if any "implementation" 
doesn't stick to this convention, it is simply incomplete.

> And I don't think anybody can, anyway. If your device is big enough to
> hold Python, it should be big enough to handle Unicode; and then you
> don't have to say "Oh, sorry rest-of-the-world, this only works in
> English... and only a subset of English... and stuff".
>

Wolfgang
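
As a quick illustration of what "immutable sequences of Unicode code points" means in practice, the following sketch assumes CPython 3.3 or later (older narrow builds counted UTF-16 code units instead). The sample string is arbitrary.

```python
s = "na\u00efve \U0001F600"   # the emoji lies outside the BMP

# Length and indexing count code points, not bytes or UTF-16 code units.
assert len(s) == 7
assert s[2] == "\u00ef"
assert s[-1] == "\U0001F600"
assert [hex(ord(c)) for c in s[-2:]] == ["0x20", "0x1f600"]

# Strings are immutable: item assignment is rejected.
try:
    s[0] = "N"
except TypeError as exc:
    print("immutable:", exc)
```

An implementation that indexed by byte or by 16-bit unit would report a different length for the same text, which is the kind of divergence the documentation rules out.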



#72652

From: Roy Smith <roy@panix.com>
Date: 2014-06-04 14:42 -0400
Message-ID: <roy-77A8E4.14420604062014@news.panix.com>
In reply to: #72583

In article <mailman.10673.1401853976.18130.python-list@python.org>,
 Chris Angelico <rosuav@gmail.com> wrote:

> You can't ignore those. You might be able to say "Well, my program
> will run slower if you throw these at it", but if you're going down
> that route, you probably want the full FSR and the advantages it
> confers on ASCII and Latin-1 strings. Binding your program to BMP-only
> is nearly as dangerous as binding it to ASCII-only; potentially worse,
> because you can run an awful lot of artificial tests without
> remembering to stick in some astral characters.

Yup.  I wrote a while(*) back about the pain I was having importing some 
data into a MySQL(**) database which (unknown to me when I started) only 
handled BMP.  It turns out in the entire dataset of 20-odd million 
records, there were exactly four that had astral characters.  All of my 
tests worked.  I didn't discover the problem until it blew up many hours 
into the "final" production import run.

(*) Two years?

(**) This was not the only pain point with MySQL.  We eventually 
switched to Postgres.
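
A check along the following lines could flag such records before a long import starts. It is only a sketch: `has_astral` and `find_astral_records` are hypothetical helpers, the records are assumed to be already-decoded Python 3 strings, and the sample data is invented.

```python
def has_astral(text: str) -> bool:
    """True if any code point lies outside the Basic Multilingual Plane."""
    return any(ord(ch) > 0xFFFF for ch in text)


def find_astral_records(records):
    """Yield (index, record) for every record containing astral characters."""
    for index, record in enumerate(records):
        if has_astral(record):
            yield index, record


# Invented sample data: only the last record leaves the BMP.
records = ["plain ascii", "caf\u00e9", "\u4e2d\u6587", "clef \U0001D11E"]
print(list(find_astral_records(records)))
```

Running the same scan over test fixtures also shows at a glance whether they contain any astral characters at all.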



#72667

From: Rustom Mody <rustompmody@gmail.com>
Date: 2014-06-04 19:06 -0700
Message-ID: <f935e85f-f86a-4821-86ab-3ab7e5e216d7@googlegroups.com>
In reply to: #72652

On Thursday, June 5, 2014 12:12:06 AM UTC+5:30, Roy Smith wrote:
>  Chris Angelico  wrote:

> > You can't ignore those. You might be able to say "Well, my program
> > will run slower if you throw these at it", but if you're going down
> > that route, you probably want the full FSR and the advantages it
> > confers on ASCII and Latin-1 strings. Binding your program to BMP-only
> > is nearly as dangerous as binding it to ASCII-only; potentially worse,
> > because you can run an awful lot of artificial tests without
> > remembering to stick in some astral characters.

> Yup.  I wrote a while(*) back about the pain I was having importing some 
> data into a MySQL(**) database which (unknown to me when I started) only 
> handled BMP.  It turns out in the entire dataset of 20-odd million 
> records, there were exactly four that had astral characters.  All of my 
> tests worked.  I didn't discover the problem until it blew up many hours 
> into the "final" production import run.

> (*) Two years?

> (**) This was not the only pain point with MySQL.  We eventually 
> switched to Postgres.

Thanks Roy for bringing up that example - I was trying to recollect
the details.  I forgot about the MySQL angle which adds a different
twist to it.

Here's my interpretation of that situation; I'd like to hear yours:

Basic problem was that MySQL handled a strict subset of what the rest
of the system (Python 2.7?)  could handle.  This meant that at a late
(and embarrassing) stage, exceptions were being thrown, from deep
within the system.

OTOH, let's say you could detect the 'error' (more correctly
'un-handle-able') at the borders of your system, say when the user
enters the data on a web-form. Would you have a problem kicking out
those characters (in both senses!) with a curt:

"Can't deal with all this supra-galactic rubble!" ?

Of course switching to postgres may be a sound choice on other fronts.
But if that were not an option, and you only had these choices:

- significantly complexify your MySQL data structures to handle 4 in
  20 million cases
- just detect and throw such cases out at the outset

which would you take?

In any case, this is the choice I hear from the MicroPython folks,
who are explicitly seeking a cut-down version of Python.



#72705

From: Roy Smith <roy@panix.com>
Date: 2014-06-05 09:59 -0400
Message-ID: <roy-A7AB97.09590305062014@news.panix.com>
In reply to: #72667

In article <f935e85f-f86a-4821-86ab-3ab7e5e216d7@googlegroups.com>,
 Rustom Mody <rustompmody@gmail.com> wrote:

> On Thursday, June 5, 2014 12:12:06 AM UTC+5:30, Roy Smith wrote:
> > Yup.  I wrote a while(*) back about the pain I was having importing some 
> > data into a MySQL(**) database

> Here's my interpretation of that situation; I'd like to hear yours:
> 
> Basic problem was that MySQL handled a strict subset of what the rest
> of the system (Python 2.7?)  could handle.

Yes.  This was not a Python issue.  I was just responding to ChrisA's 
statement:

>>> Binding your program to BMP-only is nearly as dangerous as binding 
>>> it to ASCII-only; potentially worse, because you can run an awful 
>>> lot of artificial tests without remembering to stick in some astral 
>>> characters.


> Of course switching to postgres may be a sound choice on other fronts.
> But if that were not an option, and you only had these choices:
> 
> - significantly complexify your MySQL data structures to handle 4 in
>   20 million cases
> - just detect and throw such cases out at the outset
> 
> which would you take?

It turns out we could have upgraded to a newer version of MySQL, which 
did handle astral characters correctly.  But what we did was discard 
the records containing non-BMP data.  Of course, that's a decision that 
can only be made when you understand the business requirements.  In our 
case, discarding those four records had no impact on our business, so it 
made sense.  For other people, not having the full dataset might have 
been a fatal problem.

This was just one of many MySQL problems we ran into.  Eventually, we 
decided it wasn't worth fighting with what was obviously a brain-dead 
system, and switched databases.



#72711

From: Chris Angelico <rosuav@gmail.com>
Date: 2014-06-06 01:33 +1000
Message-ID: <mailman.10738.1401982394.18130.python-list@python.org>
In reply to: #72705

On Thu, Jun 5, 2014 at 11:59 PM, Roy Smith <roy@panix.com> wrote:
> It turns out we could have upgraded to a newer version of MySQL, which
> did handle astral characters correctly.  But what we did was discard
> the records containing non-BMP data.  Of course, that's a decision that
> can only be made when you understand the business requirements.  In our
> case, discarding those four records had no impact on our business, so it
> made sense.  For other people, not having the full dataset might have
> been a fatal problem.
>
> This was just one of many MySQL problems we ran into.  Eventually, we
> decided it wasn't worth fighting with what was obviously a brain-dead
> system, and switched databases.

Point to note: It's not just "Avoid MySQL version x.y.z, it's buggy",
but "Make sure you're on a sufficiently new version of MySQL *and then
use these settings*". For instance, the MySQL "utf8"
locale/collation/charset (not sure what it calls it) supports only the
BMP; you have to use "utf8mb4", which is UTF-8 that's allowed to go as
far as four bytes long.

What were they thinking?

What, were they thinking?

I understand there's now an alias "utf8mb3" for the buggy utf8, with
some theory that some future version of MySQL might make utf8 become
an alias for utf8mb4. But when would you ever actually *demand* this
buggy behaviour? Why not just say "as of this version, utf8 is
identical to utf8mb4, which was a superset thereof", and if anything
changes or breaks, just acknowledge that it used to be buggy?

</rant>

Use PostgreSQL.

</obvious>

ChrisA
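
The "three bytes versus four bytes" distinction is easy to see from Python. The sketch below only models the byte-width limit described above; it does not talk to MySQL, and `max_utf8_width` and `fits_three_byte_utf8` are hypothetical helpers.

```python
def max_utf8_width(text: str) -> int:
    """Largest number of UTF-8 bytes needed by any single code point."""
    return max((len(ch.encode("utf-8")) for ch in text), default=0)


def fits_three_byte_utf8(text: str) -> bool:
    """Hypothetical check: would every character fit a 3-byte-max charset?"""
    return max_utf8_width(text) <= 3


print(max_utf8_width("plain"))            # ASCII only
print(max_utf8_width("\ufffd"))           # near the top of the BMP
print(max_utf8_width("\U00010000"))       # first code point past the BMP
print(fits_three_byte_utf8("clef \U0001D11E"))
```

Every BMP code point fits in at most three UTF-8 bytes, so a three-byte limit and a BMP-only limit describe the same restriction.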



#72595

From: Steven D'Aprano <steve@pearwood.info>
Date: 2014-06-04 05:20 +0000
Message-ID: <538eac94$0$11109$c3e8da3@news.astraweb.com>
In reply to: #72582

On Tue, 03 Jun 2014 20:37:27 -0700, Rustom Mody wrote:

> On Wednesday, June 4, 2014 3:11:12 AM UTC+5:30, Paul Sokolovsky wrote:
> 
>> With that in mind, I, as many others, think that forcing Unicode bloat
>> upon people by default is the most controversial feature of Python3.
>> The reason is that you go very long way dealing with languages of the
>> people of the world by just treating strings as consisting of 8-bit
>> data. I'd say, that's enough for 90% of applications. Unicode is needed
>> only if one needs to deal with multiple languages *at the same time*,
>> which is fairly rare (remaining 10% of apps).
> 
>> And please keep in mind that MicroPython was originally intended for (and
>> should remain scalable down to) an MCU. Unicode is needed there even
>> less, and there are even fewer resources to support Unicode just because.
> 
> At some time (when jmf was making more intelligible noises) I had
> suggested that the choice between 1/2/4 byte strings that happens at
> runtime in python3's FSR can be made at python-start time with a
> command-line switch.  There are many combinations here; here is one in
> more detail:
> 
> Instead of having one (FSR) string engine, you have (upto) 4
> 
> - a pure 1 byte (ASCII)

There are only 128 ASCII characters, so a pure ASCII implementation 
cannot even represent arbitrary bytes.


> - a pure 2 byte (BMP) with decode-failures for out-of-ranges

That's not Unicode. It's a subset of Unicode.


> - a pure 4 byte -- everything UTF-32

For embedded devices, that would be extremely memory hungry. Remember, 
every variable, every attribute name, every method and class and function 
name is a string. Using at least 56 bytes just to refer to 
sys.stdout.write will be painful.
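
The 56-byte figure follows from the 14 characters in the three names at 4 bytes each. A minimal sketch of the arithmetic, counting only the character payload and ignoring any per-object overhead (`payload_bytes` is a hypothetical helper):

```python
names = ["sys", "stdout", "write"]


def payload_bytes(names, encoding):
    """Total encoded size of the identifiers, ignoring per-object overhead."""
    return sum(len(name.encode(encoding)) for name in names)


print("UTF-32:", payload_bytes(names, "utf-32-le"))   # 4 bytes per character
print("UTF-16:", payload_bytes(names, "utf-16-le"))   # 2 bytes per character
print("ASCII: ", payload_bytes(names, "ascii"))       # 1 byte per character
```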


> - FSR dynamic switching at runtime (with massive moping from the world's
> jmfs)

Please stop giving JMF's crackpot opinion even the dignity of being 
sneered at.

[...]
> 2. My casual/cursory reading of the contents of the SMP-planes suggests
> that the stuff there is things like
> - Egyptian hieroglyphics
> - mahjong characters
> - ancient Greek musical symbols
> - alchemical symbols etc etc.
> 
> IOW from pov of a universally acceptable character set this is mostly
> rubbish

Certainly some of these things are more whimsical than practical, but it 
doesn't really matter. Even if you strip out every bit of whimsy from the 
Unicode character set, you're still left with needing more than 65536 
characters (16 bits). For efficiency you aren't going to use 17 bits, or 
18, or 19, so it's actually faster and more efficient to jump right to 32 
bits. For technical reasons which I don't fully understand, Unicode only 
uses 21 of those 32 bits, giving a total of 1114112 available code 
points. Whether you or I personally have need for alchemical symbols, 
*some people* do, and supporting their use-case doesn't harm us by one 
bit.


> And so a pure BMP-supporting implementation may be a reasonable
> compromise. [As long as no surrogate-pairs are there]

At the cost of one extra bit, strings could use UTF-16 internally and 
still have correct behaviour. The bit could be a flag recording whether 
the string contains any surrogate pairs. If the flag was 0, all string 
operations could assume a constant 2-bytes-per-character. If the flag was 
1, it could fall back to walking the string checking for surrogate pairs.
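
A rough sketch of that flag idea follows. The class `Utf16String` is hypothetical and exists only for illustration: it stores UTF-16 code units plus one boolean, indexes in constant time when the flag is clear, and walks the units when it is set. Input is assumed to be valid text with no lone surrogates.

```python
class Utf16String:
    """Hypothetical UTF-16 string with a 'has surrogate pairs' flag."""

    def __init__(self, text: str):
        self._units = text.encode("utf-16-le")
        self._has_pairs = any(ord(ch) > 0xFFFF for ch in text)

    def _unit(self, i: int) -> int:
        """Return the i-th 16-bit code unit as an integer."""
        return int.from_bytes(self._units[2 * i:2 * i + 2], "little")

    def _width(self, pos: int) -> int:
        """2 if the unit at `pos` is a high surrogate, else 1."""
        return 2 if 0xD800 <= self._unit(pos) <= 0xDBFF else 1

    def __getitem__(self, index: int) -> str:
        n_units = len(self._units) // 2
        if index < 0:
            raise IndexError("negative index not supported in this sketch")
        if not self._has_pairs:
            # Fast path: exactly one unit per character, so O(1).
            pos, width = index, 1
        else:
            # Slow path: walk the units, skipping two for each pair.
            pos = 0
            for _ in range(index):
                if pos >= n_units:
                    raise IndexError("string index out of range")
                pos += self._width(pos)
            width = self._width(pos) if pos < n_units else 1
        if pos >= n_units:
            raise IndexError("string index out of range")
        return self._units[2 * pos:2 * (pos + width)].decode("utf-16-le")


s = Utf16String("a\U0001F600b\u4e2d")
print([s[i] for i in range(4)])
```

Strings made only of BMP characters never pay for the walk, which is the trade-off the flag buys.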


-- 
Steven



#72597

From: Rustom Mody <rustompmody@gmail.com>
Date: 2014-06-03 22:36 -0700
Message-ID: <f0a2d25f-3480-4ebc-b41e-603a77b3451d@googlegroups.com>
In reply to: #72595

On Wednesday, June 4, 2014 10:50:21 AM UTC+5:30, Steven D'Aprano wrote:
> On Tue, 03 Jun 2014 20:37:27 -0700, Rustom Mody wrote:
> > And so a pure BMP-supporting implementation may be a reasonable
> > compromise. [As long as no surrogate-pairs are there]

> At the cost of one extra bit, strings could use UTF-16 internally and 
> still have correct behaviour. The bit could be a flag recording whether 
> the string contains any surrogate pairs. If the flag was 0, all string 
> operations could assume a constant 2-bytes-per-character. If the flag was 
> 1, it could fall back to walking the string checking for surrogate pairs.

Yes.  That could be one possibility.  My main reason in giving the
4-engine choice was not that 4 engines are a good idea but that in the
very differently constrained world of μ-controllers playing around with
alternate binding times may be advantageous


> > On Wednesday, June 4, 2014 3:11:12 AM UTC+5:30, Paul Sokolovsky wrote:
> >> With that in mind, I, as many others, think that forcing Unicode bloat
> >> upon people by default is the most controversial feature of Python3.
> >> The reason is that you go very long way dealing with languages of the
> >> people of the world by just treating strings as consisting of 8-bit
> >> data. I'd say, that's enough for 90% of applications. Unicode is needed
> >> only if one needs to deal with multiple languages *at the same time*,
> >> which is fairly rare (remaining 10% of apps).
> >> And please keep in mind that MicroPython was originally intended for (and
> >> should remain scalable down to) an MCU. Unicode is needed there even
> >> less, and there are even fewer resources to support Unicode just because.
> > At some time (when jmf was making more intelligible noises) I had
> > suggested that the choice between 1/2/4 byte strings that happens at
> > runtime in python3's FSR can be made at python-start time with a
> > command-line switch.  There are many combinations here; here is one in
> > more detail:
> > Instead of having one (FSR) string engine, you have (upto) 4
> > - a pure 1 byte (ASCII)

> There are only 128 ASCII characters, so a pure ASCII implementation 
> cannot even represent arbitrary bytes.

Yes this is a subtle point.
I was initially going to write Latin-1. Wrote a rough-n-ready ASCII.
But maybe it could be a choice.

I really don't understand the binding-times of μ-controllers.

My impression is that actual development is split between
1. tinkering with the board
2. working on full-powered computers and downloading to the board

In going from 2 to 1, heavy amounts of cut-downs are probably possible and
desirable. If this is the case, having hooks in the system for making
optimal choices may be worthwhile.



#72599

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2014-06-03 23:55 -0600
Message-ID: <mailman.10679.1401861637.18130.python-list@python.org>
In reply to: #72595

On Jun 3, 2014 11:27 PM, "Steven D'Aprano" <steve@pearwood.info> wrote:
> For technical reasons which I don't fully understand, Unicode only
> uses 21 of those 32 bits, giving a total of 1114112 available code
> points.

I think mainly it's to accommodate UTF-16. The surrogate pair scheme is
sufficient to encode up to 16 supplementary planes, so if Unicode were
allowed to grow any larger than that, UTF-16 would no longer be able to
encode all codepoints.

Another benefit of fixing the size is that it frees the other 11 bits per
character of UTF-32 for packing in ancillary data.
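
The "16 supplementary planes" limit falls out of the surrogate arithmetic: 1024 high surrogates times 1024 low surrogates give 2**20 pairs, which is exactly 16 planes of 65,536 code points, and adding the BMP gives the 1,114,112 total quoted above. A minimal sketch, with `to_surrogates` and `from_surrogates` as hypothetical helper names:

```python
import sys


def to_surrogates(cp: int) -> tuple:
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000                       # a 20-bit value
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogates(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


pairs = 1024 * 1024                             # high choices * low choices
assert pairs == 16 * 0x10000                    # 16 supplementary planes
assert 0x10000 + pairs == 0x110000 == 1114112   # BMP + supplementary planes
assert sys.maxunicode == 0x10FFFF               # CPython 3.3 or later

print([hex(u) for u in to_surrogates(0x1F600)])
```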



#72604

From: Terry Reedy <tjreedy@udel.edu>
Date: 2014-06-04 03:00 -0400
Message-ID: <mailman.10682.1401865221.18130.python-list@python.org>
In reply to: #72595

On 6/4/2014 1:55 AM, Ian Kelly wrote:
>
> On Jun 3, 2014 11:27 PM, "Steven D'Aprano" <steve@pearwood.info> wrote:
>  > For technical reasons which I don't fully understand, Unicode only
>  > uses 21 of those 32 bits, giving a total of 1114112 available code
>  > points.
>
> I think mainly it's to accommodate UTF-16. The surrogate pair scheme is
> sufficient to encode up to 16 supplementary planes, so if Unicode were
> allowed to grow any larger than that, UTF-16 would no longer be able to
> encode all codepoints.

I believe the original utf-8 used up to 6 bytes per char to encode 2**32 
potential chars. Just 4 bytes limits to 2**21 and for whatever reason 
(easier decoding?), utf-8 was revised down (unusual ;-).

-- 
Terry Jan Reedy
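
The bit counts can be worked out from the UTF-8 byte layout: a one-byte sequence carries 7 payload bits, and an n-byte sequence carries (7 - n) bits in the lead byte plus 6 bits in each continuation byte. By that arithmetic four bytes give 21 bits, as stated above, while the old six-byte form gives 31 bits rather than a full 32. The helper name `utf8_payload_bits` is made up for this sketch.

```python
def utf8_payload_bits(n_bytes: int) -> int:
    """Payload bits carried by a UTF-8 sequence of `n_bytes` bytes."""
    if n_bytes == 1:
        return 7
    if 2 <= n_bytes <= 6:
        # Lead byte: n one-bits and a zero, leaving 7 - n payload bits.
        # Each continuation byte (10xxxxxx) adds 6 more.
        return (7 - n_bytes) + 6 * (n_bytes - 1)
    raise ValueError("UTF-8 sequences are 1 to 6 bytes in the original design")


for n in range(1, 7):
    print(n, "bytes ->", utf8_payload_bits(n), "bits")

# The highest Unicode code point fits in 21 bits and therefore in 4 bytes.
assert 0x10FFFF < 2 ** utf8_payload_bits(4)
assert len(chr(0x10FFFF).encode("utf-8")) == 4
```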



#72605

From: Chris Angelico <rosuav@gmail.com>
Date: 2014-06-04 17:10 +1000
Message-ID: <mailman.10683.1401865837.18130.python-list@python.org>
In reply to: #72595

On Wed, Jun 4, 2014 at 5:00 PM, Terry Reedy <tjreedy@udel.edu> wrote:
> On 6/4/2014 1:55 AM, Ian Kelly wrote:
>>
>>
>> On Jun 3, 2014 11:27 PM, "Steven D'Aprano" <steve@pearwood.info> wrote:
>>  > For technical reasons which I don't fully understand, Unicode only
>>  > uses 21 of those 32 bits, giving a total of 1114112 available code
>>  > points.
>>
>> I think mainly it's to accommodate UTF-16. The surrogate pair scheme is
>> sufficient to encode up to 16 supplementary planes, so if Unicode were
>> allowed to grow any larger than that, UTF-16 would no longer be able to
>> encode all codepoints.
>
>
> I believe the original utf-8 used up to 6 bytes per char to encode 2**32
> potential chars. Just 4 bytes limits to 2**21 and for whatever reason
> (easier decoding?), utf-8 was revised down (unusual ;-).

I understood it to be UTF-16's fault, per Ian's statement. That is to
say, the entire Unicode standard was warped around the problem that
some people were going around thinking "a character is 16 bits", even
though that's just as fallacious as "a character is 8 bits".

ChrisA


