| Started by | Paul Sokolovsky <pmiscml@gmail.com> |
|---|---|
| First post | 2014-06-04 00:41 +0300 |
| Last post | 2014-06-04 17:10 +1000 |
| Articles | 15 on this page of 35 — 15 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: Micro Python -- a lean and efficient implementation of Python 3 Paul Sokolovsky <pmiscml@gmail.com> - 2014-06-04 00:41 +0300
Re: Micro Python -- a lean and efficient implementation of Python 3 Rustom Mody <rustompmody@gmail.com> - 2014-06-03 20:37 -0700
Re: Micro Python -- a lean and efficient implementation of Python 3 Chris Angelico <rosuav@gmail.com> - 2014-06-04 13:52 +1000
Re: Micro Python -- a lean and efficient implementation of Python 3 Rustom Mody <rustompmody@gmail.com> - 2014-06-03 21:40 -0700
Re: Micro Python -- a lean and efficient implementation of Python 3 Ian Kelly <ian.g.kelly@gmail.com> - 2014-06-03 23:02 -0600
Re: Micro Python -- a lean and efficient implementation of Python 3 Chris Angelico <rosuav@gmail.com> - 2014-06-04 17:16 +1000
Re: Micro Python -- a lean and efficient implementation of Python 3 Steven D'Aprano <steve@pearwood.info> - 2014-06-04 07:42 +0000
Re: Micro Python -- a lean and efficient implementation of Python 3 Paul Rubin <no.email@nospam.invalid> - 2014-06-04 00:58 -0700
Re: Micro Python -- a lean and efficient implementation of Python 3 Robin Becker <robin@reportlab.com> - 2014-06-04 11:06 +0100
Re: Micro Python -- a lean and efficient implementation of Python 3 Tim Chase <python.list@tim.thechases.com> - 2014-06-04 06:01 -0500
Re: Micro Python -- a lean and efficient implementation of Python 3 Marko Rauhamaa <marko@pacujo.net> - 2014-06-04 14:57 +0300
Re: Micro Python -- a lean and efficient implementation of Python 3 Tim Chase <python.list@tim.thechases.com> - 2014-06-04 07:25 -0500
Re: Micro Python -- a lean and efficient implementation of Python 3 Paul Rubin <no.email@nospam.invalid> - 2014-06-04 11:25 -0700
Re: Micro Python -- a lean and efficient implementation of Python 3 Robin Becker <robin@reportlab.com> - 2014-06-04 12:53 +0100
Re: Micro Python -- a lean and efficient implementation of Python 3 Marko Rauhamaa <marko@pacujo.net> - 2014-06-04 15:17 +0300
Re: Micro Python -- a lean and efficient implementation of Python 3 Robin Becker <robin@reportlab.com> - 2014-06-04 13:31 +0100
Re: Micro Python -- a lean and efficient implementation of Python 3 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-06-04 13:51 +0000
Re: Micro Python -- a lean and efficient implementation of Python 3 wxjmfauth@gmail.com - 2014-06-10 00:32 -0700
Re: Micro Python -- a lean and efficient implementation of Python 3 wxjmfauth@gmail.com - 2014-06-10 02:13 -0700
Re: Micro Python -- a lean and efficient implementation of Python 3 Tim Chase <python.list@tim.thechases.com> - 2014-06-04 07:21 -0500
Re: Micro Python -- a lean and efficient implementation of Python 3 Travis Griggs <travisgriggs@gmail.com> - 2014-06-06 09:59 -0700
Re: Micro Python -- a lean and efficient implementation of Python 3 Roy Smith <roy@panix.com> - 2014-06-06 13:29 -0400
Re: Micro Python -- a lean and efficient implementation of Python 3 Tim Chase <python.list@tim.thechases.com> - 2014-06-06 21:20 -0500
Re: Micro Python -- a lean and efficient implementation of Python 3 wxjmfauth@gmail.com - 2014-06-10 12:27 -0700
Re: Micro Python -- a lean and efficient implementation of Python 3 Chris Angelico <rosuav@gmail.com> - 2014-06-04 17:20 +1000
Re: Micro Python -- a lean and efficient implementation of Python 3 Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> - 2014-06-04 10:00 +0200
Re: Micro Python -- a lean and efficient implementation of Python 3 Roy Smith <roy@panix.com> - 2014-06-04 14:42 -0400
Re: Micro Python -- a lean and efficient implementation of Python 3 Rustom Mody <rustompmody@gmail.com> - 2014-06-04 19:06 -0700
Re: Micro Python -- a lean and efficient implementation of Python 3 Roy Smith <roy@panix.com> - 2014-06-05 09:59 -0400
Re: Micro Python -- a lean and efficient implementation of Python 3 Chris Angelico <rosuav@gmail.com> - 2014-06-06 01:33 +1000
Re: Micro Python -- a lean and efficient implementation of Python 3 Steven D'Aprano <steve@pearwood.info> - 2014-06-04 05:20 +0000
Re: Micro Python -- a lean and efficient implementation of Python 3 Rustom Mody <rustompmody@gmail.com> - 2014-06-03 22:36 -0700
Re: Micro Python -- a lean and efficient implementation of Python 3 Ian Kelly <ian.g.kelly@gmail.com> - 2014-06-03 23:55 -0600
Re: Micro Python -- a lean and efficient implementation of Python 3 Terry Reedy <tjreedy@udel.edu> - 2014-06-04 03:00 -0400
Re: Micro Python -- a lean and efficient implementation of Python 3 Chris Angelico <rosuav@gmail.com> - 2014-06-04 17:10 +1000
| From | Travis Griggs <travisgriggs@gmail.com> |
|---|---|
| Date | 2014-06-06 09:59 -0700 |
| Message-ID | <mailman.10822.1402073958.18130.python-list@python.org> |
| In reply to | #72614 |
On Jun 4, 2014, at 4:01 AM, Tim Chase <python.list@tim.thechases.com> wrote:
> If you use UTF-8 for everything

It seems to me that other libraries (C, etc.) increasingly use utf8 as the preferred string interchange format. It’s universal, not prone to endian issues, etc. So one *advantage* you gain from using utf8 internally is that any time you need to hand a string to an external thing, it’s just ready. An app that limits its internal string processing to streaming-based operations but has to hand strings to external libraries a lot (e.g. cairo) might actually benefit from using utf8 internally, because a) it’s not doing the linear search for the odd character address and b) it no longer needs to decode/encode every time it sends a string to, or receives one from, an external library.
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2014-06-06 13:29 -0400 |
| Message-ID | <roy-4FAEBF.13291606062014@news.panix.com> |
| In reply to | #72861 |
In article <mailman.10822.1402073958.18130.python-list@python.org>, Travis Griggs <travisgriggs@gmail.com> wrote:
> On Jun 4, 2014, at 4:01 AM, Tim Chase <python.list@tim.thechases.com> wrote:
> > If you use UTF-8 for everything
>
> It seems to me, that increasingly other libraries (C, etc), use utf8 as the
> preferred string interchange format. It's universal, not prone to endian
> issues, etc.

One of the important etc factors is, "Since it's the most commonly used, it's the one that other people are most likely to have implemented correctly". In the real world, these are important considerations.
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2014-06-06 21:20 -0500 |
| Message-ID | <mailman.10843.1402107662.18130.python-list@python.org> |
| In reply to | #72614 |
On 2014-06-06 09:59, Travis Griggs wrote:
> On Jun 4, 2014, at 4:01 AM, Tim Chase wrote:
> > If you use UTF-8 for everything
>
> It seems to me, that increasingly other libraries (C, etc), use
> utf8 as the preferred string interchange format.

I definitely advocate UTF-8 for any streaming scenario, as you're iterating unidirectionally over the data anyways, so why use/transmit more bytes than needed. The only failing of UTF-8 that I've found in the real world(*) is when you have the requirement of constant-time indexing into strings.

-tkc
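The constant-time indexing point can be illustrated with a minimal sketch. The helper name `utf8_index` is hypothetical and not taken from any library; it only assumes the standard UTF-8 rule that continuation bytes have the bit pattern `10xxxxxx`. Finding the N-th code point means walking the buffer from the start, so the cost grows with the index.

```python
def utf8_index(data: bytes, index: int) -> str:
    """Return the code point at position `index` in UTF-8 encoded `data`.

    Hypothetical helper for illustration: it scans from the start of the
    buffer, so the cost is linear in `index`, unlike a fixed-width array.
    """
    if index < 0:
        raise IndexError("negative indexes are not supported in this sketch")
    seen = -1      # index of the code point whose lead byte was seen last
    start = None   # byte offset where the requested code point begins
    for pos, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts
        # a new code point.
        if byte & 0xC0 != 0x80:
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    # The requested code point was the last one in the buffer.
    return data[start:].decode("utf-8")


text = "a\u00e9\u0fce\U0001F600z"
data = text.encode("utf-8")
print([utf8_index(data, i) for i in range(len(text))] == list(text))
```

A streaming consumer never pays this cost, because it only ever moves forward one code point at a time.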
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2014-06-10 12:27 -0700 |
| Message-ID | <ac6d6893-d11b-4c8e-bf38-5f4200fcc163@googlegroups.com> |
| In reply to | #72897 |
On Saturday, 7 June 2014 04:20:22 UTC+2, Tim Chase wrote:
> On 2014-06-06 09:59, Travis Griggs wrote:
> > On Jun 4, 2014, at 4:01 AM, Tim Chase wrote:
> > > If you use UTF-8 for everything
> >
> > It seems to me, that increasingly other libraries (C, etc), use
> > utf8 as the preferred string interchange format.
>
> I definitely advocate UTF-8 for any streaming scenario, as you're
> iterating unidirectionally over the data anyways, so why use/transmit
> more bytes than needed. The only failing of UTF-8 that I've found in
> the real world(*) is when you have the requirement of constant-time
> indexing into strings.
>
> -tkc
And once again, just an illustration,
>>> import sys, timeit
>>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'; y = 'z'")
[0.9457552436453511, 0.9190932610143818, 0.9322044912393039]
>>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'; y = '\u0fce'")
[2.5541921791045183, 2.52434366066052, 2.5337417948967413]
>>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'.encode('utf-8'); y = 'z'.encode('utf-8')")
[0.9168235779232532, 0.8989583403075017, 0.8964204541650247]
>>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'.encode('utf-8'); y = '\u0fce'.encode('utf-8')")
[0.9320969737165115, 0.9086006535332558, 0.9051715140790861]
>>>
>>>
>>> sys.getsizeof('abc'*1000 + '\u0fce')
6040
>>> sys.getsizeof(('abc'*1000 + '\u0fce').encode('utf-8'))
3020
>>>
But you know, that's not the problem.
When I see a core developer discussing benchmarking,
when the same application using non-ASCII chars becomes
1, 2, 5, 10, 20 if not more times slower compared to pure
ASCII, I'm wondering if there is not a serious problem
somewhere.
(and also becoming slower than Py3.2)
BTW, very easy to explain.
I do not understand why the "free, open, what-you-wish-here, ... "
software is so often pushing to the adoption of serious
corporate products.
jmf
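The `sys.getsizeof` figures quoted above follow from the per-character width that CPython's flexible string representation (PEP 393, CPython 3.3 and later) picks for each string. The sketch below is CPython-specific and the helper name `bytes_per_char` is hypothetical. It subtracts two sizes so the object header, which varies between versions and platforms, cancels out.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate the storage used per character for strings made only of `ch`.

    Assumes CPython 3.3+ (PEP 393). The difference of two sizes is used so
    that the fixed object header drops out of the estimate.
    """
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n


for ch in ("z", "\u0fce", "\U0001F600"):
    print(f"U+{ord(ch):04X}: {bytes_per_char(ch)} byte(s) per char in str, "
          f"{len(ch.encode('utf-8'))} byte(s) in UTF-8")
```

A single U+0FCE in an otherwise ASCII string moves the whole string to two bytes per character, which matches the roughly doubled size in the session above. This says nothing by itself about the timing differences.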
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-06-04 17:20 +1000 |
| Message-ID | <mailman.10686.1401866454.18130.python-list@python.org> |
| In reply to | #72588 |
On Wed, Jun 4, 2014 at 3:02 PM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> On Tue, Jun 3, 2014 at 10:40 PM, Rustom Mody <rustompmody@gmail.com> wrote:
>>> 1) Most or all Chinese and Japanese characters
>>
>> Dont know how you count 'most'
>>
>> | One possible rationale is the desire to limit the size of the full
>> | Unicode character set, where CJK characters as represented by discrete
>> | ideograms may approach or exceed 100,000 (while those required for
>> | ordinary literacy in any language are probably under 3,000). Version 1
>> | of Unicode was designed to fit into 16 bits and only 20,940 characters
>> | (32%) out of the possible 65,536 were reserved for these CJK Unified
>> | Ideographs. Later Unicode has been extended to 21 bits allowing many
>> | more CJK characters (75,960 are assigned, with room for more).
>>
>> | From http://en.wikipedia.org/wiki/Han_unification
>
> So there are 20,940 CJK characters in the BMP, and approximately
> 55,000 more in the SIP. I'd count 55,000 out of 75,960 as "most".

And I said "or all" because I have this vague notion that either NFC or NFD pushes stuff out of the BMP, although I may be wrong on that. But certainly 55K/75K "with room for more" is the "most" that I was talking about.

(Maybe it isn't "most" by usage. After all, hypertext documents are usually smaller in UTF-8 than in UTF-16, despite "most characters" (counting purely by 21-bit space in codepoints) being more compact in UTF-16; most by usage is of ASCII, because hypertext involves a lot of punctuation and such. But still, there are a lot of CJK that aren't in the BMP.)

ChrisA
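The size trade-off mentioned in the parenthesis can be checked with a short sketch. The sample strings and the helper name `encoded_sizes` are made up for illustration. ASCII-heavy markup is smaller in UTF-8, while BMP CJK text is smaller in UTF-16.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes of `text` under three encodings.

    The "-le" variants are used so that no byte order mark is counted.
    """
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


markup = '<p class="note">Hello</p>'   # illustrative ASCII-only markup
cjk = "日本語の文章"                     # illustrative BMP CJK text

print("markup:", encoded_sizes(markup))
print("cjk:   ", encoded_sizes(cjk))
```

For a real hypertext document the two effects mix, and the markup usually dominates.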
| From | Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> |
|---|---|
| Date | 2014-06-04 10:00 +0200 |
| Message-ID | <mailman.10688.1401868854.18130.python-list@python.org> |
| In reply to | #72588 |
On 04.06.2014 09:16, Chris Angelico wrote:
> The point is
> not that you might be able to get away with sticking your head in the
> sand and wishing Unicode would just go away. Even if you can, it's not
> something Python 3 can ever do.

Exactly. These endless discussions about different encodings start to get really boring. I cannot think of any aspect of it that hasn't been discussed here on several occasions, but as a fact: "Strings are immutable sequences of Unicode code points" in Python3 (https://docs.python.org/3/library/stdtypes.html?highlight=str#textseq) and this is not an implementation detail. So if any "implementation" doesn't stick to this convention, it is simply incomplete.

> And I don't think anybody can, anyway. If your device is big enough to
> hold Python, it should be big enough to handle Unicode; and then you
> don't have to say "Oh, sorry rest-of-the-world, this only works in
> English... and only a subset of English... and stuff".

Wolfgang
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2014-06-04 14:42 -0400 |
| Message-ID | <roy-77A8E4.14420604062014@news.panix.com> |
| In reply to | #72583 |
In article <mailman.10673.1401853976.18130.python-list@python.org>, Chris Angelico <rosuav@gmail.com> wrote:
> You can't ignore those. You might be able to say "Well, my program
> will run slower if you throw these at it", but if you're going down
> that route, you probably want the full FSR and the advantages it
> confers on ASCII and Latin-1 strings. Binding your program to BMP-only
> is nearly as dangerous as binding it to ASCII-only; potentially worse,
> because you can run an awful lot of artificial tests without
> remembering to stick in some astral characters.

Yup. I wrote a while(*) back about the pain I was having importing some data into a MySQL(**) database which (unknown to me when I started) only handled BMP. It turns out in the entire dataset of 20-odd million records, there were exactly four that had astral characters. All of my tests worked. I didn't discover the problem until it blew up many hours into the "final" production import run.

(*) Two years?
(**) This was not the only pain point with MySQL. We eventually switched to Postgres.
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-06-04 19:06 -0700 |
| Message-ID | <f935e85f-f86a-4821-86ab-3ab7e5e216d7@googlegroups.com> |
| In reply to | #72652 |
On Thursday, June 5, 2014 12:12:06 AM UTC+5:30, Roy Smith wrote:
> Chris Angelico wrote:
> > You can't ignore those. You might be able to say "Well, my program
> > will run slower if you throw these at it", but if you're going down
> > that route, you probably want the full FSR and the advantages it
> > confers on ASCII and Latin-1 strings. Binding your program to BMP-only
> > is nearly as dangerous as binding it to ASCII-only; potentially worse,
> > because you can run an awful lot of artificial tests without
> > remembering to stick in some astral characters.
>
> Yup. I wrote a while(*) back about the pain I was having importing some
> data into a MySQL(**) database which (unknown to me when I started) only
> handled BMP. It turns out in the entire dataset of 20-odd million
> records, there were exactly four that had astral characters. All of my
> tests worked. I didn't discover the problem until it blew up many hours
> into the "final" production import run.
>
> (*) Two years?
> (**) This was not the only pain point with MySQL. We eventually
> switched to Postgres.

Thanks Roy for bringing up that example - I was trying to recollect the details. I forgot about the MySQL angle which adds a different twist to it.

Here's my interpretation of that situation; I'd like to hear yours:

Basic problem was that MySQL handled a strict subset of what the rest of the system (Python 2.7?) could handle. This meant that at a late (and embarrassing) stage, exceptions were being thrown, from deep within the system.

OTOH, let's say you could detect the 'error' (more correctly 'un-handle-able') at the borders of your system, say when the user enters the data on a web-form. Would you have a problem kicking out those characters (in both senses!) with a curt: "Can't deal with all this supra-galactic rubble!" ?

Of course switching to postgres may be a sound choice on other fronts. But if that were not an option, and you only had these choices:
- significantly complexify your MySQL data structures to handle 4 in 20 million cases
- just detect and throw such cases out at the outset

which would you take?

In any case this is the choice I hear from the micropython folks who are explicitly seeking a cutdown version of python
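The "detect and throw such cases out at the outset" option could look roughly like the sketch below. The function names and sample records are hypothetical, and it assumes Python 3.3 or later, where iterating a `str` yields whole code points rather than surrogate halves.

```python
def astral_chars(text: str) -> list:
    """Return the characters of `text` that lie outside the BMP (above U+FFFF)."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


def split_records(records):
    """Partition records into (accepted, rejected) by BMP-only content.

    Meant to run at the border of the system, before any record reaches
    a BMP-only store.
    """
    accepted, rejected = [], []
    for record in records:
        (rejected if astral_chars(record) else accepted).append(record)
    return accepted, rejected


# Illustrative input only; not data from the thread.
records = ["plain", "caf\u00e9", "\u65e5\u672c", "smile \U0001F600"]
accepted, rejected = split_records(records)
print(len(accepted), "accepted;", len(rejected), "rejected")
```

Whether rejecting is acceptable is a requirements question, as the following replies point out; the check itself is cheap.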
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2014-06-05 09:59 -0400 |
| Message-ID | <roy-A7AB97.09590305062014@news.panix.com> |
| In reply to | #72667 |
In article <f935e85f-f86a-4821-86ab-3ab7e5e216d7@googlegroups.com>, Rustom Mody <rustompmody@gmail.com> wrote:
> On Thursday, June 5, 2014 12:12:06 AM UTC+5:30, Roy Smith wrote:
> > Yup. I wrote a while(*) back about the pain I was having importing some
> > data into a MySQL(**) database
>
> Here's my interpretation of that situation; I'd like to hear yours:
>
> Basic problem was that MySQL handled a strict subset of what the rest
> of the system (Python 2.7?) could handle.

Yes. This was not a Python issue. I was just responding to ChrisA's statement:

>>> Binding your program to BMP-only is nearly as dangerous as binding
>>> it to ASCII-only; potentially worse, because you can run an awful
>>> lot of artificial tests without remembering to stick in some astral
>>> characters.

> Of course switching to postgres may be a sound choice on other fronts.
> But if that were not an option, and you only had these choices:
>
> - significantly complexify your MySQL data structures to handle 4 in
> 20 million cases
> - just detect and throw such cases out at the outset
>
> which would you take?

It turns out, we could have upgraded to a newer version of MySQL, which did handle astral characters correctly. But what we did was discard the records containing non-BMP data. Of course, that's a decision that can only be made when you understand the business requirements. In our case, discarding those four records had no impact on our business, so it made sense. For other people, not having the full dataset might have been a fatal problem.

This was just one of many MySQL problems we ran into. Eventually, we decided it wasn't worth fighting with what was obviously a brain-dead system, and switched databases.
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-06-06 01:33 +1000 |
| Message-ID | <mailman.10738.1401982394.18130.python-list@python.org> |
| In reply to | #72705 |
On Thu, Jun 5, 2014 at 11:59 PM, Roy Smith <roy@panix.com> wrote:
> It turns out, we could have upgraded to a newer version of MySQL, which
> did handle astral characters correctly. But, what we did was discarded
> the records containing non-BMP data. Of course, that's a decision that
> can only be made when you understand the business requirements. In our
> case, discarding those four records had no impact on our business, so it
> made sense. For other people, not having the full dataset might have
> been a fatal problem.
>
> This was just one of many MySQL problems we ran into. Eventually, we
> decided it wasn't worth fighting with what was obviously a brain-dead
> system, and switched databases.

Point to note: It's not just "Avoid MySQL version x.y.z, it's buggy", but "Make sure you're on a sufficiently new version of MySQL *and then use these settings*". For instance, the MySQL "utf8" locale/collation/charset (not sure what it calls it) supports only the BMP; you have to use "utf8mb4", which is UTF-8 that's allowed to go as far as four bytes long. What were they thinking? What, were they thinking?

I understand there's now an alias "utf8mb3" for the buggy utf8, with some theory that some future version of MySQL might make utf8 become an alias for utf8mb4. But when would you ever actually *demand* this buggy behaviour? Why not just say "as of this version, utf8 is identical to utf8mb4, which was a superset thereof", and if anything changes or breaks, just acknowledge that it used to be buggy?

</rant>

Use PostgreSQL.

</obvious>

ChrisA
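The "three bytes versus four bytes" distinction can be seen directly from the UTF-8 encoded width of individual characters. This sketch makes no database calls; the helper names are hypothetical, and the three-byte limit is taken from the description of the "utf8" charset in the post above.

```python
def utf8_width(ch: str) -> int:
    """Number of bytes UTF-8 needs for the single character `ch`."""
    return len(ch.encode("utf-8"))


def fits_in_three_bytes(text: str) -> bool:
    """True if every character of `text` encodes to at most 3 UTF-8 bytes.

    That is the same as saying the text contains only BMP characters.
    """
    return all(utf8_width(ch) <= 3 for ch in text)


for ch in ("A", "\u00e9", "\u65e5", "\U0001F600"):
    print(f"U+{ord(ch):04X} needs {utf8_width(ch)} byte(s)")
```

Any character above U+FFFF needs the fourth byte, which is why a three-byte-limited store cannot hold astral characters.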
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2014-06-04 05:20 +0000 |
| Message-ID | <538eac94$0$11109$c3e8da3@news.astraweb.com> |
| In reply to | #72582 |
On Tue, 03 Jun 2014 20:37:27 -0700, Rustom Mody wrote:
> On Wednesday, June 4, 2014 3:11:12 AM UTC+5:30, Paul Sokolovsky wrote:
>
>> With that in mind, I, as many others, think that forcing Unicode bloat
>> upon people by default is the most controversial feature of Python3.
>> The reason is that you go very long way dealing with languages of the
>> people of the world by just treating strings as consisting of 8-bit
>> data. I'd say, that's enough for 90% of applications. Unicode is needed
>> only if one needs to deal with multiple languages *at the same time*,
>> which is fairly rare (remaining 10% of apps).
>
>> And please keep in mind that MicroPython was originally intended (and
>> should remain scalable down to) an MCU. Unicode needed there is even
>> less, and even less resources to support Unicode just because.
>
> At some time (when jmf was making more intelligible noises) I had
> suggested that the choice between 1/2/4 byte strings that happens at
> runtime in python3's FSR can be made at python-start time with a
> command-line switch. There are many combinations here; here is one in
> more detail:
>
> Instead of having one (FSR) string engine, you have (upto) 4
>
> - a pure 1 byte (ASCII)

There are only 128 ASCII characters, so a pure ASCII implementation cannot even represent arbitrary bytes.

> - a pure 2 byte (BMP) with decode-failures for out-of-ranges

That's not Unicode. It's a subset of Unicode.

> - a pure 4 byte -- everything UTF-32

For embedded devices, that would be extremely memory hungry. Remember, every variable, every attribute name, every method and class and function name is a string. Using at least 56 bytes just to refer to sys.stdout.write will be painful.

> - FSR dynamic switching at runtime (with massive moping from the world's
> jmfs)

Please stop giving JMF's crackpot opinion even the dignity of being sneered at.

[...]

> 2. My casual/cursory reading of the contents of the SMP-planes suggests
> that the stuff there is things like
> - egyptian hieroglyphics
> - mahjong characters
> - ancient greek musical symbols
> - alchemical symbols etc etc.
>
> IOW from pov of a universally acceptable character set this is mostly
> rubbish

Certainly some of these things are more whimsical than practical, but it doesn't really matter. Even if you strip out every bit of whimsy from the Unicode character set, you're still left with needing more than 65536 characters (16 bits). For efficiency you aren't going to use 17 bits, or 18, or 19, so it's actually faster and more efficient to jump right to 32 bits.

For technical reasons which I don't fully understand, Unicode only uses 21 of those 32 bits, giving a total of 1114112 available code points. Whether you or I personally have need for alchemical symbols, *some people* do, and supporting their use-case doesn't harm us by one bit.

> And so a pure BMP-supporting implementation may be a reasonable
> compromise. [As long as no surrogate-pairs are there]

At the cost of one extra bit, strings could use UTF-16 internally and still have correct behaviour. The bit could be a flag recording whether the string contains any surrogate pairs. If the flag was 0, all string operations could assume a constant 2-bytes-per-character. If the flag was 1, it could fall back to walking the string checking for surrogate pairs.

--
Steven
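The flag idea in the last paragraph can be sketched as follows. This is a hypothetical toy class, not how CPython or MicroPython store strings. It assumes well-formed input with no lone surrogates and supports only non-negative indexes.

```python
class Utf16String:
    """Toy UTF-16 string with a flag recording whether surrogate pairs occur."""

    def __init__(self, text: str):
        self._units = text.encode("utf-16-le")  # two bytes per code unit
        # The "one extra bit": set if any character lies outside the BMP.
        self.has_surrogates = any(ord(ch) > 0xFFFF for ch in text)

    def _unit(self, i: int) -> int:
        """Return the i-th 16-bit code unit as an integer."""
        return int.from_bytes(self._units[2 * i:2 * i + 2], "little")

    def __len__(self) -> int:
        n_units = len(self._units) // 2
        if not self.has_surrogates:
            return n_units  # fast path: one code unit per character
        # Slow path: count every code unit except low (trailing) surrogates.
        return sum(1 for i in range(n_units)
                   if not 0xDC00 <= self._unit(i) <= 0xDFFF)

    def __getitem__(self, index: int) -> str:
        if index < 0 or index >= len(self):
            raise IndexError("index out of range")
        if not self.has_surrogates:
            # Fast path: constant-time slice of a fixed-width array.
            return self._units[2 * index:2 * index + 2].decode("utf-16-le")
        # Slow path: walk the code units, stepping over pairs as a whole.
        i = 0
        for _ in range(index):
            i += 2 if 0xD800 <= self._unit(i) <= 0xDBFF else 1
        width = 2 if 0xD800 <= self._unit(i) <= 0xDBFF else 1
        return self._units[2 * i:2 * (i + width)].decode("utf-16-le")


s = Utf16String("a\U0001F600b")
print(s.has_surrogates, len(s), s[1] == "\U0001F600", s[2])
```

Strings that stay within the BMP never leave the fast path, so the common case keeps fixed-width behaviour at the cost of a single flag.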
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-06-03 22:36 -0700 |
| Message-ID | <f0a2d25f-3480-4ebc-b41e-603a77b3451d@googlegroups.com> |
| In reply to | #72595 |
On Wednesday, June 4, 2014 10:50:21 AM UTC+5:30, Steven D'Aprano wrote:
> On Tue, 03 Jun 2014 20:37:27 -0700, Rustom Mody wrote:
> > And so a pure BMP-supporting implementation may be a reasonable
> > compromise. [As long as no surrogate-pairs are there]
>
> At the cost of one extra bit, strings could use UTF-16 internally and
> still have correct behaviour. The bit could be a flag recording whether
> the string contains any surrogate pairs. If the flag was 0, all string
> operations could assume a constant 2-bytes-per-character. If the flag was
> 1, it could fall back to walking the string checking for surrogate pairs.

Yes. That could be one possibility. My main reason in giving the 4-engine choice was not that 4 engines are a good idea but that in the very differently constrained world of μ-controllers playing around with alternate binding times may be advantageous

> > On Wednesday, June 4, 2014 3:11:12 AM UTC+5:30, Paul Sokolovsky wrote:
> >> With that in mind, I, as many others, think that forcing Unicode bloat
> >> upon people by default is the most controversial feature of Python3.
> >> The reason is that you go very long way dealing with languages of the
> >> people of the world by just treating strings as consisting of 8-bit
> >> data. I'd say, that's enough for 90% of applications. Unicode is needed
> >> only if one needs to deal with multiple languages *at the same time*,
> >> which is fairly rare (remaining 10% of apps).
> >> And please keep in mind that MicroPython was originally intended (and
> >> should remain scalable down to) an MCU. Unicode needed there is even
> >> less, and even less resources to support Unicode just because.
> > At some time (when jmf was making more intelligible noises) I had
> > suggested that the choice between 1/2/4 byte strings that happens at
> > runtime in python3's FSR can be made at python-start time with a
> > command-line switch. There are many combinations here; here is one in
> > more detail:
> > Instead of having one (FSR) string engine, you have (upto) 4
> > - a pure 1 byte (ASCII)
>
> There are only 128 ASCII characters, so a pure ASCII implementation
> cannot even represent arbitrary bytes.

Yes this is a subtle point. I was initially going to write Latin-1. Wrote a rough-n-ready ASCII. But maybe it could be a choice.

I really don't understand the binding-times of μ-controllers. My impression is that actual development is split:
1. tinkering with the board
2. working on full powered computers and downloading to the board

In going from 2 to 1 heavy amounts of cut-downs are probably possible and desirable. If this is the case, having hooks in the system for making choices may be a good idea; optimal choices may be worthwhile
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2014-06-03 23:55 -0600 |
| Message-ID | <mailman.10679.1401861637.18130.python-list@python.org> |
| In reply to | #72595 |
On Jun 3, 2014 11:27 PM, "Steven D'Aprano" <steve@pearwood.info> wrote:
> For technical reasons which I don't fully understand, Unicode only
> uses 21 of those 32 bits, giving a total of 1114112 available code
> points.

I think mainly it's to accommodate UTF-16. The surrogate pair scheme is sufficient to encode up to 16 supplementary planes, so if Unicode were allowed to grow any larger than that, UTF-16 would no longer be able to encode all codepoints.

Another benefit of fixing the size is that it frees the other 11 bits per character of UTF-32 for packing in ancillary data.
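The arithmetic behind the "16 supplementary planes" and the 1114112 figure can be worked through in a few lines, using the standard surrogate ranges (high U+D800 to U+DBFF, low U+DC00 to U+DFFF). The function name `to_surrogate_pair` is a hypothetical helper for illustration.

```python
HIGH_COUNT = 0xDC00 - 0xD800   # 1024 high (leading) surrogates
LOW_COUNT = 0xE000 - 0xDC00    # 1024 low (trailing) surrogates

supplementary = HIGH_COUNT * LOW_COUNT   # 2**20 code points reachable by pairs
planes = supplementary // 0x10000        # planes of 65536 code points each
total = 0x10000 + supplementary          # the BMP plus the supplementary planes

print(supplementary, planes, total)      # total equals 0x110000


def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary code point into its (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000        # a 20-bit value
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


print([hex(unit) for unit in to_surrogate_pair(0x1F600)])
```

The highest pair, (0xDBFF, 0xDFFF), maps to U+10FFFF, so the pair scheme has no room for anything beyond that code point.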
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2014-06-04 03:00 -0400 |
| Message-ID | <mailman.10682.1401865221.18130.python-list@python.org> |
| In reply to | #72595 |
On 6/4/2014 1:55 AM, Ian Kelly wrote:
>
> On Jun 3, 2014 11:27 PM, "Steven D'Aprano" <steve@pearwood.info> wrote:
> > For technical reasons which I don't fully understand, Unicode only
> > uses 21 of those 32 bits, giving a total of 1114112 available code
> > points.
>
> I think mainly it's to accommodate UTF-16. The surrogate pair scheme is
> sufficient to encode up to 16 supplementary planes, so if Unicode were
> allowed to grow any larger than that, UTF-16 would no longer be able to
> encode all codepoints.

I believe the original utf-8 used up to 6 bytes per char to encode 2**32 potential chars. Just 4 bytes limits to 2**21 and for whatever reason (easier decoding?), utf-8 was revised down (unusual ;-).

--
Terry Jan Reedy
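The byte-count figures can be checked from the UTF-8 bit layout: an n-byte sequence (n of 2 or more) has a lead byte with n one-bits, a zero, and 7 - n payload bits, followed by n - 1 continuation bytes carrying 6 bits each. The helper below is a hypothetical sketch. The 5- and 6-byte forms belong to the original design and are no longer valid UTF-8 since RFC 3629.

```python
def utf8_payload_bits(n_bytes: int) -> int:
    """Number of code point bits an n-byte UTF-8 sequence can carry."""
    if not 1 <= n_bytes <= 6:
        raise ValueError("UTF-8 sequences were defined for 1 to 6 bytes")
    if n_bytes == 1:
        return 7                          # 0xxxxxxx
    # Lead byte carries 7 - n bits, each continuation byte carries 6.
    return (7 - n_bytes) + 6 * (n_bytes - 1)


for n in range(1, 7):
    print(n, "byte(s):", utf8_payload_bits(n), "bits")
```

By this count four bytes give the 2**21 mentioned in the post, while six bytes give 31 bits rather than a full 32.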
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-06-04 17:10 +1000 |
| Message-ID | <mailman.10683.1401865837.18130.python-list@python.org> |
| In reply to | #72595 |
On Wed, Jun 4, 2014 at 5:00 PM, Terry Reedy <tjreedy@udel.edu> wrote:
> On 6/4/2014 1:55 AM, Ian Kelly wrote:
>>
>> On Jun 3, 2014 11:27 PM, "Steven D'Aprano" <steve@pearwood.info> wrote:
>> > For technical reasons which I don't fully understand, Unicode only
>> > uses 21 of those 32 bits, giving a total of 1114112 available code
>> > points.
>>
>> I think mainly it's to accommodate UTF-16. The surrogate pair scheme is
>> sufficient to encode up to 16 supplementary planes, so if Unicode were
>> allowed to grow any larger than that, UTF-16 would no longer be able to
>> encode all codepoints.
>
> I believe the original utf-8 used up to 6 bytes per char to encode 2**32
> potential chars. Just 4 bytes limits to 2**21 and for whatever reason
> (easier decoding?), utf-8 was revised down (unusual ;-).

I understood it to be UTF-16's fault, per Ian's statement. That is to say, the entire Unicode standard was warped around the problem that some people were going around thinking "a character is 16 bits", even though that's just as fallacious as "a character is 8 bits".

ChrisA