Groups > comp.lang.python > #86311 > unrolled thread
| Started by | pierrick.brihaye@gmail.com |
|---|---|
| First post | 2015-02-24 02:49 -0800 |
| Last post | 2015-02-27 10:23 +1100 |
| Articles | 20 on this page of 158 — 19 participants |
Newbie question about text encoding pierrick.brihaye@gmail.com - 2015-02-24 02:49 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-24 22:09 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 06:25 -0500
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 15:55 +0100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:03 +1100
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:06 +0100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-24 08:01 -0800
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:07 +0100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:10 +1100
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:24 +0100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:33 +1100
Re: Newbie question about text encoding random832@fastmail.us - 2015-02-24 10:38 -0500
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 17:20 +0100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 03:24 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 12:13 -0500
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 20:45 +0100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-02-25 00:21 +0200
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-25 12:20 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-25 06:34 -0800
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 20:57 +0100
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-25 12:19 +1100
Re: Newbie question about text encoding Marcos Almeida Azevedo <marcos.al.azevedo@gmail.com> - 2015-02-25 12:54 +0800
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 15:41 -0500
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 04:40 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 05:15 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 00:24 +1100
Re: Newbie question about text encoding Sam Raker <sam.raker@gmail.com> - 2015-02-26 08:45 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 09:08 -0800
Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-02-26 12:02 -0500
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 09:59 -0800
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-26 12:20 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 09:13 +1100
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 12:05 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-26 20:57 -0500
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 16:58 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 02:30 -0500
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 22:54 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 09:02 -0500
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 01:22 +1100
Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 16:00 +0000
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 03:12 +1100
Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 16:45 +0000
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 04:45 +1100
Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 22:13 +0000
Re: Newbie question about text encoding MRAB <python@mrabarnett.plus.com> - 2015-02-27 19:14 +0000
Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 22:09 +0000
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 15:52 -0500
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 08:04 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 10:24 -0500
Re: Newbie question about text encoding Grant Edwards <invalid@invalid.invalid> - 2015-02-27 17:46 +0000
Re: Newbie question about text encoding Grant Edwards <invalid@invalid.invalid> - 2015-02-27 17:47 +0000
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-27 01:06 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 11:59 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 10:03 -0800
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-03 10:36 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:45 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 15:54 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 21:05 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-06 01:06 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-05 06:59 -0800
Re: Newbie question about text encoding random832@fastmail.us - 2015-03-05 14:59 -0500
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-06 09:33 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-05 20:53 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-06 16:20 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 01:02 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 01:06 -0800
Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 08:33 -0500
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 00:39 +1100
Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 09:03 -0500
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 01:11 +1100
Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 09:27 -0500
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 03:26 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-06 20:54 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 02:07 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 01:50 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 02:27 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 07:37 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 08:20 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 03:45 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 11:41 -0800
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 11:58 -0800
Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-03-07 01:11 -0500
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 23:43 -0800
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 00:55 -0800
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 01:08 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 21:25 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 22:09 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 22:33 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 13:53 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 23:02 +1100
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 14:07 +0000
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 07:28 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 02:40 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 17:48 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:17 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:25 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:41 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:54 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:58 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:00 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:14 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:26 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:50 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:59 +1100
Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 18:02 +0000
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 05:13 +1100
Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 18:34 +0000
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 05:44 +1100
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 19:00 +0000
Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 19:16 +0000
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 21:01 +0200
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 16:40 +0000
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:48 +0200
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 17:02 +0000
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:16 +0200
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 18:18 +0000
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 21:06 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:53 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 11:03 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 12:45 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 09:20 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 18:37 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 10:09 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 19:23 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-08 01:18 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 05:25 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 22:09 +0200
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 12:43 +1100
Re: Newbie question about text encoding Ben Finney <ben+python@benfinney.id.au> - 2015-03-09 13:09 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-09 08:31 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 13:18 +1100
Re: Newbie question about text encoding random832@fastmail.us - 2015-03-09 00:27 -0400
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 07:55 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 08:13 +1100
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 17:34 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 17:44 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-09 02:08 -0700
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-09 07:26 -0700
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-09 05:28 -0700
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 19:01 +1100
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 14:13 +0000
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 23:23 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 05:30 +1100
Re: Newbie question about text encoding Cameron Simpson <cs@zip.com.au> - 2015-03-09 13:09 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-08 19:42 -0700
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 19:16 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 05:43 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 18:53 -0800
Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-03-03 18:30 -0500
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 13:54 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 14:02 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:05 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:16 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 19:14 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-04 02:16 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 04:29 +1100
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 10:09 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 10:23 +1100
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2015-03-06 11:58 -0800 |
| Message-ID | <87d1076d-4b71-4705-8e5b-ef58c5086bcd@googlegroups.com> |
| In reply to | #87055 |
On Friday, 6 March 2015 at 20:41:36 UTC+1, wxjm...@gmail.com wrote:
> On Friday, 6 March 2015 at 17:21:10 UTC+1, Rustom Mody wrote:
> > On Friday, March 6, 2015 at 8:20:22 PM UTC+5:30, Steven D'Aprano wrote:
> > > Rustom Mody wrote:
> > >
> > > > On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:
> > >
> > > [snip example of an analogous situation with NULs]
> > >
> > > > Strawman.
> > >
> > > Sigh. If I had a dollar for every time somebody cried "Strawman!" when what
> > > they really should say is "Yes, that's a good argument, I'm afraid I can't
> > > argue against it, at least not without considerable thought", I'd be a
> > > wealthy man...
> >
> > Missed my addition? Here it is again – grammar slightly corrected.
> >
> > ===========
> > Ah well if you insist on pursuing the nul-char example...
> > - No, the unicode consortium (or ASCII equivalent) is not wrong in allocating codepoint 0
> >
> > - No, the code that "can't cope with a perfectly normal character" is not wrong
> >
> > - It is C that is wrong for designing a buggy string data structure that cannot
> > contain a valid char.
> > ===========
> >
> > In fact Chris' nul-char example is so strongly supporting my argument – bugginess of UTF-16 –
> > it is perhaps too strong even for me.
> >
> > To elaborate:
> > Take the buggy-plane analogy I gave in
> > http://blog.languager.org/2015/03/whimsical-unicode.html
> >
> > If a plane model crashes once in 10,000 flights compared to others that crash once in
> > one million flights we can call it bug-prone though not strictly buggy – it does fly
> > 9999 times safely!
> > OTOH if a plane is guaranteed to crash we can call it a buggy plane.
> >
> > C's string is not bug-prone, it's plain buggy, as it cannot represent strings
> > with nulls.
> >
> > I would not go that far for UTF-16.
> > It is bug-inviting but it can also be implemented correctly.
> > >
> > >
> > > > Let's please stick to UTF-16, shall we?
> > > >
> > > > Now tell me:
> > > > - Is it broken or not?
> > >
> > > The UTF-16 standard is not broken. It is a perfectly adequate variable-width
> > > encoding, and considerably better than most other variable-width encodings.
> > >
> > > However, many implementations of UTF-16 are faulty, and assume a
> > > fixed-width. *That* is broken, not UTF-16.
> > >
> > > (The difference between specification and implementation is critical.)
> > >
> > >
> > > > - Is it widely used or not?
> > >
> > > It's quite widely used.
> > >
> > >
> > > > - Should programmers be careful of it or not?
> > >
> > > Programmers should be aware whether or not any specific language uses UTF-16
> > > and whether the implementation is buggy. That will help them decide whether
> > > or not to use that language.
> > >
> > >
> > > > - Should programmers be warned about it or not?
> > >
> > > I'm in favour of people having more knowledge rather than less. I don't
> > > believe that ignorance is bliss, except perhaps in the case that a giant
> > > asteroid the size of Texas is heading straight for us.
> > >
> > > Programmers should be aware of the limitations or bugs in any UTF-16
> > > implementation they are likely to run into. Hence my general
> > > recommendation:
> > >
> > > - For transmission over networks or storage on permanent media (e.g. the
> > > content of text files), use UTF-8. It is well-implemented by nearly all
> > > languages that support Unicode, as far as I know.
> > >
> > > - If you are designing your own language, your implementation of Unicode
> > > strings should use something like Python's FSR, or UTF-8 with tweaks to
> > > make string indexing O(1) rather than O(N), or correctly-implemented
> > > UTF-16, or even UTF-32 if you have the memory. (Choices, choices.)
> >
> > FSR is possible in python for very specific pythonic reasons
> > - dynamicness
> > - immutable strings
> >
> > Drop either and FSR is impossible
> >
> > > If, in 2015, you design your Unicode implementation as if UTF-16 is a fixed
> > > 2-byte per code point format, you fail.
> >
> > Seems obvious enough.
> > So let's see...
> > Here's a 2-line python program -- runs well enough when run as a command.
> > Program:
> > =========
> > pp = "💩"
> > print (pp)
> > =========
> > Try to open it in idle3 and you get (at least I get):
> >
> > $ idle3 ff.py
> > Traceback (most recent call last):
> >   File "/usr/bin/idle3", line 5, in <module>
> >     main()
> >   File "/usr/lib/python3.4/idlelib/PyShell.py", line 1562, in main
> >     if flist.open(filename) is None:
> >   File "/usr/lib/python3.4/idlelib/FileList.py", line 36, in open
> >     edit = self.EditorWindow(self, filename, key)
> >   File "/usr/lib/python3.4/idlelib/PyShell.py", line 126, in __init__
> >     EditorWindow.__init__(self, *args)
> >   File "/usr/lib/python3.4/idlelib/EditorWindow.py", line 294, in __init__
> >     if io.loadfile(filename):
> >   File "/usr/lib/python3.4/idlelib/IOBinding.py", line 236, in loadfile
> >     self.text.insert("1.0", chars)
> >   File "/usr/lib/python3.4/idlelib/Percolator.py", line 25, in insert
> >     self.top.insert(index, chars, tags)
> >   File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 81, in insert
> >     self.addcmd(InsertCommand(index, chars, tags))
> >   File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 116, in addcmd
> >     cmd.do(self.delegate)
> >   File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 219, in do
> >     text.insert(self.index1, self.chars, self.tags)
> >   File "/usr/lib/python3.4/idlelib/ColorDelegator.py", line 82, in insert
> >     self.delegate.insert(index, chars, tags)
> >   File "/usr/lib/python3.4/idlelib/WidgetRedirector.py", line 148, in __call__
> >     return self.tk_call(self.orig_and_operation + args)
> > _tkinter.TclError: character U+1f4a9 is above the range (U+0000-U+FFFF) allowed by Tcl
> >
> > So who/what is broken?
> >
> > >
> > > - If you are using an existing language, be aware of any bugs and
> > > limitations in its Unicode implementation. You may or may not be able to
> > > work around them, but at least you can decide whether or not you wish to
> > > try.
> > >
> > > - If you are writing your own file system layer, it's 2015 fer fecks sake,
> > > file names should be Unicode strings, not bytes! (That's one part of the
> > > Unix model that needs to die.) You can use UTF-8 or UTF-16 in the file
> > > system, whichever you please, but again remember that both are
> > > variable-width formats.
> >
> > Correct.
> > Windows is broken for using UTF-16
> > Linux is broken for conflating UTF-8 and byte string.
> >
> > A lot of breakage out here, don't you think?
> > Maybe related to the equation
> >
> > UTF-16 = UCS-2 + Duct-tape
> >
> > ??
>
> =============
>
> 1) A copy/paste of pp = ... from google group into
> my Python interactive interpreter without intermediate
> state.
> 2) Some manipulations.
> 3) A copy/paste from my interpreter into google group.
>
> I hope the rendering will be correct.
>
> Python 3.2.5 (default, May 15 2013, 23:06:03) [MSC v.1500 32 bit (Intel)] on win32
> >>> eta runs etazero.py...
> ...etazero has been executed
> >>> pp = "💩"
> >>> print(pp)
> 💩
> >>> len(pp)
> 2
> >>> pp + pp + 'abc需' + pp
> '💩💩abc需💩'
> >>>
> >>> # ok, nine glyphs, individually selectable.
> >>>
>
>
> Note:
>
> len(pp) = 2 because of Py32. This is a deliberate
> choice to keep the Py32 "behaviour" in my interpreter.
>
> but also note:
>
> The code point is correctly displayed with a single "glyph".
> All the cut/copy/paste (e.g. Word, PDF, ...), cursor movement,
> selection, caret position, text wrapping, char typing, ..., mainly
> for rendering purposes, is done with my internal "artillery",
> full unicode.
>
> In my other GUI applications, everything is working fine,
> including string lengths, because my "artillery" works and
> also handles glyphs (including diacritical signs).
> Honestly, I'm not sure about bidi; however, the Hebrew I'm able
> to test is working fine.
>
> jmf
======
Test number 2.
Re-cut/copy/paste of what I sent, into my
interpreter.
>>>
>>> len('💩💩abc需💩')
12
>>>
Ok, fine.
Windows, Firefox, utf-16, ... are not so bad.
jmf
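Editorial sketch, not part of any poster's message: the two figures reported above (nine selectable glyphs, but a length of 12) are what one would expect if the Python 3.2 interpreter in question is a narrow build that counts UTF-16 code units rather than code points. The snippet below assumes Python 3.5 or later, where len() counts code points; `utf16_units` is a hypothetical helper name introduced only for illustration.

```python
def utf16_units(s):
    """Number of 16-bit code units needed to store s as UTF-16."""
    # Encode without a BOM; every code unit occupies two bytes.
    return len(s.encode("utf-16-le")) // 2


pile = "\U0001F4A9"  # the non-BMP character used in the thread
# Same string as in the post: three non-BMP characters plus six BMP ones.
sample = pile + pile + "abc\u00e9\u0153\u20ac" + pile

code_points = len(sample)         # code points (Python 3.3+ semantics)
code_units = utf16_units(sample)  # what a narrow build would report

# A single code point above U+FFFF is stored as a surrogate pair.
pair = pile.encode("utf-16-be").hex()

print(code_points, code_units, pair)
```

Each character above U+FFFF contributes two code units, which would account for the difference of three between the two counts.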
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2015-03-07 01:11 -0500 |
| Message-ID | <mailman.132.1425708701.21433.python-list@python.org> |
| In reply to | #87032 |
On 3/6/2015 11:20 AM, Rustom Mody wrote:
> =========
> pp = "💩"
> print (pp)
> =========
> Try to open it in idle3 and you get (at least I get):
>
> $ idle3 ff.py
> Traceback (most recent call last):
>   File "/usr/bin/idle3", line 5, in <module>
>     main()
>   File "/usr/lib/python3.4/idlelib/PyShell.py", line 1562, in main
>     if flist.open(filename) is None:
>   File "/usr/lib/python3.4/idlelib/FileList.py", line 36, in open
>     edit = self.EditorWindow(self, filename, key)
>   File "/usr/lib/python3.4/idlelib/PyShell.py", line 126, in __init__
>     EditorWindow.__init__(self, *args)
>   File "/usr/lib/python3.4/idlelib/EditorWindow.py", line 294, in __init__
>     if io.loadfile(filename):
>   File "/usr/lib/python3.4/idlelib/IOBinding.py", line 236, in loadfile
>     self.text.insert("1.0", chars)
>   File "/usr/lib/python3.4/idlelib/Percolator.py", line 25, in insert
>     self.top.insert(index, chars, tags)
>   File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 81, in insert
>     self.addcmd(InsertCommand(index, chars, tags))
>   File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 116, in addcmd
>     cmd.do(self.delegate)
>   File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 219, in do
>     text.insert(self.index1, self.chars, self.tags)
>   File "/usr/lib/python3.4/idlelib/ColorDelegator.py", line 82, in insert
>     self.delegate.insert(index, chars, tags)
>   File "/usr/lib/python3.4/idlelib/WidgetRedirector.py", line 148, in __call__
>     return self.tk_call(self.orig_and_operation + args)
> _tkinter.TclError: character U+1f4a9 is above the range (U+0000-U+FFFF) allowed by Tcl
>
> So who/what is broken?
tcl
The possible workaround is for Idle to translate "💩" to "\U0001f4a9"
(10 chars) before sending it to tk.
But some perspective. In the console interpreter:
>>> print("\U0001f4a9")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Programs\Python34\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f4a9' in position 0: character maps to <undefined>
So what is broken? Windows Command Prompt.
More perspective. tk/Idle *will* print *something* for every BMP char.
Command Prompt will not. It does not even do ucs-2 correctly. So
which is more broken? Windows Command Prompt. Who has perhaps
1,000,000 times more resources, Microsoft? or the tcl/tk group? I think
we all know.
--
Terry Jan Reedy
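Editorial sketch, not part of the post: the workaround described above (translating a non-BMP character to its 10-character "\U0001f4a9" escape before the text reaches tk) could look roughly like the following. This is not IDLE's actual code; `escape_non_bmp` and `_NON_BMP` are hypothetical names, and Python 3.3 or later is assumed.

```python
import re

# Matches any single code point outside the Basic Multilingual Plane.
_NON_BMP = re.compile("[\U00010000-\U0010FFFF]")


def escape_non_bmp(text):
    """Replace each non-BMP character with its \\UXXXXXXXX escape."""
    return _NON_BMP.sub(lambda m: "\\U%08x" % ord(m.group()), text)


source_line = 'pp = "\U0001f4a9"'
safe_line = escape_non_bmp(source_line)
print(safe_line)
```

BMP characters pass through unchanged, so only text that Tcl would reject is rewritten. The trade-off is that the user sees an escape sequence instead of the glyph.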
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2015-03-06 23:43 -0800 |
| Message-ID | <1d283a0a-914e-4a59-9d7a-da6975dbeb8f@googlegroups.com> |
| In reply to | #87076 |
On Saturday, 7 March 2015 at 07:11:53 UTC+1, Terry Reedy wrote:
> On 3/6/2015 11:20 AM, Rustom Mody wrote:
>
> > =========
> > pp = "💩"
> > print (pp)
> > =========
> > Try to open it in idle3 and you get (at least I get):
> >
> > $ idle3 ff.py
> > Traceback (most recent call last):
> >   File "/usr/bin/idle3", line 5, in <module>
> >     main()
> >   File "/usr/lib/python3.4/idlelib/PyShell.py", line 1562, in main
> >     if flist.open(filename) is None:
> >   File "/usr/lib/python3.4/idlelib/FileList.py", line 36, in open
> >     edit = self.EditorWindow(self, filename, key)
> >   File "/usr/lib/python3.4/idlelib/PyShell.py", line 126, in __init__
> >     EditorWindow.__init__(self, *args)
> >   File "/usr/lib/python3.4/idlelib/EditorWindow.py", line 294, in __init__
> >     if io.loadfile(filename):
> >   File "/usr/lib/python3.4/idlelib/IOBinding.py", line 236, in loadfile
> >     self.text.insert("1.0", chars)
> >   File "/usr/lib/python3.4/idlelib/Percolator.py", line 25, in insert
> >     self.top.insert(index, chars, tags)
> >   File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 81, in insert
> >     self.addcmd(InsertCommand(index, chars, tags))
> >   File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 116, in addcmd
> >     cmd.do(self.delegate)
> >   File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 219, in do
> >     text.insert(self.index1, self.chars, self.tags)
> >   File "/usr/lib/python3.4/idlelib/ColorDelegator.py", line 82, in insert
> >     self.delegate.insert(index, chars, tags)
> >   File "/usr/lib/python3.4/idlelib/WidgetRedirector.py", line 148, in __call__
> >     return self.tk_call(self.orig_and_operation + args)
> > _tkinter.TclError: character U+1f4a9 is above the range (U+0000-U+FFFF) allowed by Tcl
> >
> > So who/what is broken?
>
> tcl
> The possible workaround is for Idle to translate "💩" to "\U0001f4a9"
> (10 chars) before sending it to tk.
>
> But some perspective. In the console interpreter:
>
> >>> print("\U0001f4a9")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "C:\Programs\Python34\lib\encodings\cp437.py", line 19, in encode
>     return codecs.charmap_encode(input,self.errors,encoding_map)[0]
> UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f4a9' in position 0: character maps to <undefined>
>
> So what is broken? Windows Command Prompt.
>
> More perspective. tk/Idle *will* print *something* for every BMP char.
> Command Prompt will not. It does not even do ucs-2 correctly. So
> which is more broken? Windows Command Prompt. Who has perhaps
> 1,000,000 times more resources, Microsoft? or the tcl/tk group? I think
> we all know.
>
> --
> Terry Jan Reedy
Well...
D:\jm>cd wuni
D:\jm\wuni>jmtest2
Py 3.2.5 (default, May 15 2013, 23:06:03) [MSC v.1500 32 bit (Intel)]
Quelques caractères: «abc需ßÜÆŸçñö»
Loop: empty string => quit
—>abc
Votre entrée était : abc 3 caractère(s)
—>abc需
Votre entrée était : abc需 6 caractère(s)
—>abc需\u20acz\u03b1\z\u0430z
Wahrscheinlich falsches \uxxxx, (single, invalid backslash)
—>abc需\u20acz\u03b1z\u0430z
Votre entrée était : abc需€zαzаz 12 caractère(s)
—>Москва\\Zürich\\Αθήνα
Votre entrée était : Москва\Zürich\Αθήνα 19 caractère(s)
—>
Fin
D:\jm\wuni>
Python is "more broken" than the Windows terminal.
C# works, Ruby works, Julia works, Go works. Python? NOT.
jmf
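Editorial sketch, not part of the post: the UnicodeEncodeError quoted above is raised by Python's cp437 codec, the encoding it picked for that console, rather than by the terminal itself. A minimal illustration, assuming Python 3 and the same cp437 code page, shows the strict failure next to the standard "backslashreplace" error handler, which degrades to an escape instead of raising.

```python
text = "\U0001f4a9"

# Strict encoding (the default) fails: cp437 has no mapping for U+1F4A9.
try:
    text.encode("cp437")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With backslashreplace the unencodable character becomes an ASCII escape.
escaped = text.encode("cp437", errors="backslashreplace")

print(strict_ok, escaped)
```

On Python 3.7 and later, `sys.stdout.reconfigure(errors="backslashreplace")` applies the same handler to printed output. Whether that counts as working is exactly the question of perspective argued in this thread.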
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2015-03-07 00:55 -0800 |
| Message-ID | <a683f51e-20c5-4ab3-92fd-506b092e1dcf@googlegroups.com> |
| In reply to | #87076 |
On Saturday, 7 March 2015 at 07:11:53 UTC+1, Terry Reedy wrote:
> tcl
> The possible workaround is for Idle to translate "💩" to "\U0001f4a9"
> (10 chars) before sending it to tk.
>
Both are correct. It's a question of perspective.
In an interpreter, which presents the "soul" of the
language, "\U0001f4a9" makes more sense than a glyph.
For a general application, for an end user, displaying
a glyph makes more sense. See my previous comments.
----
Windows terminal: I do not wish to defend MS, but
despite its "unicode limitations", it is working
very well and it is certainly not buggy. Anyway,
for serious apps, one writes GUI apps.
tcl/tk? Yes, it is buggy and unusable (at least
on Windows).
jmf
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2015-03-07 01:08 -0800 |
| Message-ID | <60ad6440-340b-4d45-be5c-7f0c4ad6a8af@googlegroups.com> |
| In reply to | #87079 |
Le samedi 7 mars 2015 09:56:09 UTC+1, wxjm...@gmail.com a écrit : > > tcl/tk? Yes, it is buggy and unusable (at least > on Windows). > > jmf Important addendum. Not because it does not handle non BMP (SMP) chars. It's buggy with the BMP chars. jmf
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2015-03-07 21:25 -0800 |
| Message-ID | <7d2480b3-8a39-40d7-aa95-4f1aae95a1f8@googlegroups.com> |
| In reply to | #87076 |
On Saturday, March 7, 2015 at 11:41:53 AM UTC+5:30, Terry Reedy wrote:
> On 3/6/2015 11:20 AM, Rustom Mody wrote:
>
> > =========
> > pp = "💩"
> > print (pp)
> > =========
> > Try open it in idle3 and you get (at least I get):
> >
> > $ idle3 ff.py
> > Traceback (most recent call last):
> > File "/usr/bin/idle3", line 5, in <module>
> > main()
> > File "/usr/lib/python3.4/idlelib/PyShell.py", line 1562, in main
> > if flist.open(filename) is None:
> > File "/usr/lib/python3.4/idlelib/FileList.py", line 36, in open
> > edit = self.EditorWindow(self, filename, key)
> > File "/usr/lib/python3.4/idlelib/PyShell.py", line 126, in __init__
> > EditorWindow.__init__(self, *args)
> > File "/usr/lib/python3.4/idlelib/EditorWindow.py", line 294, in __init__
> > if io.loadfile(filename):
> > File "/usr/lib/python3.4/idlelib/IOBinding.py", line 236, in loadfile
> > self.text.insert("1.0", chars)
> > File "/usr/lib/python3.4/idlelib/Percolator.py", line 25, in insert
> > self.top.insert(index, chars, tags)
> > File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 81, in insert
> > self.addcmd(InsertCommand(index, chars, tags))
> > File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 116, in addcmd
> > cmd.do(self.delegate)
> > File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 219, in do
> > text.insert(self.index1, self.chars, self.tags)
> > File "/usr/lib/python3.4/idlelib/ColorDelegator.py", line 82, in insert
> > self.delegate.insert(index, chars, tags)
> > File "/usr/lib/python3.4/idlelib/WidgetRedirector.py", line 148, in __call__
> > return self.tk_call(self.orig_and_operation + args)
> > _tkinter.TclError: character U+1f4a9 is above the range (U+0000-U+FFFF) allowed by Tcl
> >
> > So who/what is broken?
>
> tcl
> The possible workaround is for Idle to translate "💩" to "\U0001f4a9"
> (10 chars) before sending it to tk.
>
> But some perspective. In the console interpreter:
>
> >>> print("\U0001f4a9")
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "C:\Programs\Python34\lib\encodings\cp437.py", line 19, in encode
> return codecs.charmap_encode(input,self.errors,encoding_map)[0]
> UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f4a9' in position 0: character maps to <undefined>
>
> So what is broken? Windows Command Prompt.
>
> More perspective. tk/Idle *will* print *something* for every BMP char.
> Command Prompt will not. It does not even do ucs-2 correctly. So
> which is more broken? Windows Command Prompt. Who has perhaps
> 1,000,000 times more resources, Microsoft? or the tcl/tk group? I think
> we all know.
Thanks Terry for the perspective.
From my side:
No complaints about python or tcl (or idle -- it's actually neater than emacs,
if only emacs were not burnt into my nervous system)
Even unicode -- only marginal complaints.
I wrote http://blog.languager.org/2015/02/universal-unicode.html
precisely to say that unicode is a wonderful thing and one should be
enthusiastic about it.
[You got that better than anyone else who has spoken -- Thanks]
Xah's pages are way more comprehensive than mine.
But comprehensive can be a negative -- ultimately the unicode standard is
the most comprehensive and correspondingly impenetrable without a compass.
The only very minor complaint I would make is:
If idle is unable to deal with SMP-chars and this is known and unlikely to change
(until TK changes), why not put up a dialog of the sort:
SMP char on line <nn>
SMP support currently unimplemented -- Sorry
instead of a backtrace?
[As I said just a suggestion]
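A minimal sketch of the kind of check suggested above, combined with the escape workaround Terry describes. This is not IDLE's actual code; the function names `find_non_bmp` and `escape_non_bmp` are hypothetical, and "non-BMP" here simply means any code point above U+FFFF.

```python
def find_non_bmp(text):
    """Return (line, column, char) for every code point above U+FFFF."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0xFFFF:
                hits.append((lineno, col, ch))
    return hits


def escape_non_bmp(text):
    """Replace each non-BMP character with its 10-character \\UXXXXXXXX escape."""
    return "".join(
        "\\U{:08x}".format(ord(ch)) if ord(ch) > 0xFFFF else ch
        for ch in text
    )


# The two-line file from the traceback above.
source = 'pp = "\U0001f4a9"\nprint (pp)\n'

for lineno, col, ch in find_non_bmp(source):
    print("SMP char U+{:04X} on line {}, column {}".format(ord(ch), lineno, col))

# What could be handed to a widget that only accepts U+0000-U+FFFF.
print(escape_non_bmp(source))
```

Either function could run before the text reaches tk: the first to produce a friendly message instead of a backtrace, the second to display the file with escapes in place of the unsupported characters.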
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2015-03-07 22:09 +1100 |
| Message-ID | <54fadc70$0$13004$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #86986 |
Rustom Mody wrote:
> On Thursday, March 5, 2015 at 7:36:32 PM UTC+5:30, Steven D'Aprano wrote:
[...]
>> Chris is suggesting that going from BMP to all of Unicode is not the hard
>> part. Going from ASCII to the BMP part of Unicode is the hard part. If
>> you can do that, you can go the rest of the way easily.
>
> Depends where the going is starting from.
> I specifically named Java, Javascript, Windows... among others.
> Here are some quotes from the supplementary chars doc of Java
> http://www.oracle.com/technetwork/articles/javase/supplementary-142654.html
>
> | Supplementary characters are characters in the Unicode standard whose
> | code points are above U+FFFF, and which therefore cannot be described as
> | single 16-bit entities such as the char data type in the Java
> | programming language. Such characters are generally rare, but some are
> | used, for example, as part of Chinese and Japanese personal names, and
> | so support for them is commonly required for government applications in
> | East Asian countries...
>
> | The introduction of supplementary characters unfortunately makes the
> | character model quite a bit more complicated.
>
> | Unicode was originally designed as a fixed-width 16-bit character
> | encoding. The primitive data type char in the Java programming language
> | was intended to take advantage of this design by providing a simple data
> | type that could hold any character.... Version 5.0 of the J2SE is
> | required to support version 4.0 of the Unicode standard, so it has to
> | support supplementary characters.
>
> My conclusion: Early adopters of unicode -- Windows and Java -- were
> punished for their early adoption. You can blame the unicode consortium,
> you can blame the babel of human languages, particularly that some use
> characters and some only (the equivalent of) what we call words.
I see you are blaming everyone except the people actually to blame.
It is 2015.
Unicode 2.0 introduced the SMPs in 1996, almost twenty years ago, the same year as the 1.0 release of Java. Java has had eight major new releases since then. Oracle, and Sun before them, are/were serious, tier-1, world-class major IT companies. Why haven't they done something about introducing proper support for Unicode in Java? It's not hard -- if Python can do it using nothing but volunteers, Oracle can do it. They could even do it in a backwards-compatible way, by leaving the existing APIs in place and adding new APIs.
As for Microsoft, as a member of the Unicode Consortium they have no excuse. But I think you exaggerate the lack of support for SMPs in Windows. Some parts of Windows have no SMP support, but they tend to be the oldest and less important (to Microsoft) parts, like the command prompt.
Anyone have Powershell and like to see how well it supports SMP?
This Stackoverflow question suggests that post-Windows 2000, the Windows file system has proper support for code points in the supplementary planes:
http://stackoverflow.com/questions/7870014/how-does-windows-wchar-t-handle-unicode-characters-outside-the-basic-multilingua
Or maybe not.
> Or you can skip the blame-game and simply note the fact that large
> segments of extant code-bases are currently in bug-prone or plain buggy
> state.
>
> This includes not just bug-prone-system code such as Java and Windows but
> seemingly working code such as python 3.
What Unicode bugs do you think Python 3.3 and above have?
>> I mostly agree with Chris. Supporting *just* the BMP is non-trivial in
>> UTF-8 and UTF-32, since that goes against the grain of the system. You
>> would have to program in artificial restrictions that otherwise don't
>> exist.
>
> Yes UTF-8 and UTF-32 make most of the objections to unicode 7.0
> irrelevant.
Glad you agree about that much at least.
[...]
>> Conclusion: faulty implementations of UTF-16 which incorrectly handle
>> surrogate pairs should be replaced by non-faulty implementations, or
>> changed to UTF-8 or UTF-32; incomplete Unicode implementations which
>> assume that Unicode is 16-bit only (e.g. UCS-2) are obsolete and should
>> be upgraded.
>
> Imagine for a moment a thought experiment -- we are not on a python but a
> java forum and please rewrite the above para.
There is no need to re-write it. If Java's only implementation of Unicode assumes that code points are 16 bits only, then Java needs a new Unicode implementation. (I assume that the existing one cannot be changed for backwards-compatibility reasons.)
> Are you addressing the vanilla java programmer? Language implementer?
> Designer? The Java-funders -- earlier Sun, now Oracle?
The last three should be considered the same people.
The vanilla Java programmer is not responsible for the short-comings of Java's implementation.
[...]
>> > In practice, standards change.
>> > However if a standard changes so frequently that users have to
>> > play catching cook and keep asking: "Which version?" they are justified
>> > in asking "Are the standard-makers doing due diligence?"
>>
>> Since Unicode has stability guarantees, and the encodings have not
>> changed in twenty years and will not change in the future, this argument
>> is bogus. Updating to a new version of the standard means, to a first
>> approximation, merely allocating some new code points which had
>> previously been undefined but are now defined.
>>
>> (Code points can be flagged deprecated, but they will never be removed.)
>
> It's not about new code points; it's about "Fits in 2 bytes" to "Does not
> fit in 2 bytes"
I quote you again: "if a standard changes so frequently..."
The move to more than 16 bits happened once. It happened almost 20 years ago. In what way does this count as frequent changes?
> If you call that argument bogus I call you a non computer scientist.
I am not a computer scientist, and the argument remains bogus. Unicode does not change "frequently", and changes are backward-compatible.
> [Essentially this is my issue with the consortium it seems to be working
> [like a bunch of linguists not computer scientists]
That's rather like complaining that some computer game looks like it was designed by games players instead of theoreticians. "Why, people have FUN playing this, almost like it was designed by professionals who think about gaming!!!"
Unicode is a standard intended for the handling of human languages. It is intended as a real-life working standard, not some theoretical toy for academics to experiment with. It is designed to be used, not to have papers written about it. The character set part of it has effectively been designed by linguists, and that is a good thing. But the encoding side of things has been designed by practising computer programmers such as Rob Pike and Ken Thompson. You might have heard of them.
> Here is Roy Smith's post that first started me thinking that something may
> be wrong with SMP
> https://groups.google.com/d/msg/comp.lang.python/loYWMJnPtos/GHMC0cX_hfgJ
There are plenty of things wrong with some implementations of Unicode, those that assume all code points are two bytes.
There may be a few things wrong with the current Unicode standard, such as missing characters, characters given the wrong name, and so forth.
But there's nothing wrong with the design of the SMP. It allows the great majority of text, probably 99% or more, to use two bytes (UTF-16) or no more than three bytes (UTF-8), while only relatively specialised uses need four bytes for some code points.
> Some parts are here some earlier and from my memory.
> If details wrong please correct:
> - 200 million records
> - Containing 4 strings with SMP characters
> - System made with python and mysql. SMP works with python, breaks mysql.
> So whole system broke due to those 4 in 200,000,000 records
No, they broke because MySQL has buggy Unicode handling.
Bugs are not unusual. I used to have a version of Apple's Hypercard which would lock up the whole operating system if you tried to display the string "0^0" in a message dialog. Given that classic Mac OS was not a proper multi-tasking OS like Unix or OS-X or even Windows, this was a real pain. My conclusion from that is that that version of Hypercard was buggy. What is your conclusion?
> I know enough (or not enough) of unicode to be chary of statistical
> conclusions from the above.
> My conclusion is essentially an 'existence-proof':
>
> SMP-chars can break systems.
Oh come on. How about this instead?
X can break systems, for every conceivable value of X.
> The breakage is costly-fied by the combination
> - layman statistical assumptions
> - BMP → SMP exercises different code-paths
>
> It is necessary but not sufficient to test print "hello world" in ASCII,
> BMP, SMP. You also have to write the hello world in the database -- mysql
> Read it from the webform -- javascript
> etc etc
Yes. This is called "integration testing". That's what professionals do.
> You could also choose to do with "astral crap" (Roy's words) what we all do
> with crap -- throw it out as early as possible.
And when Roy's customers demand that his product support emoji, or complain that they cannot spell their own name because of his parochial and ignorant idea of "crap", perhaps he will consider doing what he should have done from the beginning:
Stop using MySQL, which is a joke of a database[1], and use Postgres which does not have this problem.
[1] So I have been told.
--
Steven
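The byte counts mentioned in the post above (two bytes in UTF-16 or at most three in UTF-8 for BMP text, four bytes only beyond it) can be checked directly in Python 3. This is only an illustrative sketch; the helper name `encoded_sizes` and the three sample characters are assumptions chosen for the example.

```python
def encoded_sizes(ch):
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32.

    The explicit little-endian codecs are used so that no byte order mark
    is added to the count.
    """
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))


samples = {
    "ASCII U+0061": "a",
    "BMP   U+20AC": "\u20ac",
    "SMP   U+1F4A9": "\U0001f4a9",
}

for label, ch in samples.items():
    print(label, encoded_sizes(ch))
```

Code that assumes every code point fits in one 16-bit unit only ever exercises the first two rows, which is why the third row tends to be where such code breaks.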
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-03-07 22:33 +1100 |
| Message-ID | <mailman.137.1425728048.21433.python-list@python.org> |
| In reply to | #87083 |
On Sat, Mar 7, 2015 at 10:09 PM, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> Stop using MySQL, which is a joke of a database[1], and use Postgres which
> does not have this problem.
I agree with the recommendation, though to be fair to MySQL, it is now possible to store full Unicode. Though personally, I think the whole "UTF8MB3 vs UTF8MB4" split is an embarrassment and should be abolished *immediately* - not "we may change the meaning of UTF8 to be an alias for UTF8MB4 in the future", just completely abolish the distinction right now. (And deprecate the longer words.) There should be no reason to build any kind of "UTF-8 but limited to three bytes" encoding for anything. Ever.
But at least you can, if you configure things correctly, store any Unicode character in your TEXT field.
ChrisA
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2015-03-07 13:53 +0200 |
| Message-ID | <87twxxxbvd.fsf@elektro.pacujo.net> |
| In reply to | #87083 |
Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
> Rustom Mody wrote:
>> My conclusion: Early adopters of unicode -- Windows and Java -- were
>> punished for their early adoption. You can blame the unicode
>> consortium, you can blame the babel of human languages, particularly
>> that some use characters and some only (the equivalent of) what we
>> call words.
>
> I see you are blaming everyone except the people actually to blame.
I don't think you need to blame anybody. I think the UCS-2 mistake was both deplorable and very understandable. At the time it looked like the magic bullet to get out of the 8-bit mess. While 16-bit wide wchar_t's looked like a hugely expensive price, it was deemed forward-looking to pay it anyway to resolve the character set problem once and for all.
Linux was lucky to join the fray late enough to benefit from the bad UCS-2 experience. That said, UTF-8 does suffer badly from its not being a bijective mapping.
(Linux didn't quite dodge the bullet with pthreads, threads being another sad fad of the 1990's. The hippies that cooked up the fork system call should be awarded the next Millennium Prize. That foresight or stroke of luck has withstood the challenge of half a century.)
> But there's nothing wrong with the design of the SMP. It allows the
> great majority of text, probably 99% or more, to use two bytes
> (UTF-16) or no more than three bytes (UTF-8), while only relatively
> specialised uses need four bytes for some code points.
The main dream was a fixed-width encoding scheme. People thought 16 bits would be enough. The dream is so precious and true to us in the West that people don't want to give it up.
It may yet be that UTF-32 replaces all previous schemes since it has all the benefits of ASCII and only one drawback: redundancy. Maybe one day we'll declare the byte 32 bits wide and be done with it. In so many other aspects, 32-bit "bytes" are the de-facto reality already.
Even C coders routinely use 32 bits to express boolean values.
> And when Roy's customers demand that his product support emoji, or
> complain that they cannot spell their own name because of his
> parochial and ignorant idea of "crap", perhaps he will consider doing
> what he should have done from the beginning:
That's a recurring theme: Why didn't we do IPv6 from the get-go? Why didn't we do multi-user from the get-go? Why didn't we do localization from the get-go?
There comes a point when you have to release to start making money. You then suffer the consequences until your company goes bankrupt.
Marko
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-03-07 23:02 +1100 |
| Message-ID | <mailman.139.1425729786.21433.python-list@python.org> |
| In reply to | #87085 |
On Sat, Mar 7, 2015 at 10:53 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
> The main dream was a fixed-width encoding scheme. People thought 16 bits
> would be enough. The dream is so precious and true to us in the West
> that people don't want to give it up.
So... use Pike, or Python 3.3+?
ChrisA
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2015-03-07 14:07 +0000 |
| Message-ID | <mailman.142.1425737245.21433.python-list@python.org> |
| In reply to | #87085 |
On 07/03/2015 12:02, Chris Angelico wrote:
> On Sat, Mar 7, 2015 at 10:53 PM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> The main dream was a fixed-width encoding scheme. People thought 16 bits
>> would be enough. The dream is so precious and true to us in the West
>> that people don't want to give it up.
>
> So... use Pike, or Python 3.3+?
>
> ChrisA
>
Cue obligatory cobblers from our RUE.
--
My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language.
Mark Lawrence
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2015-03-07 07:28 -0800 |
| Message-ID | <57fb30fd-4efb-4e50-9708-96f4e108b870@googlegroups.com> |
| In reply to | #87085 |
On Saturday, 7 March 2015 at 12:53:24 UTC+1, Marko Rauhamaa wrote:
>
> It may yet be that UTF-32 replaces all previous schemes since it has all
> the benefits of ASCII and only one drawback: redundancy. Maybe one day
> we'll declare the byte 32 bits wide and be done with it. In so many
> other aspects, 32-bit "bytes" are the de-facto reality already. Even C
> coders routinely use 32 bits to express boolean values.
>
Like many, I use UTF-32 every day on my Win7 box with 2 GB of RAM. I have never once met a problem.
jmf
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2015-03-08 02:40 +1100 |
| Message-ID | <54fb1bf4$0$12993$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #87085 |
Marko Rauhamaa wrote:
> That said, UTF-8 does suffer badly from its not being
> a bijective mapping.
Can you explain?
As far as I am aware, every code point has one and only one valid UTF-8
encoding, and every UTF-8 encoding has one and only one valid code point.
There are *invalid* UTF-8 encodings, such as CESU-8, which is sometimes
mislabelled as UTF-8 (Oracle, I'm looking at you.) It violates the rule
that valid UTF-8 encodings are the shortest possible.
E.g. SMP code points should be encoded to four bytes using UTF-8:
py> u'\U0010FF01'.encode('utf-8') # U+10FF01
'\xf4\x8f\xbc\x81'
But in CESU-8, the code point is first interpreted as a UTF-16 surrogate
pair:
py> u'\U0010FF01'.encode('utf-16be')
'\xdb\xff\xdf\x01'
then each surrogate of the pair is treated as a 16-bit code unit and individually
encoded to three bytes using UTF-8:
py> u'\udbff'.encode('utf-8')
'\xed\xaf\xbf'
py> u'\udf01'.encode('utf-8')
'\xed\xbc\x81'
giving six bytes in total:
'\xed\xaf\xbf\xed\xbc\x81'
This is not UTF-8! But some software mislabels it as UTF-8.
--
Steven
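The session above is Python 2 syntax. Under Python 3 the strict UTF-8 codec refuses to encode lone surrogates, so the same six bytes can only be produced by asking for it explicitly. The following is a rough sketch of that, not a real CESU-8 codec: `cesu8_encode` is a hypothetical helper, and it relies on the standard `surrogatepass` error handler to let the surrogates through.

```python
def cesu8_encode(text):
    """Encode text the CESU-8 way: UTF-8 applied to each UTF-16 code unit."""
    out = bytearray()
    for ch in text:
        if ord(ch) <= 0xFFFF:
            # BMP characters are encoded exactly as in real UTF-8.
            out += ch.encode("utf-8")
            continue
        # Non-BMP: split into a UTF-16 surrogate pair (4 bytes, 2 code units)...
        utf16 = ch.encode("utf-16-be")
        for i in (0, 2):
            unit = int.from_bytes(utf16[i:i + 2], "big")
            # ...and encode each surrogate separately to three bytes.
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


ch = "\U0010FF01"
print(ch.encode("utf-8"))   # four bytes, real UTF-8
print(cesu8_encode(ch))     # six bytes, CESU-8

try:
    cesu8_encode(ch).decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict UTF-8 decoder rejects the CESU-8 bytes:", exc.reason)
```

The final step shows the practical consequence: a conforming UTF-8 decoder treats the six-byte form as an error, so data mislabelled this way fails at the boundary between systems.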
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2015-03-07 17:48 +0200 |
| Message-ID | <87twxw4xlz.fsf@elektro.pacujo.net> |
| In reply to | #87091 |
Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
> Marko Rauhamaa wrote:
>
>> That said, UTF-8 does suffer badly from its not being
>> a bijective mapping.
>
> Can you explain?
In Python terms, there are bytes objects b that don't satisfy:
b.decode('utf-8').encode('utf-8') == b
Marko
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-03-08 03:17 +1100 |
| Message-ID | <mailman.145.1425745085.21433.python-list@python.org> |
| In reply to | #87092 |
On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
>
>> Marko Rauhamaa wrote:
>>
>>> That said, UTF-8 does suffer badly from its not being
>>> a bijective mapping.
>>
>> Can you explain?
>
> In Python terms, there are bytes objects b that don't satisfy:
>
> b.decode('utf-8').encode('utf-8') == b
Please provide an example; that sounds like a bug. If there is any
invalid UTF-8 stream which decodes without an error, it is actually a
security bug, and should be fixed pronto in all affected and supported
versions.
ChrisA
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2015-03-07 18:25 +0200 |
| Message-ID | <87k2ysydtk.fsf@elektro.pacujo.net> |
| In reply to | #87099 |
Chris Angelico <rosuav@gmail.com>:
> On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
>>
>>> Marko Rauhamaa wrote:
>>>
>>>> That said, UTF-8 does suffer badly from its not being
>>>> a bijective mapping.
>>>
>>> Can you explain?
>>
>> In Python terms, there are bytes objects b that don't satisfy:
>>
>> b.decode('utf-8').encode('utf-8') == b
>
> Please provide an example; that sounds like a bug. If there is any
> invalid UTF-8 stream which decodes without an error, it is actually a
> security bug, and should be fixed pronto in all affected and supported
> versions.
Here's an example:
b = b'\x80'
Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping
from str objects to bytes objects.
Marko
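The two readings of "not bijective" in this exchange can be told apart with a small experiment. The sketch below is illustrative only; `classify` is a hypothetical helper that separates byte strings the decoder rejects from byte strings that decode but re-encode differently. The second case is the one that would indicate a bug.

```python
def classify(data):
    """Classify a bytes object by how it behaves under a UTF-8 round trip."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "rejected"          # not a UTF-8 stream at all
    if text.encode("utf-8") == data:
        return "round-trips"       # valid UTF-8, recovered byte for byte
    return "altered"               # decoded, but came back different


for sample in (b"abc", "Z\u00fcrich".encode("utf-8"), b"\x80", b"\xc0\x80"):
    print(sample, classify(sample))

# In the other direction, text without lone surrogates always survives.
for text in ("abc", "Z\u00fcrich", "\U0001f4a9"):
    assert text.encode("utf-8").decode("utf-8") == text
```

So the mapping is one-to-one between valid text and valid UTF-8 streams; what it is not is a mapping that covers every possible byte string.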
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-03-08 03:41 +1100 |
| Message-ID | <mailman.148.1425746496.21433.python-list@python.org> |
| In reply to | #87100 |
On Sun, Mar 8, 2015 at 3:25 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Chris Angelico <rosuav@gmail.com>:
>
>> On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>>> Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
>>>
>>>> Marko Rauhamaa wrote:
>>>>
>>>>> That said, UTF-8 does suffer badly from its not being
>>>>> a bijective mapping.
>>>>
>>>> Can you explain?
>>>
>>> In Python terms, there are bytes objects b that don't satisfy:
>>>
>>> b.decode('utf-8').encode('utf-8') == b
>>
>> Please provide an example; that sounds like a bug. If there is any
>> invalid UTF-8 stream which decodes without an error, it is actually a
>> security bug, and should be fixed pronto in all affected and supported
>> versions.
>
> Here's an example:
>
> b = b'\x80'
>
> Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping
> from str objects to bytes objects.
That's not the same as what you said. All you've proven is that there
are bit patterns which are not UTF-8 streams... which is a very
deliberate feature. How does UTF-8 *suffer* from this? It benefits
hugely!
ChrisA
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2015-03-07 18:54 +0200 |
| Message-ID | <87bnk4yci1.fsf@elektro.pacujo.net> |
| In reply to | #87103 |
Chris Angelico <rosuav@gmail.com>:
> On Sun, Mar 8, 2015 at 3:25 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>>>>> Marko Rauhamaa wrote:
>>>>>> That said, UTF-8 does suffer badly from its not being
>>>>>> a bijective mapping.
>>>>>
>> Here's an example:
>>
>> b = b'\x80'
>>
>> Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping
>> from str objects to bytes objects.
>
> That's not the same as what you said.
Except that it's precisely what I said.
> All you've proven is that there are bit patterns which are not UTF-8
> streams...
And that causes problems.
> which is a very deliberate feature.
Well, nobody desired it. It was just something that had to give. I believe you *could* have defined it as a bijective mapping but then you would have lost the sorting order correspondence.
> How does UTF-8 *suffer* from this? It benefits hugely!
You can't operate on file names and text files using Python strings. Or at least, you will need to add (nontrivial) exception catching logic.
Marko
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-03-08 03:58 +1100 |
| Message-ID | <mailman.151.1425747492.21433.python-list@python.org> |
| In reply to | #87108 |
On Sun, Mar 8, 2015 at 3:54 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
>> All you've proven is that there are bit patterns which are not UTF-8
>> streams...
>
> And that causes problems.
Demonstrate.
ChrisA
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-03-08 04:00 +1100 |
| Message-ID | <mailman.152.1425747654.21433.python-list@python.org> |
| In reply to | #87108 |
On Sun, Mar 8, 2015 at 3:54 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
> You can't operate on file names and text files using Python strings. Or
> at least, you will need to add (nontrivial) exception catching logic.
You can't operate on a JPG file using a Unicode string, nor an array of integers. What of it? You can't operate on an array of integers using a dictionary, either. So? How is this a failing of UTF-8?
If you really REALLY can't use the bytes() type to work with something that is, yaknow, bytes, then you could use an alternative encoding that has a value for every byte. It's still not Unicode text, so it doesn't much matter which encoding you use. But it's much better to use the bytes type to work with bytes. It is not text, so don't treat it as text.
ChrisA
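For readers who want to see the "encoding that has a value for every byte" option concretely, here is a small Python 3 sketch. Latin-1 is one such encoding; the `surrogateescape` error handler (PEP 383) is a second route that keeps valid UTF-8 readable and smuggles the remaining bytes through as lone surrogates. The variable names are only for illustration.

```python
# Every possible byte value once, which is certainly not valid UTF-8.
raw = bytes(range(256))

# Option 1: Latin-1 maps byte N to code point N, so nothing can fail.
as_latin1 = raw.decode("latin-1")
assert as_latin1.encode("latin-1") == raw

# Option 2: UTF-8 with surrogateescape. Undecodable bytes become lone
# surrogates U+DC80..U+DCFF and are turned back into the same bytes on encode.
as_escaped = raw.decode("utf-8", "surrogateescape")
assert as_escaped.encode("utf-8", "surrogateescape") == raw

# The escaped string is not clean text: a strict encode refuses it.
try:
    as_escaped.encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict encode fails:", exc.reason)
```

On POSIX systems, `os.fsdecode` and `os.fsencode` apply the second approach to file names, which is how Python 3 lets str-based path handling survive names that are not valid UTF-8. In both options the result is a carrier for bytes rather than meaningful text, which is the point being made above.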