| Started by | pierrick.brihaye@gmail.com |
|---|---|
| First post | 2015-02-24 02:49 -0800 |
| Last post | 2015-02-27 10:23 +1100 |
| Articles | 18 on this page of 158 — 19 participants |
Newbie question about text encoding pierrick.brihaye@gmail.com - 2015-02-24 02:49 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-24 22:09 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 06:25 -0500
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 15:55 +0100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:03 +1100
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:06 +0100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-24 08:01 -0800
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:07 +0100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:10 +1100
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 16:24 +0100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 02:33 +1100
Re: Newbie question about text encoding random832@fastmail.us - 2015-02-24 10:38 -0500
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 17:20 +0100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-25 03:24 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 12:13 -0500
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 20:45 +0100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-02-25 00:21 +0200
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-25 12:20 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-25 06:34 -0800
Re: Newbie question about text encoding Laura Creighton <lac@openend.se> - 2015-02-24 20:57 +0100
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-25 12:19 +1100
Re: Newbie question about text encoding Marcos Almeida Azevedo <marcos.al.azevedo@gmail.com> - 2015-02-25 12:54 +0800
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-24 15:41 -0500
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 04:40 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 05:15 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 00:24 +1100
Re: Newbie question about text encoding Sam Raker <sam.raker@gmail.com> - 2015-02-26 08:45 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 09:08 -0800
Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-02-26 12:02 -0500
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 09:59 -0800
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-26 12:20 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 09:13 +1100
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 12:05 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-26 20:57 -0500
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 16:58 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 02:30 -0500
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 22:54 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 09:02 -0500
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 01:22 +1100
Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 16:00 +0000
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 03:12 +1100
Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 16:45 +0000
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 04:45 +1100
Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 22:13 +0000
Re: Newbie question about text encoding MRAB <python@mrabarnett.plus.com> - 2015-02-27 19:14 +0000
Re: Newbie question about text encoding alister <alister.nospam.ware@ntlworld.com> - 2015-02-27 22:09 +0000
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 15:52 -0500
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-28 08:04 +1100
Re: Newbie question about text encoding Dave Angel <davea@davea.name> - 2015-02-27 10:24 -0500
Re: Newbie question about text encoding Grant Edwards <invalid@invalid.invalid> - 2015-02-27 17:46 +0000
Re: Newbie question about text encoding Grant Edwards <invalid@invalid.invalid> - 2015-02-27 17:47 +0000
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-02-27 01:06 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-02-26 11:59 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 10:03 -0800
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-03 10:36 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:45 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 15:54 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 21:05 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-06 01:06 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-05 06:59 -0800
Re: Newbie question about text encoding random832@fastmail.us - 2015-03-05 14:59 -0500
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-06 09:33 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-05 20:53 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-06 16:20 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 01:02 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 01:06 -0800
Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 08:33 -0500
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 00:39 +1100
Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 09:03 -0500
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 01:11 +1100
Re: Newbie question about text encoding random832@fastmail.us - 2015-03-06 09:27 -0500
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 03:26 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-06 20:54 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 02:07 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 01:50 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 02:27 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 07:37 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-06 08:20 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 03:45 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 11:41 -0800
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 11:58 -0800
Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-03-07 01:11 -0500
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-06 23:43 -0800
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 00:55 -0800
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 01:08 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 21:25 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-07 22:09 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 22:33 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 13:53 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-07 23:02 +1100
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 14:07 +0000
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 07:28 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 02:40 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 17:48 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:17 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:25 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:41 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:54 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:58 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:00 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:14 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:26 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:50 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 04:59 +1100
Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 18:02 +0000
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 05:13 +1100
Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 18:34 +0000
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 05:44 +1100
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 19:00 +0000
Re: Newbie question about text encoding Dan Sommers <dan@tombstonezero.net> - 2015-03-07 19:16 +0000
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 21:01 +0200
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 16:40 +0000
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 18:48 +0200
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 17:02 +0000
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-07 19:16 +0200
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 18:18 +0000
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 21:06 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 03:53 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-07 11:03 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 12:45 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 09:20 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 18:37 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 10:09 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-08 19:23 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-08 01:18 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 05:25 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-08 22:09 +0200
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 12:43 +1100
Re: Newbie question about text encoding Ben Finney <ben+python@benfinney.id.au> - 2015-03-09 13:09 +1100
Re: Newbie question about text encoding Marko Rauhamaa <marko@pacujo.net> - 2015-03-09 08:31 +0200
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 13:18 +1100
Re: Newbie question about text encoding random832@fastmail.us - 2015-03-09 00:27 -0400
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 07:55 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 08:13 +1100
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 17:34 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-09 17:44 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-09 02:08 -0700
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-09 07:26 -0700
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-09 05:28 -0700
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-08 19:01 +1100
Re: Newbie question about text encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-03-07 14:13 +0000
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-07 23:23 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-09 05:30 +1100
Re: Newbie question about text encoding Cameron Simpson <cs@zip.com.au> - 2015-03-09 13:09 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-08 19:42 -0700
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 19:16 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 05:43 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 18:53 -0800
Re: Newbie question about text encoding Terry Reedy <tjreedy@udel.edu> - 2015-03-03 18:30 -0500
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 13:54 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-03-04 14:02 +1100
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:05 -0800
Re: Newbie question about text encoding Rustom Mody <rustompmody@gmail.com> - 2015-03-03 20:16 -0800
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-03-04 19:14 +1100
Re: Newbie question about text encoding wxjmfauth@gmail.com - 2015-03-04 02:16 -0800
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 04:29 +1100
Re: Newbie question about text encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-27 10:09 +1100
Re: Newbie question about text encoding Chris Angelico <rosuav@gmail.com> - 2015-02-27 10:23 +1100
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2015-03-07 14:13 +0000 |
| Message-ID | <mailman.143.1425737633.21433.python-list@python.org> |
| In reply to | #87083 |
On 07/03/2015 11:09, Steven D'Aprano wrote:
> Rustom Mody wrote:
>
>>
>> This includes not just bug-prone-system code such as Java and Windows but
>> seemingly working code such as python 3.
>
> What Unicode bugs do you think Python 3.3 and above have?
>
Methinks somebody has been drinking too much loony juice. Either that or taking too much notice of our RUE.
Not that I've done a proper analysis, but to my knowledge there's nothing like the number of issues on the bug tracker for Unicode bugs for Python 3 compared to Python 2.
--
My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language.
Mark Lawrence
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2015-03-07 23:23 -0800 |
| Message-ID | <7cdb210c-c152-41a6-8afa-a0c0028f454e@googlegroups.com> |
| In reply to | #87083 |
On Saturday, March 7, 2015 at 4:39:48 PM UTC+5:30, Steven D'Aprano wrote:
> Rustom Mody wrote:
> > This includes not just bug-prone-system code such as Java and Windows but
> > seemingly working code such as python 3.
>
> What Unicode bugs do you think Python 3.3 and above have?
Literal/Legalistic answer:
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-2135
[And already quoted at
http://blog.languager.org/2015/03/whimsical-unicode.html
]
An answer more in the spirit of what I am trying to say:
Idle3, Roy's example and in general all systems that are
python-centric but use components outside of python that are
unicode-broken
IOW I would expect people (at least people with good faith) reading my
> bug-prone-system code...seemingly working code such as python 3...
to interpret that NOT as
"python 3 is seemingly working but actually broken"
But as
"Apps made with working system code (eg python3) can end up being broken
because of other non-working system code - eg mysql, java, javascript,
windows-shell, and ultimately windows, linux"
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2015-03-09 05:30 +1100 |
| Message-ID | <54fc9556$0$12994$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #87134 |
Rustom Mody wrote:
> On Saturday, March 7, 2015 at 4:39:48 PM UTC+5:30, Steven D'Aprano wrote:
>> Rustom Mody wrote:
>> > This includes not just bug-prone-system code such as Java and Windows
>> > but seemingly working code such as python 3.
>>
>> What Unicode bugs do you think Python 3.3 and above have?
>
> Literal/Legalistic answer:
> https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-2135
Nice one :-) but not exactly in the spirit of what we're discussing (as you
acknowledge below), so I won't discuss that.
> [And already quoted at
> http://blog.languager.org/2015/03/whimsical-unicode.html
> ]
>
> An answer more in the spirit of what I am trying to say:
> Idle3, Roy's example and in general all systems that are
> python-centric but use components outside of python that are
> unicode-broken
>
> IOW I would expect people (at least people with good faith) reading my
>
>> bug-prone-system code...seemingly working code such as python 3...
>
> to interpret that NOT as
>
> "python 3 is seemingly working but actually broken"
Why not? That is the natural interpretation of the sentence, particularly in
the context of your previous sentence:
[quote]
Or you can skip the blame-game and simply note the fact that
large segments of extant code-bases are currently in bug-prone
or plain buggy state.
This includes not just bug-prone-system code such as Java and
Windows but seemingly working code such as python 3.
[end quote]
The natural interpretation of this is that Python 3 is only *seemingly*
working, but is also an example of a code base in "bug-prone or plain buggy
state".
If that's not your intended meaning, then rather than casting aspersions on
my honesty ("good faith" indeed) you might accept that perhaps you didn't
quite manage to get your message across.
> But as
>
> "Apps made with working system code (eg python3) can end up being broken
> because of other non-working system code - eg mysql, java, javascript,
> windows-shell, and ultimately windows, linux"
Don't forget viruses or other malware, cosmic rays, processor bugs, dry
solder joints on the motherboard, faulty memory, and user-error.
I'm not sure what point you think you are making. If you want to discuss the
fact that complex systems have more interactions than simple systems, and
therefore more ways for things to go wrong, I will agree. I'll agree that
this is an issue with Python code that interacts with other systems which
may or may not implement Unicode correctly. There are a few ways to
interpret this:
(1) You're making a general point about the complexity of modern computing.
(2) You're making the point that dealing with text encodings in general, and
Unicode in particular, is hard because of the interaction of programming
language, database, file system, locale, etc.
(3) You're implying that Python ought to fix this problem somehow.
(4) You're implying that *Unicode* specifically is uniquely problematic in
this way. Or at least *unusual* to be problematic in this way.
I will agree with 1 and 2; I'll say that 3 would be nice but in the absence
of concrete proposals for how to fix it, it's just meaningless chatter. And
I'll disagree strongly with 4.
Unicode came into existence because legacy encodings suffer from similar
problems, only worse. (One major advantage of Unicode over previous
multi-byte encodings is that the UTF encodings are self-healing. A single
corrupted byte will, *at worst*, cause a single corrupted code point.)
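The self-healing property can be checked for UTF-8 with a small experiment. This is only a sketch: the helper name `corrupt_and_decode` and the sample string are invented for illustration, and the decoder is asked to substitute U+FFFD for undecodable bytes instead of raising.

```python
def corrupt_and_decode(text: str, byte_index: int, new_byte: int) -> str:
    """Encode text as UTF-8, overwrite one byte, and decode leniently."""
    data = bytearray(text.encode("utf-8"))
    data[byte_index] = new_byte
    # errors="replace" turns each undecodable byte sequence into U+FFFD
    return data.decode("utf-8", errors="replace")


original = "αβγδε"  # each Greek letter occupies two bytes in UTF-8
# Byte 4 is the lead byte of "γ"; overwrite it with ASCII "A" (0x41).
damaged = corrupt_and_decode(original, 4, 0x41)

print(repr(damaged))
# The damage stays inside the bytes that belonged to "γ":
# the letters before and after it decode exactly as before.
print(damaged.startswith("αβ"), damaged.endswith("δε"))
```

The orphaned continuation byte of "γ" shows up as a replacement character, but the decoder resynchronises at the next lead byte, so "δε" survives intact.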
In one sense, Unicode has solved these legacy encoding problems, in the
sense that if you always use a correct implementation of Unicode then you
won't *ever* suffer from problems like mojibake, broken strings and so
forth.
In another sense, Unicode hasn't solved these legacy problems because we
still have to deal with files using legacy encodings, as well as standards
organisations, operating systems, developers, applications and users who
continue to produce new content using legacy encodings, buggy or incorrect
implementations of the standard, also viruses, cosmic rays, dry solder
joints and user-error. How are these things Unicode's fault or
responsibility?
--
Steven
| From | Cameron Simpson <cs@zip.com.au> |
|---|---|
| Date | 2015-03-09 13:09 +1100 |
| Message-ID | <mailman.182.1425866969.21433.python-list@python.org> |
| In reply to | #87083 |
On 07Mar2015 22:09, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
>Rustom Mody wrote:
>>[...big snip...]
>> Some parts are here some earlier and from my memory.
>> If details wrong please correct:
>> - 200 million records
>> - Containing 4 strings with SMP characters
>> - System made with python and mysql. SMP works with python, breaks mysql.
>> So whole system broke due to those 4 in 200,000,000 records
>
>No, they broke because MySQL has buggy Unicode handling.
[...]
>> You could also choose do with "astral crap" (Roy's words) what we all do
>> with crap -- throw it out as early as possible.
>
>And when Roy's customers demand that his product support emoji, or complain
>that they cannot spell their own name because of his parochial and ignorant
>idea of "crap", perhaps he will consider doing what he should have done
>from the beginning:
>
>Stop using MySQL, which is a joke of a database[1], and use Postgres which
>does not have this problem.
>
>[1] So I have been told.
I use MySQL a fair bit, and Postgres very slightly. I would agree with your characterisation above; MySQL is littered with inconsistencies and arbitrary breakage, both in tools and SQL implementation. And Postgres has been a pure pleasure to work with, little though I have done that so far.
Cheers,
Cameron Simpson <cs@zip.com.au>
There is no human problem which could not be solved if people would simply do as I advise. - Gore Vidal
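For readers wondering what "SMP characters" means in practice, the following is a rough sketch of how such records could be found before they reach a database. The function names and sample records are made up for illustration, and the thread does not say how Roy's system was configured. The one outside fact relied on is that MySQL's legacy `utf8` character set stores at most three bytes per character, while `utf8mb4` stores up to four.

```python
def astral_chars(text: str) -> list:
    """Return the characters of text that lie outside the Basic Multilingual Plane."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


def max_utf8_width(text: str) -> int:
    """Largest number of UTF-8 bytes needed by any single character in text."""
    return max((len(ch.encode("utf-8")) for ch in text), default=0)


# Hypothetical records; only the last one contains a non-BMP character.
records = ["plain ascii", "café", "thumbs up \U0001F44D"]

flagged = [r for r in records if astral_chars(r)]
print(flagged)

for r in records:
    # Any record whose widest character needs 4 bytes will not fit in a
    # column limited to 3 bytes per character.
    print(repr(r), max_utf8_width(r))
```

A scan like this only locates the affected rows; whether to reject them, strip them, or change the storage character set is the policy question the thread is arguing about.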
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2015-03-08 19:42 -0700 |
| Message-ID | <bf5d739e-965a-4dbd-bd11-c322bd0dbe28@googlegroups.com> |
| In reply to | #87168 |
On Monday, March 9, 2015 at 7:39:42 AM UTC+5:30, Cameron Simpson wrote:
> On 07Mar2015 22:09, Steven D'Aprano wrote:
> >Rustom Mody wrote:
> >>[...big snip...]
> >> Some parts are here some earlier and from my memory.
> >> If details wrong please correct:
> >> - 200 million records
> >> - Containing 4 strings with SMP characters
> >> - System made with python and mysql. SMP works with python, breaks mysql.
> >> So whole system broke due to those 4 in 200,000,000 records
> >
> >No, they broke because MySQL has buggy Unicode handling.
> [...]
> >> You could also choose do with "astral crap" (Roy's words) what we all do
> >> with crap -- throw it out as early as possible.
> >
> >And when Roy's customers demand that his product support emoji, or complain
> >that they cannot spell their own name because of his parochial and ignorant
> >idea of "crap", perhaps he will consider doing what he should have done
> >from the beginning:
> >
> >Stop using MySQL, which is a joke of a database[1], and use Postgres which
> >does not have this problem.
> >
> >[1] So I have been told.
>
> I use MySQL a fair bit, and Postgres very slightly. I would agree with your
> characterisation above; MySQL is littered with inconsistencies and arbitrary
> breakage, both in tools and SQL implementation. And Postgres has been a pure
> pleasure to work with, little though I have done that so far.
>
> Cheers,
> Cameron Simpson
>
> There is no human problem which could not be solved if people would simply
> do as I advise. - Gore Vidal
I think that last quote sums up the issue best.
I've written to Intel asking them to make their next generation have 21-bit wide bytes.
Once they do that we will be back in the paradise we have been in for the last 40 years, which I call the 'Unix-assumption':
http://blog.languager.org/2014/04/unicode-and-unix-assumption.html
Until then...
We have to continue living in the real world, which includes 10 times more windows than linux users.
Is windows 10 times better an OS than linux?
In the 'real world' people make choices for all sorts of reasons. My guess is the top reason is the pointiness of the hair of the pointy-haired-boss.
Just like people choose windows over linux, people choose mysql over postgres, and that's the context of this discussion -- people stuck in sub-optimal choices
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2015-03-04 19:16 +1100 |
| Message-ID | <54f6bf5a$0$11122$c3e8da3@news.astraweb.com> |
| In reply to | #86886 |
Chris Angelico wrote:
> Also, the expansion from 16-bit was back in Unicode 2.0, not 7.0. Why
> do you keep talking about 7.0 as if it's a recent change?
This is the Internet. Lack of knowledge about something doesn't prevent people from having opinions about it.
--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-03-04 05:43 +1100 |
| Message-ID | <mailman.24.1425408236.21433.python-list@python.org> |
| In reply to | #86856 |
On Wed, Mar 4, 2015 at 5:03 AM, Rustom Mody <rustompmody@gmail.com> wrote:
> What I was trying to say expanded here
> http://blog.languager.org/2015/03/whimsical-unicode.html
> [Hope the word 'whimsical' is less jarring and more accurate than 'gibberish']
Re footnote #4: ½ is a single character for compatibility reasons. ⅟₁₀₀ doesn't need to be a single character, because there are countably infinite vulgar fractions and only 0x110000 Unicode characters.
ChrisA
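The difference between the two fractions can be inspected with the standard `unicodedata` module. This is a small sketch, with variable names chosen here for illustration: ½ is one precomposed code point carrying a compatibility decomposition, whereas ⅟₁₀₀ is assembled from FRACTION NUMERATOR ONE followed by three subscript digits.

```python
import unicodedata

half = "\u00bd"                            # ½
one_over_100 = "\u215f\u2081\u2080\u2080"  # ⅟₁₀₀ built from four code points

# One code point, with a <fraction> compatibility mapping to 1 + FRACTION SLASH + 2
print(len(half), unicodedata.name(half))
print(unicodedata.decomposition(half))

# Four code points: FRACTION NUMERATOR ONE plus three subscript digits
print(len(one_over_100), [unicodedata.name(ch) for ch in one_over_100])

# Compatibility normalisation (NFKC) flattens both to digits around U+2044
print(unicodedata.normalize("NFKC", half))
print(unicodedata.normalize("NFKC", one_over_100))
```

After NFKC both strings have the same shape, digits on either side of U+2044 FRACTION SLASH, which suggests the precomposed ½ exists for compatibility and is not the only way to write a fraction.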
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2015-03-03 18:53 -0800 |
| Message-ID | <18d9d5a7-dfb9-4e13-ada3-5ef97cf0543d@googlegroups.com> |
| In reply to | #86859 |
On Wednesday, March 4, 2015 at 12:14:11 AM UTC+5:30, Chris Angelico wrote:
> On Wed, Mar 4, 2015 at 5:03 AM, Rustom Mody wrote:
> > What I was trying to say expanded here
> > http://blog.languager.org/2015/03/whimsical-unicode.html
> > [Hope the word 'whimsical' is less jarring and more accurate than 'gibberish']
>
> Re footnote #4: ½ is a single character for compatibility reasons.
> ⅟₁₀₀ ...
  ^^^
Neat
Thanks
[And figured out some of quopri module along the way figuring that out]
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2015-03-03 18:30 -0500 |
| Message-ID | <mailman.27.1425425434.21433.python-list@python.org> |
| In reply to | #86856 |
On 3/3/2015 1:03 PM, Rustom Mody wrote:
> On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:
>> You should add emoticons, but not call them or the above 'gibberish'.
>> I think that this part of your post is more 'unprofessional' than the
>> character blocks. It is very jarring and seems contrary to your main point.
>
> Ok Done
>
> References to gibberish removed from
> http://blog.languager.org/2015/02/universal-unicode.html
>
> What I was trying to say expanded here
> http://blog.languager.org/2015/03/whimsical-unicode.html
> [Hope the word 'whimsical' is less jarring and more accurate than 'gibberish']
I agree with both.
--
Terry Jan Reedy
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2015-03-04 13:54 +1100 |
| Message-ID | <54f673e4$0$12980$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #86856 |
Rustom Mody wrote: > On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote: >> On 2/26/2015 8:24 AM, Chris Angelico wrote: >> > On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote: >> >> Wrote something up on why we should stop using ASCII: >> >> http://blog.languager.org/2015/02/universal-unicode.html >> >> I think that the main point of the post, that many Unicode chars are >> truly planetary rather than just national/regional, is excellent. > > <snipped> > >> You should add emoticons, but not call them or the above 'gibberish'. >> I think that this part of your post is more 'unprofessional' than the >> character blocks. It is very jarring and seems contrary to your main >> point. > > Ok Done > > References to gibberish removed from > http://blog.languager.org/2015/02/universal-unicode.html I consider it unethical to make semantic changes to a published work in place without acknowledgement. Fixing minor typos or spelling errors, or dead links, is okay. But any edit that changes the meaning should be commented on, either by an explicit note on the page itself, or by striking out the previous content and inserting the new. As for the content of the essay, it is currently rather unfocused. It appears to be more of a list of "here are some Unicode characters I think are interesting, divided into subgroups, oh and here are some I personally don't have any use for, which makes them silly" than any sort of discussion about the universality of Unicode. That makes it rather idiosyncratic and parochial. Why should obscure maths symbols be given more importance than obscure historical languages? I think that the universality of Unicode could be explained in a single sentence: "It is the aim of Unicode to be the one character set anyone needs to represent every character, ideogram or symbol (but not necessarily distinct glyph) from any existing or historical human language." I can expand on that, but in a nutshell that is it. 
You state:

"APL and Z Notation are two notable languages APL is a programming language and Z a specification language that did not tie themselves down to a restricted charset ..."

but I don't think that is correct. I'm pretty sure that neither APL nor Z allowed you to define new characters. They might not have used ASCII alone, but they still had a restricted character set. It was merely less restricted than ASCII.

You make a comment about Cobol's relative unpopularity, but (1) Cobol doesn't require you to write out numbers as English words, and (2) Cobol is still used, there are uncounted billions of lines of Cobol code being used, and if the number of Cobol programmers is less now than it was 16 years ago, there are still a lot of them. Academics and FOSS programmers don't think much of Cobol, but it has to count as one of the most amazing success stories in the field of programming languages, despite its lousy design.

You list ideographs such as Cuneiform under "Icons". They are not icons. They are a mixture of symbols used for consonants, syllables, and logophonetic, consonantal alphabetic and syllabic signs. That sits them firmly in the same categories as modern languages with consonants, ideogram languages like Chinese, and syllabary languages like Cheyenne.

Just because native readers of Cuneiform are all dead doesn't make Cuneiform unimportant. There are probably more people who need to write Cuneiform than people who need to write APL source code.

You make a comment:

"To me – a unicode-layman – it looks unprofessional… Billions of computing devices world over, each having billions of storage words having their storage wasted on blocks such as these??"

But that is nonsense, and it contradicts your earlier quoting of Dave Angel. Why are you so worried about an (illusionary) minor optimization?

Whether code points are allocated or not doesn't affect how much space they take up. There are millions of unused Unicode code points today.
If they are allocated tomorrow, the space your documents take up will not increase one byte.

Allocating code points to Cuneiform has not increased the space needed by Unicode at all. Two bytes alone is not enough for even existing human languages (thanks China). For hardware related reasons, it is faster and more efficient to use four bytes than three, so the obvious and "dumb" (in the simplest thing which will work) way to store Unicode is UTF-32, which takes a full four bytes per code point, regardless of whether there are 65537 code points or 1114112. That makes it less expensive than floating point numbers, which take eight. Would you like to argue that floating point doubles are "unprofessional" and wasteful?

As Dave pointed out, and you apparently agreed with him enough to quote him TWICE (once in each of two blog posts), history of computing is full of premature optimizations for space. (In fact, some of these may have been justified by the technical limitations of the day.) Technically Unicode is also limited, but it is limited to over one million code points, 1114112 to be exact, although some of them are reserved as invalid for technical reasons, and there is no indication that we'll ever run out of space in Unicode.

In practice, there are three common Unicode encodings that nearly all Unicode documents will use.

* UTF-8 will use between one and (by memory) four bytes per code
  point. For Western European languages, that will be mostly one
  or two bytes per character.

* UTF-16 uses a fixed two bytes per code point in the Basic Multilingual
  Plane, which is enough for nearly all Western European writing and
  much East Asian writing as well. For the rest, it uses a fixed four
  bytes per code point.

* UTF-32 uses a fixed four bytes per code point. Hardly anyone uses
  this as a storage format.

In *all three cases*, the existence of hieroglyphs and cuneiform in Unicode doesn't change the space used.
If you actually include a few hieroglyphs to your document, the space increases only by the actual space used by those hieroglyphs: four bytes per hieroglyph. At no time does the existence of a single hieroglyph in your document force you to expand the non-hieroglyph characters to use more space.

> What I was trying to say expanded here
> http://blog.languager.org/2015/03/whimsical-unicode.html

You have at least two broken links, referring to a non-existent page:

http://blog.languager.org/2015/03/unicode-universal-or-whimsical.html

This essay seems to be even more rambling and unfocused than the first. What does the cost of semi-conductor plants have to do with whether or not programmers support Unicode in their applications?

Your point about the UTF-8 "BOM" is valid only if you interpret it as a Byte Order Mark. But if you interpret it as an explicit UTF-8 signature or mark, it isn't so silly. If your text begins with the UTF-8 mark, treat it as UTF-8. It's no more silly than any other heuristic, like HTML encoding tags or text editor's encoding cookies.

Your discussion of "complexifiers and simplifiers" doesn't seem to be terribly relevant, or at least if it is relevant, you don't give any reason for it. The whole thing about Moore's Law and the cost of semi-conductor plants seems irrelevant to Unicode except in the most over-generalised sense of "things are bigger today than in the past, we've gone from five-bit Baudot codes to 23 bit Unicode". Yeah, okay. So what's your point?

You agree that 16-bits are not enough, and yet you criticise Unicode for using more than 16-bits on wasteful, whimsical gibberish like Cuneiform? That is an inconsistent position to take.

UTF-16 is not half-arsed Unicode support. UTF-16 is full Unicode support.

The problem is when your language treats UTF-16 as a fixed-width two-byte format instead of a variable-width, two- or four-byte format. (That's more or less like the old, obsolete, UCS-2 standard.)
There are all sorts of good ways to solve the problem of surrogate pairs and the SMPs in UTF-16. If some programming languages or software fails to do so, they are buggy, not UTF-16.

After explaining that 16 bits are not enough, you then propose a 16 bit standard. /face-palm

UTF-16 cannot break the fixed width invariant, because it has no fixed width invariant. That's like arguing against UTF-8 because it breaks the fixed width invariant "all characters are single byte ASCII characters".

If you cannot handle SMP characters, you are not supporting Unicode.

You suggest that Chinese users should be looking at Big5 or GB. I really, really don't think so.

- Neither is universal. What makes you think that Chinese writers need
  to use maths symbols, or include (say) Thai or Russian in their work
  any less than Western writers do?

- Neither even support all of Chinese. Big5 supports Traditional
  Chinese, but not Simplified Chinese. GB supports Simplified
  Chinese, but not Traditional Chinese.

- Big5 likewise doesn't support placenames, many people's names, and
  other less common parts of Chinese.

- Big5 is a shift-system, like Shift-JIS, and suffers from the same sort
  of data corruption issues.

- There is no one single Big5 standard, but a whole lot of vendor
  extensions.

You say:

"I just want to suggest that the Unicode consortium going overboard in adding zillions of codepoints of nearly zero usefulness, is in fact undermining unicode’s popularity and spread."

Can you demonstrate this? Can you show somebody who says "Well, I was going to support full Unicode, but since they added a snowman, I'm going to stick to ASCII"?

The "whimsical" characters you are complaining about were important enough to somebody to spend significant amounts of time and money to write up a proposal, have it go through the Unicode Consortium bureaucracy, and eventually have it accepted. That's not easy or cheap, and people didn't add a snowman on a whim.
They did it because there are a whole lot of people who want a shared standard for map symbols.

It is easy to mock what is not important to you. I daresay kids adding emoji to their 10 character tweets would mock all the useless maths symbols in Unicode too.

--
Steven
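The byte counts given in the post above for the three encodings can be checked directly in Python 3. What follows is a minimal sketch and not part of the original post: the sample characters and the helper name `encoded_sizes` are choices made here for illustration. The little-endian codec variants are used so that Python does not add a byte order mark to the count.

```python
import codecs

# Illustrative sample characters (chosen for this sketch).
SAMPLES = {
    "ASCII letter": "A",                  # U+0041
    "accented Latin letter": "\u00e9",    # U+00E9
    "CJK ideograph (BMP)": "\u4e2d",      # U+4E2D
    "Egyptian hieroglyph": "\U00013000",  # U+13000, outside the BMP
}


def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32."""
    return tuple(len(text.encode(enc))
                 for enc in ("utf-8", "utf-16-le", "utf-32-le"))


for label, ch in SAMPLES.items():
    print(f"{label:22} U+{ord(ch):05X} {encoded_sizes(ch)}")

# Adding a hieroglyph costs only its own bytes; the ASCII text around it
# keeps its original size in every encoding.
plain = "hello"
mixed = plain + "\U00013000"
growth = tuple(m - p for m, p in zip(encoded_sizes(mixed), encoded_sizes(plain)))
print(growth)  # (4, 4, 4)

# In UTF-16 a code point above U+FFFF is stored as a surrogate pair.
print("\U00013000".encode("utf-16-be").hex())  # d80cdc00

# The UTF-8 signature: the "utf-8-sig" codec strips it when decoding.
data = codecs.BOM_UTF8 + "text".encode("utf-8")
print(repr(data.decode("utf-8-sig")))  # 'text'
```

The ASCII letter takes 1, 2 and 4 bytes in the three encodings, while the hieroglyph takes 4 bytes in each, which is the "four bytes per hieroglyph" figure quoted above.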
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-03-04 14:02 +1100 |
| Message-ID | <mailman.29.1425438171.21433.python-list@python.org> |
| In reply to | #86874 |
On Wed, Mar 4, 2015 at 1:54 PM, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> It is easy to mock what is not important to you. I daresay kids adding emoji
> to their 10 character tweets would mock all the useless maths symbols in
> Unicode too.

Definitely! Who ever sings "do you wanna build an integral sign"?

ChrisA
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2015-03-03 20:05 -0800 |
| Message-ID | <601f597e-719a-4721-9620-1a7ea43de57d@googlegroups.com> |
| In reply to | #86874 |
On Wednesday, March 4, 2015 at 8:24:40 AM UTC+5:30, Steven D'Aprano wrote:
> Rustom Mody wrote:
>
> > On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:
> >> On 2/26/2015 8:24 AM, Chris Angelico wrote:
> >> > On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote:
> >> >> Wrote something up on why we should stop using ASCII:
> >> >> http://blog.languager.org/2015/02/universal-unicode.html
> >>
> >> I think that the main point of the post, that many Unicode chars are
> >> truly planetary rather than just national/regional, is excellent.
> >
> > <snipped>
> >
> >> You should add emoticons, but not call them or the above 'gibberish'.
> >> I think that this part of your post is more 'unprofessional' than the
> >> character blocks. It is very jarring and seems contrary to your main
> >> point.
> >
> > Ok Done
> >
> > References to gibberish removed from
> > http://blog.languager.org/2015/02/universal-unicode.html
>
> I consider it unethical to make semantic changes to a published work in
> place without acknowledgement. Fixing minor typos or spelling errors, or
> dead links, is okay. But any edit that changes the meaning should be
> commented on, either by an explicit note on the page itself, or by striking
> out the previous content and inserting the new.

Dunno What you are grumping about…

Anyway the attribution is made more explicit – footnote 5 in
http://blog.languager.org/2015/03/whimsical-unicode.html.

Note that Terry Reedy, who mainly objected, was already acked earlier.
I've just added one more ack¹
And JFTR the 'publication' (O how archaic!) is the whole blog not a single page just as it is for any other dead-tree publication.

>
> As for the content of the essay, it is currently rather unfocused.

True.
> It
> appears to be more of a list of "here are some Unicode characters I think
> are interesting, divided into subgroups, oh and here are some I personally
> don't have any use for, which makes them silly" than any sort of discussion
> about the universality of Unicode. That makes it rather idiosyncratic and
> parochial. Why should obscure maths symbols be given more importance than
> obscure historical languages?

Idiosyncratic ≠ parochial

>
> I think that the universality of Unicode could be explained in a single
> sentence:
>
> "It is the aim of Unicode to be the one character set anyone needs to
> represent every character, ideogram or symbol (but not necessarily distinct
> glyph) from any existing or historical human language."
>
> I can expand on that, but in a nutshell that is it.
>
>
> You state:
>
> "APL and Z Notation are two notable languages APL is a programming language
> and Z a specification language that did not tie themselves down to a
> restricted charset ..."

Tsk Tsk – dishonest snipping. I wrote

| APL and Z Notation are two notable languages APL is a programming language
| and Z a specification language that did not tie themselves down to a
| restricted charset even in the day that ASCII ruled.

so it's clear that the restricted applies to ASCII

>
> You list ideographs such as Cuneiform under "Icons". They are not icons.
> They are a mixture of symbols used for consonants, syllables, and
> logophonetic, consonantal alphabetic and syllabic signs. That sits them
> firmly in the same categories as modern languages with consonants, ideogram
> languages like Chinese, and syllabary languages like Cheyenne.

Ok changed to iconic.
Obviously 2-3 millennia ago, when people spoke hieroglyphs or cuneiform they were languages.
In 2015 when someone sees them and recognizes them, they are 'those things that
Sumerians/Egyptians wrote' No one except a rare expert knows those languages

>
> Just because native readers of Cuneiform are all dead doesn't make Cuneiform
> unimportant. There are probably more people who need to write Cuneiform
> than people who need to write APL source code.
>
> You make a comment:
>
> "To me – a unicode-layman – it looks unprofessional… Billions of computing
> devices world over, each having billions of storage words having their
> storage wasted on blocks such as these??"
>
> But that is nonsense, and it contradicts your earlier quoting of Dave Angel.
> Why are you so worried about an (illusionary) minor optimization?

2 < 4 as far as I am concerned.
[If you disagree one man's illusionary is another's waking]

>
> Whether code points are allocated or not doesn't affect how much space they
> take up. There are millions of unused Unicode code points today. If they
> are allocated tomorrow, the space your documents take up will not increase
> one byte.
>
> Allocating code points to Cuneiform has not increased the space needed by
> Unicode at all. Two bytes alone is not enough for even existing human
> languages (thanks China). For hardware related reasons, it is faster and
> more efficient to use four bytes than three, so the obvious and "dumb" (in
> the simplest thing which will work) way to store Unicode is UTF-32, which
> takes a full four bytes per code point, regardless of whether there are
> 65537 code points or 1114112. That makes it less expensive than floating
> point numbers, which take eight. Would you like to argue that floating
> point doubles are "unprofessional" and wasteful?
>
> As Dave pointed out, and you apparently agreed with him enough to quote him
> TWICE (once in each of two blog posts), history of computing is full of
> premature optimizations for space. (In fact, some of these may have been
> justified by the technical limitations of the day.)
> Technically Unicode is
> also limited, but it is limited to over one million code points, 1114112 to
> be exact, although some of them are reserved as invalid for technical
> reasons, and there is no indication that we'll ever run out of space in
> Unicode.
>
> In practice, there are three common Unicode encodings that nearly all
> Unicode documents will use.
>
> * UTF-8 will use between one and (by memory) four bytes per code
>   point. For Western European languages, that will be mostly one
>   or two bytes per character.
>
> * UTF-16 uses a fixed two bytes per code point in the Basic Multilingual
>   Plane, which is enough for nearly all Western European writing and
>   much East Asian writing as well. For the rest, it uses a fixed four
>   bytes per code point.
>
> * UTF-32 uses a fixed four bytes per code point. Hardly anyone uses
>   this as a storage format.
>
>
> In *all three cases*, the existence of hieroglyphs and cuneiform in Unicode
> doesn't change the space used. If you actually include a few hieroglyphs to
> your document, the space increases only by the actual space used by those
> hieroglyphs: four bytes per hieroglyph. At no time does the existence of a
> single hieroglyph in your document force you to expand the non-hieroglyph
> characters to use more space.
>
>
> > What I was trying to say expanded here
> > http://blog.languager.org/2015/03/whimsical-unicode.html
>
> You have at least two broken links, referring to a non-existent page:
>
> http://blog.languager.org/2015/03/unicode-universal-or-whimsical.html

Thanks corrected

>
> This essay seems to be even more rambling and unfocused than the first. What
> does the cost of semi-conductor plants have to do with whether or not
> programmers support Unicode in their applications?
>
> Your point about the UTF-8 "BOM" is valid only if you interpret it as a Byte
> Order Mark. But if you interpret it as an explicit UTF-8 signature or mark,
> it isn't so silly.
> If your text begins with the UTF-8 mark, treat it as
> UTF-8. It's no more silly than any other heuristic, like HTML encoding tags
> or text editor's encoding cookies.
>
> Your discussion of "complexifiers and simplifiers" doesn't seem to be
> terribly relevant, or at least if it is relevant, you don't give any reason
> for it. The whole thing about Moore's Law and the cost of semi-conductor
> plants seems irrelevant to Unicode except in the most over-generalised
> sense of "things are bigger today than in the past, we've gone from
> five-bit Baudot codes to 23 bit Unicode". Yeah, okay. So what's your point?

- Most people need only 16 bits.
- Many notable examples of software fail going from 16 to 23.
- If you are a software writer, and you fail going 16 to 23 it's ok but try to
  give useful errors

>
> You agree that 16-bits are not enough, and yet you criticise Unicode for using
> more than 16-bits on wasteful, whimsical gibberish like Cuneiform? That is
> an inconsistent position to take.

| ½-assed unicode support – BMP-only – is better than 1/100-assed⁴ support –
| ASCII. BMP-only Unicode is universal enough but within practical limits
| whereas full (7.0) Unicode is 'really' universal at a cost of performance and
| whimsicality.

Do you disagree that BMP-only = 16 bits?

>
> UTF-16 is not half-arsed Unicode support. UTF-16 is full Unicode support.
>
> The problem is when your language treats UTF-16 as a fixed-width two-byte
> format instead of a variable-width, two- or four-byte format. (That's more
> or less like the old, obsolete, UCS-2 standard.) There are all sorts of
> good ways to solve the problem of surrogate pairs and the SMPs in UTF-16.
> If some programming languages or software fails to do so, they are buggy,
> not UTF-16.
>
> After explaining that 16 bits are not enough, you then propose a 16 bit
> standard. /face-palm
>
> UTF-16 cannot break the fixed width invariant, because it has no fixed width
> invariant.
> That's like arguing against UTF-8 because it breaks the fixed
> width invariant "all characters are single byte ASCII characters".
>
> If you cannot handle SMP characters, you are not supporting Unicode.

7.0

>
>
> You suggest that Chinese users should be looking at Big5 or GB. I really,
> really don't think so.
>
> - Neither is universal. What makes you think that Chinese writers need
>   to use maths symbols, or include (say) Thai or Russian in their work
>   any less than Western writers do?
>
> - Neither even support all of Chinese. Big5 supports Traditional
>   Chinese, but not Simplified Chinese. GB supports Simplified
>   Chinese, but not Traditional Chinese.
>
> - Big5 likewise doesn't support placenames, many people's names, and
>   other less common parts of Chinese.
>
> - Big5 is a shift-system, like Shift-JIS, and suffers from the same sort
>   of data corruption issues.
>
> - There is no one single Big5 standard, but a whole lot of vendor
>   extensions.
>
>
> You say:
>
> "I just want to suggest that the Unicode consortium going overboard in
> adding zillions of codepoints of nearly zero usefulness, is in fact
> undermining unicode’s popularity and spread."
>
> Can you demonstrate this? Can you show somebody who says "Well, I was going
> to support full Unicode, but since they added a snowman, I'm going to stick
> to ASCII"?

I gave a list of softwares which goof/break going BMP to 7.0 unicode

>
> The "whimsical" characters you are complaining about were important enough
> to somebody to spend significant amounts of time and money to write up a
> proposal, have it go through the Unicode Consortium bureaucracy, and
> eventually have it accepted. That's not easy or cheap, and people didn't
> add a snowman on a whim. They did it because there are a whole lot of
> people who want a shared standard for map symbols.
>
> It is easy to mock what is not important to you.
> I daresay kids adding emoji
> to their 10 character tweets would mock all the useless maths symbols in
> Unicode too.

Head para of section 5 has:

| However (the following) are (in the standard)! So lets use them!

Looks like mocking to you?

The only mocking is at 5.1. And even here I don't mock the users of these blocks – now or millennia ago. I only mock the unicode consortium for putting them into unicode

----------------------
¹ And somewhere around here we get into Gödelian problems -- known to programmers under the form "Write a program that prints itself". Likewise Acks.
I am going to deal with the Gödel-loop by the device:
- Address real issues/objects
- Smile at grumpiness
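The request above to "give useful errors" when software cannot go beyond the BMP might look something like the following in Python 3. This is only a sketch under stated assumptions: `first_non_bmp` and `require_bmp` are hypothetical helper names invented here, not functions from any library or from the blog post under discussion.

```python
import unicodedata

BMP_MAX = 0xFFFF  # last code point of the Basic Multilingual Plane


def first_non_bmp(text):
    """Return (index, character) of the first code point above U+FFFF, or None."""
    for index, ch in enumerate(text):
        if ord(ch) > BMP_MAX:
            return index, ch
    return None


def require_bmp(text):
    """Return text unchanged, or raise ValueError naming the offending character."""
    hit = first_non_bmp(text)
    if hit is not None:
        index, ch = hit
        # Fall back to a generic label if the character has no Unicode name.
        name = unicodedata.name(ch, "unnamed character")
        raise ValueError(
            f"U+{ord(ch):04X} ({name}) at index {index} is outside the BMP; "
            "this component only handles code points up to U+FFFF"
        )
    return text


print(require_bmp("caf\u00e9"))
try:
    require_bmp("snowman \u2603, hieroglyph \U00013000")
except ValueError as exc:
    print(exc)
```

The snowman (U+2603) passes because it lies inside the BMP; only the hieroglyph triggers the error, and the message says which character and where, rather than failing silently or corrupting the text.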
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2015-03-03 20:16 -0800 |
| Message-ID | <debcdbc6-bb2d-4a22-9716-5f6c9afb2f37@googlegroups.com> |
| In reply to | #86882 |
On Wednesday, March 4, 2015 at 9:35:28 AM UTC+5:30, Rustom Mody wrote:
> On Wednesday, March 4, 2015 at 8:24:40 AM UTC+5:30, Steven D'Aprano wrote:

[...]

> > Your discussion of "complexifiers and simplifiers" doesn't seem to be
> > terribly relevant, or at least if it is relevant, you don't give any reason
> > for it. The whole thing about Moore's Law and the cost of semi-conductor
> > plants seems irrelevant to Unicode except in the most over-generalised
> > sense of "things are bigger today than in the past, we've gone from
> > five-bit Baudot codes to 23 bit Unicode". Yeah, okay. So what's your point?
>
> - Most people need only 16 bits.
> - Many notable examples of software fail going from 16 to 23.
> - If you are a software writer, and you fail going 16 to 23 its ok but try to
>   give useful errors

Uh… 21

That's what makes 3 chars per 64-bit word a possibility.
A possibility that can become realistic if/when Intel decides to add 'packed-unicode' string instructions.
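The arithmetic behind "3 chars per 64-bit word": the highest code point, U+10FFFF, fits in 21 bits, so three code points need 63 bits. A purely illustrative Python 3 sketch follows; `pack3` and `unpack3` are hypothetical names made up here, and nothing in it implies that any 'packed-unicode' hardware instruction exists.

```python
BITS = 21
MASK = (1 << BITS) - 1  # 0x1FFFFF, large enough to hold U+10FFFF


def pack3(chars):
    """Pack exactly three characters into one integer of at most 63 bits."""
    if len(chars) != 3:
        raise ValueError("pack3 expects exactly three characters")
    word = 0
    for ch in chars:
        # Shift the earlier characters up and append the next code point.
        word = (word << BITS) | ord(ch)
    return word


def unpack3(word):
    """Recover the three characters stored by pack3."""
    return "".join(chr((word >> shift) & MASK) for shift in (2 * BITS, BITS, 0))


sample = "a\u4e2d\U00013000"  # ASCII, BMP and non-BMP characters together
word = pack3(sample)
print(hex(word), word.bit_length())
print(unpack3(word) == sample)
```

Even three copies of U+10FFFF pack into a value below 2**63, so one bit of the 64-bit word is left over.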
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2015-03-04 19:14 +1100 |
| Message-ID | <54f6bee5$0$11122$c3e8da3@news.astraweb.com> |
| In reply to | #86882 |
Rustom Mody wrote:

> On Wednesday, March 4, 2015 at 8:24:40 AM UTC+5:30, Steven D'Aprano wrote:

>> I consider it unethical to make semantic changes to a published work in
>> place without acknowledgement. Fixing minor typos or spelling errors, or
>> dead links, is okay. But any edit that changes the meaning should be
>> commented on, either by an explicit note on the page itself, or by
>> striking out the previous content and inserting the new.
>
> Dunno What you are grumping about…

You published something on a blog. And then you edited it, not to correct a typo, but to make a potentially substantial change to semantics, without noting that fact. I consider that unethical.

Reputable journalists also consider it unethical to change a published work in place without comment, that is why if they have to correct an online post or article, they put a note (usually at the bottom of the page) stating the nature of the correction made.

E.g. "an earlier version of this story stated blah, which is incorrect and has now been corrected."

Putting the correction in another post is not good enough, for obvious reasons. People don't read a blog as a unified single piece, they read it as individual posts.

In this case, I *assume* that the change only changes the tone rather than the actual meaning of the text, since I haven't seen the before-and-after versions. I'm making a general comment about the ethics of blogging.

> And JFTR the 'publication' (O how archaic!) is the whole blog not a single
> page just as it is for any other dead-tree publication.

"Any other dead-tree publication"? An internet blog is not a dead-tree publication. And there's nothing archaic about publishing work on the Internet. What a foolish thing to say.

>> As for the content of the essay, it is currently rather unfocused.
>
> True.
> > It >> appears to be more of a list of "here are some Unicode characters I think >> are interesting, divided into subgroups, oh and here are some I >> personally don't have any use for, which makes them silly" than any sort >> of discussion about the universality of Unicode. That makes it rather >> idiosyncratic and parochial. Why should obscure maths symbols be given >> more importance than obscure historical languages? > > Idiosyncratic ≠ parochial I know. That's why I said "idiosyncratic and parochial" rather than just picking one. It is both. [...] >> You state: >> >> "APL and Z Notation are two notable languages APL is a programming >> language and Z a specification language that did not tie themselves down >> to a restricted charset ..." > > Tsk Tsk – dihonest snipping. I wrote > > | APL and Z Notation are two notable languages APL is a programming > | language and Z a specification language that did not tie themselves down > | to a restricted charset even in the day that ASCII ruled. > > so its clear that the restricted applies to ASCII It is not clear at all, and in fact ASCII is irrelevant. Even in the days that "ASCII ruled", there were dozens, maybe hundreds of restricted charsets. EBCDIC, national variants of ASCII, mutations of it like PETSCII (used on Commodore machines), 8-bit code pages... APL was invented in 1964, the first public draft of ASCII was 1963 just one year earlier. In 1964, ASCII was not commonly used in computing, it was a seven-bit teleprinter code. ASCII didn't get fully established in computing until 1968, when the US government mandated that starting from 1969 all computers purchased by the government had to support ASCII. When APL was invented, ASCII wasn't even relevant. >> You list ideographs such as Cuneiform under "Icons". They are not icons. >> They are a mixture of symbols used for consonants, syllables, and >> logophonetic, consonantal alphabetic and syllabic signs. 
That sits them >> firmly in the same categories as modern languages with consonants, >> ideogram languages like Chinese, and syllabary languages like Cheyenne. > > Ok changed to iconic. > Obviously 2-3 millenia ago, when people spoke hieroglyphs or cuneiform o_O People don't speak hieroglyphs, except in Asterisk The Gaul comics. People speak words. > they were languages. In 2015 when someone sees them and recognizes them, > they are 'those things that Sumerians/Egyptians wrote' No one except a > rare expert knows those languages True. But there are people who are not "rare experts" but still have need to use cuneiform or hieroglyphs in their works, just like not everybody who writes about mathematics is "a rare expert" mathematician. >> Just because native readers of Cuneiform are all dead doesn't make >> Cuneiform unimportant. There are probably more people who need to write >> Cuneiform than people who need to write APL source code. >> >> You make a comment: >> >> "To me – a unicode-layman – it looks unprofessional… Billions of >> computing devices world over, each having billions of storage words >> having their storage wasted on blocks such as these??" >> >> But that is nonsense, and it contradicts your earlier quoting of Dave >> Angel. Why are you so worried about an (illusionary) minor optimization? > > 2 < 4 as far as I am concerned. > [If you disagree one man's illusionary is another's waking] You can't have it both ways. You acknowledge that 16-bits are not sufficient for a universal character set, then criticize Unicode for using more than 16-bits. This is inconsistent and foolish. [...] >> Your discussion of "complexifiers and simplifiers" doesn't seem to be >> terribly relevant, or at least if it is relevant, you don't give any >> reason for it. 
The whole thing about Moore's Law and the cost of >> semi-conductor plants seems irrelevant to Unicode except in the most >> over-generalised sense of "things are bigger today than in the past, >> we've gone from five-bit Baudot codes to 23 bit Unicode". Yeah, okay. So >> what's your point? > > - Most people need only 16 bits. I don't know about "most" people, but there are over one billion Chinese whose native language simply doesn't fit into 16 bits. > - Many notable examples of software fail going from 16 to 23. > - If you are a software writer, and you fail going 16 to 23 its ok but try > to give useful errors No it isn't okay. >> You agree that 16-bits are not enough, and yet you critice Unicode for >> using more than 16-bits on wasteful, whimsical gibberish like Cuneiform? >> That is an inconsistent position to take. > > | ½-assed unicode support – BMP-only – is better than 1/100-assed⁴ support > | – > | ASCII. BMP-only Unicode is universal enough but within practical limits > | whereas full (7.0) Unicode is 'really' universal at a cost of > | performance and whimsicality. > > Do you disagree that BMP-only = 16 bits? That point is not in question. Unicode was extended beyond 16 bits because 16 bits *is not enough* even for existing human languages in common use. As for performance, you contradict yourself. You've quoted Dave TWICE about all these artificial limits imposed which turned out to be too low, and here you are doing exactly the same thing. [...] >> You say: >> >> "I just want to suggest that the Unicode consortium going overboard in >> adding zillions of codepoints of nearly zero usefulness, is in fact >> undermining unicode’s popularity and spread." >> >> Can you demonstrate this? Can you show somebody who says "Well, I was >> going to support full Unicode, but since they added a snowman, I'm going >> to stick to ASCII"? > > I gave a list of softwares which goof/break going BMP to 7.0 unicode Irrelevant to my question. 
You didn't say that Unicode was being undermined by buggy programming languages, you stated it was being undermined by the addition of characters of "nearly zero usefulness". Citation please. >> >> The "whimsical" characters you are complaining about were important >> enough to somebody to spend significant amounts of time and money to >> write up a proposal, have it go through the Unicode Consortium >> bureaucracy, and eventually have it accepted. That's not easy or cheap, >> and people didn't add a snowman on a whim. They did it because there are >> a whole lot of people who want a shared standard for map symbols. >> >> It is easy to mock what is not important to you. I daresay kids adding >> emoji to their 10 character tweets would mock all the useless maths >> symbols in Unicode too. > > Head para of section 5 has: > | However (the following) are (in the standard)! So lets use them! > Looks like mocking to you No. The part where you say they are "gibberish" or "whimsical" and make zero effort to understand why they were added is mocking. The part where your argument basically boils down to "I personally have no need for these characters, therefore the Unicode Consortium is silly for adding them." > The only mocking is at 5.1. And even here I dont mock the users of these > blocks – now or millenia ago. I only mock the unicode consortium for > putting them into unicode Exactly. -- Steve
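As a side note on the 16-bit argument in this post, here is a minimal Python 3 sketch of which code points fit in the BMP and which do not. The sample characters and the `describe` helper are illustrative choices, not something taken from the thread. It also reports the size of the full code space: the highest code point, U+10FFFF, fits in 21 bits, a little less than the "23 bit" figure quoted above.

```python
import sys

# Sample characters picked for illustration (not taken from the thread).
samples = {
    "LATIN SMALL LETTER A": "a",
    "SNOWMAN": "\u2603",
    "CJK ideograph in the BMP": "\u4e2d",
    "CUNEIFORM SIGN A": "\U00012000",
    "CJK ideograph in Extension B": "\U00020000",
}

def describe(ch):
    """Return (code point, bits needed, UTF-16 code units) for one character."""
    cp = ord(ch)
    utf16_units = len(ch.encode("utf-16-le")) // 2  # 2 bytes per code unit
    return cp, cp.bit_length(), utf16_units

for label, ch in samples.items():
    cp, bits, units = describe(ch)
    print(f"U+{cp:04X}  {bits:2d} bits  {units} UTF-16 unit(s)  {label}")

# Size of the whole code space: the highest code point and its bit length.
print(hex(sys.maxunicode), sys.maxunicode.bit_length())
```

Anything above U+FFFF needs two UTF-16 code units (a surrogate pair), which is typically where BMP-only software goes wrong.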
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2015-03-04 02:16 -0800 |
| Message-ID | <0b4484c7-b213-49ee-9098-1eeeb3aabcb6@googlegroups.com> |
| In reply to | #86891 |
On Wednesday, 4 March 2015 at 09:14:42 UTC+1, Steven D'Aprano wrote:
>
> o_O
>
> People don't speak hieroglyphs, except in Asterisk The Gaul comics. People
> speak words.
>

http://www.asterix.com/asterix-de-a-a-z/les-personnages/tumeheris.html

jmf
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-02-27 04:29 +1100 |
| Message-ID | <mailman.19277.1424971771.18130.python-list@python.org> |
| In reply to | #86495 |
On Fri, Feb 27, 2015 at 4:02 AM, Terry Reedy <tjreedy@udel.edu> wrote:
> On 2/26/2015 8:24 AM, Chris Angelico wrote:
>>
>> On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody <rustompmody@gmail.com>
>> wrote:
>>>
>>> Wrote something up on why we should stop using ASCII:
>>> http://blog.languager.org/2015/02/universal-unicode.html
>
>
> I think that the main point of the post, that many Unicode chars are truly
> planetary rather than just national/regional, is excellent.

Agreed. Like you, though, I take exception to the "Gibberish" section.

Unicode offers us a number of types of character needed by linguists:

1) Letters[1] common to many languages, such as the unadorned Latin and Cyrillic letters
2) Letters specific to one or very few languages, such as the Turkish dotless i
3) Diacritical marks, ready to be combined with various letters
4) Precomposed forms of various common "letter with diacritical" combinations
5) Other precomposed forms, eg ligatures and Hangul syllables
6) Symbols, punctuation, and various other marks
7) Spacing of various widths and attributes

Apart from #4 and #5, which could be avoided by using the decomposed forms everywhere, each of these character types is vital. You can't typeset a document without being able to adequately represent every part of it. Then there are additional characters that aren't strictly necessary, but are extremely convenient, such as the emoticon sections. You can talk in text and still put in a nice little picture of a globe, or the monkey-no-evil set, etc.

Most of these characters - in fact, all except #2 and maybe a few of the diacritical marks - are used in multiple places/languages. Unicode isn't about taking everyone's separate character sets and numbering them all so we can reference characters from anywhere; if you wanted that, you'd be much better off with something that lets you specify a code page in 16 bits and a character in 8, which is roughly the same size as Unicode anyway. What we have is, instead, a system that brings them all together - LATIN SMALL LETTER A is U+0061 no matter whether it's being used to write English, French, Malaysian, Turkish, Croatian, Vietnamese, or Icelandic text.

Unicode is truly planetary.

ChrisA

[1] I use the word "letter" loosely here; Chinese and Japanese don't have a concept of letters as such, but their glyphs are still represented.
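For readers who want to poke at the character types listed in this post, here is a small sketch using Python's standard `unicodedata` module. The specific characters (é and the Hangul syllable 한) are my own examples, not ones named in the thread.

```python
import unicodedata

# One code point for the letter, whatever language it is used to write.
print(hex(ord("a")), unicodedata.name("a"))

# Types #3 and #4: a combining mark versus a precomposed letter.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# As code point sequences the two spellings differ, but normalization
# converts between them.
print(precomposed == decomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)

# Type #5: a precomposed Hangul syllable decomposes into conjoining jamo.
hangul = "\ud55c"
jamo = unicodedata.normalize("NFD", hangul)
print([f"U+{ord(c):04X}" for c in jamo])
```

This is what "could be avoided by using the decomposed forms everywhere" looks like in practice: NFD gives the decomposed spelling, NFC the precomposed one.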
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2015-02-27 10:09 +1100 |
| Message-ID | <54efa7b6$0$12994$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #86523 |
Chris Angelico wrote:

> Unicode
> isn't about taking everyone's separate character sets and numbering
> them all so we can reference characters from anywhere; if you wanted
> that, you'd be much better off with something that lets you specify a
> code page in 16 bits and a character in 8, which is roughly the same
> size as Unicode anyway.

Well, except for the approximately 25% of people in the world whose native language has more than 256 characters.

It sounds like you are referring to some sort of "shift code" system. Some legacy East Asian encodings use a similar scheme, and depending on how they are implemented they have great disadvantages. For example, Shift-JIS suffers from a number of weaknesses including that a single byte corrupted in transmission can cause large swaths of the following text to be corrupted. With Unicode, a single corrupted byte can only corrupt a single code point.

-- 
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-02-27 10:23 +1100 |
| Message-ID | <mailman.19295.1424993013.18130.python-list@python.org> |
| In reply to | #86553 |
On Fri, Feb 27, 2015 at 10:09 AM, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> Chris Angelico wrote:
>
>> Unicode
>> isn't about taking everyone's separate character sets and numbering
>> them all so we can reference characters from anywhere; if you wanted
>> that, you'd be much better off with something that lets you specify a
>> code page in 16 bits and a character in 8, which is roughly the same
>> size as Unicode anyway.
>
> Well, except for the approximately 25% of people in the world whose native
> language has more than 256 characters.

You could always allocate multiple code pages to one language. But since I'm not advocating this system, I'm only guessing at solutions to its problems.

> It sounds like you are referring to some sort of "shift code" system. Some
> legacy East Asian encodings use a similar scheme, and depending on how they
> are implemented they have great disadvantages. For example, Shift-JIS
> suffers from a number of weaknesses including that a single byte corrupted
> in transmission can cause large swaths of the following text to be
> corrupted. With Unicode, a single corrupted byte can only corrupt a single
> code point.

That's exactly what I was hinting at. There are plenty of systems like that, and they are badly flawed compared to a simple universal system for a number of reasons. One is the corruption issue you mention; another is that a simple memory-based text search becomes utterly useless (to locate text in a document, you'd need to do a whole lot of stateful parsing - not to mention the difficulties of doing "similar-to" searches across languages); concatenation of text also becomes a stateful operation, and so do all sorts of other simple manipulations. Unicode may demand a bit more storage in certain circumstances (where an eight-bit encoding might have handled your entire document), but it's so much easier for the general case.

ChrisA
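The search problem can be shown with a rough sketch. It uses Shift-JIS as a stand-in for the legacy multi-byte schemes under discussion (the hypothetical code-page system in the post would have similar trouble), and the sample text is my own choice. In Shift-JIS the katakana ソ is encoded as the bytes 0x83 0x5C, and 0x5C is also the ASCII backslash, so a naive byte search reports a backslash that is not in the text.

```python
haystack = "ソフト"   # katakana text; it contains no backslash
needle = "\\"

sjis = haystack.encode("shift_jis")
utf8 = haystack.encode("utf-8")

# Searching the decoded text gives the right answer.
print(needle in haystack)

# Searching raw Shift-JIS bytes finds the trail byte of "ソ" instead.
print(sjis, needle.encode("shift_jis") in sjis)

# In UTF-8 every byte of a multi-byte character is >= 0x80, so an ASCII
# needle can never match inside one.
print(utf8, needle.encode("utf-8") in utf8)
```

Getting a correct answer from the Shift-JIS bytes means walking them from the start and tracking lead and trail bytes, which is the parsing overhead the post describes.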