Groups > comp.lang.python > #86311 > unrolled thread

Newbie question about text encoding

Started by	pierrick.brihaye@gmail.com
First post	2015-02-24 02:49 -0800
Last post	2015-02-27 10:23 +1100
Articles	20 on this page of 158 — 19 participants

#86367

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-02-25 12:19 +1100
Message-ID	<54ed232f$0$13004$c3e8da3$5496439d@news.astraweb.com>
In reply to	#86341

Laura Creighton wrote:

> Dave Angel
> are you another Native English speaker living in a world where ASCII
> is enough?

ASCII was never enough. Not even for Americans, who couldn't write things
like "I bought a comic book for 10¢ yesterday", let alone interesting
things from maths and science.

I missed the whole 7-bit ASCII period, my first computer (Mac 128K) already
had an extended character set beyond ASCII. But even that never covered the
full range of characters I wanted to write, and then there was the horrible
mess that you got whenever you copied text files from a Mac to a DOS or
Windows PC or vice versa. Yes, even in 1984 we were transferring files and
running into encoding issues.
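
A minimal sketch of the kind of Mac-to-DOS mess described above, assuming the Mac side used Mac Roman and the DOS side used code page 437 (both are assumptions; the post does not name the code pages). The variable names are made up for illustration:

```python
# Text as typed on the Mac, including the cent sign that ASCII lacks.
text = "I bought a comic book for 10¢ yesterday"

# Bytes as they would be stored under Mac Roman (assumed source encoding).
mac_bytes = text.encode("mac_roman")

# The same bytes interpreted under DOS code page 437 (assumed target).
as_seen_on_dos = mac_bytes.decode("cp437")
print(as_seen_on_dos)  # I bought a comic book for 10ó yesterday

# Decoding with the encoding that was actually used recovers the text.
round_trip = mac_bytes.decode("mac_roman")
print(round_trip == text)  # True
```

The bytes never change in transit; only the table used to interpret them does, which is why the cent sign silently turns into a different character.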

-- 
Steven

#86373

From	Marcos Almeida Azevedo <marcos.al.azevedo@gmail.com>
Date	2015-02-25 12:54 +0800
Message-ID	<mailman.19166.1424840069.18130.python-list@python.org>
In reply to	#86367

On Wed, Feb 25, 2015 at 9:19 AM, Steven D'Aprano <
steve+comp.lang.python@pearwood.info> wrote:

> Laura Creighton wrote:
>
> > Dave Angel
> > are you another Native English speaker living in a world where ASCII
> > is enough?
>
> ASCII was never enough. Not even for Americans, who couldn't write things
> like "I bought a comic book for 10¢ yesterday", let alone interesting
> things from maths and science.
>
>
ASCII was a necessity back then because RAM and storage were too small.


> I missed the whole 7-bit ASCII period, my first computer (Mac 128K) already
> had an extended character set beyond ASCII. But even that never covered the
>

I miss the days when I was coding with my XT computer (640kb RAM) too.
Things were so simple back then.


> full range of characters I wanted to write, and then there was the horrible
> mess that you got whenever you copied text files from a Mac to a DOS or
> Windows PC or vice versa. Yes, even in 1984 we were transferring files and
> running into encoding issues.
>
>
>
> --
> Steven
>



-- 
Marcos | I love PHP, Linux, and Java
<http://javadevnotes.com/java-integer-to-string-examples>

#86343

From	Dave Angel <davea@davea.name>
Date	2015-02-24 15:41 -0500
Message-ID	<mailman.19148.1424810518.18130.python-list@python.org>
In reply to	#86311

On 02/24/2015 02:57 PM, Laura Creighton wrote:
> Dave Angel
> are you another Native English speaker living in a world where ASCII
> is enough?

I'm a native English speaker, and 7 bits is not nearly enough.  Even if 
I didn't currently care, I have some history:

No.  CDC display code is enough. Who needs lowercase?

No.  Baudot code is enough.

No, EBCDIC is good enough.  Who cares about other companies.

No, the "golf-ball" only holds this many characters.  If we need more, 
we can just get the operator to switch balls in the middle of printing.

No. 2 digit years is enough.  This world won't last till the millennium 
anyway.

No.  2k is all the EPROM you can have.  Your code HAS to fit in it, and 
only 1.5k RAM.

No.  640k is more than anyone could need.

No, you cannot use a punch card made on a model 26 keypunch in the same 
deck as one made on a model 29.  Too bad, many of the codes are 
different.  (This one cost me travel back and forth between two 
different locations with different model keypunches)

No. 8 bits is as much as we could ever use for characters.  Who could 
possibly need names or locations outside of this region?  Or from 
multiple places within it?

35 years ago I helped design a serial terminal that "spoke" Chinese, 
using a two-byte encoding.  But a single worldwide standard didn't come 
until much later, and I cheered Unicode when it was finally unveiled.

I've worked with many printers that could only print 70 or 80 unique 
characters.  The laser printer, and even the matrix printer are 
relatively recent inventions.

Getting back on topic:

According to:
    http://support.esri.com/cn/knowledgebase/techarticles/detail/27345

"""ArcGIS Desktop applications, such as ArcMap, are Unicode based, so 
they support Unicode to a certain level. The level of Unicode support 
depends on the data format."""

That page was written about 2004, so there was concern even then.

And according to another, """In the header of each shapefile (.DBF), a 
reference to a code page is included."""
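
As a rough sketch of what that header reference looks like in practice: in the dBASE file layout, byte 29 of the 32-byte header is the "language driver ID". The mapping below is a small, partial table written from memory, and guess_dbf_codec is a hypothetical helper name, so treat both as assumptions and check them against your own files:

```python
# Offset of the language driver ID (LDID) within the 32-byte dBASE header.
LDID_OFFSET = 29

# Partial, illustrative mapping from LDID to Python codec names.
# Assumption: verify these values before relying on them.
LDID_TO_CODEC = {
    0x01: "cp437",   # US MS-DOS
    0x02: "cp850",   # International MS-DOS
    0x03: "cp1252",  # Windows ANSI
    0x57: "cp1252",  # ANSI
}


def guess_dbf_codec(header: bytes, default: str = "latin-1") -> str:
    """Return a codec name for the code page byte in a .dbf header."""
    if len(header) < 32:
        raise ValueError("a dBASE header is at least 32 bytes long")
    return LDID_TO_CODEC.get(header[LDID_OFFSET], default)


# Fake header built in memory so the sketch runs without a real file.
fake_header = bytearray(32)
fake_header[0] = 0x03            # dBASE III version byte
fake_header[LDID_OFFSET] = 0x57  # pretend the file declares ANSI
print(guess_dbf_codec(bytes(fake_header)))  # cp1252
```

With a real file you would read the first 32 bytes in binary mode and pass them in; an ID of 0 means the file declares nothing, which is where the default comes in.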

-- 
DaveA

#86495

From	Rustom Mody <rustompmody@gmail.com>
Date	2015-02-26 04:40 -0800
Message-ID	<ef520397-b1f0-47bf-8d24-585a9ba230e2@googlegroups.com>
In reply to	#86343

On Wednesday, February 25, 2015 at 2:12:09 AM UTC+5:30, Dave Angel wrote:
> On 02/24/2015 02:57 PM, Laura Creighton wrote:
> > Dave Angel
> > are you another Native English speaker living in a world where ASCII
> > is enough?
> 
> I'm a native English speaker, and 7 bits is not nearly enough.  Even if 
> I didn't currently care, I have some history:
> 
> No.  CDC display code is enough. Who needs lowercase?
> 
> No.  Baudot code is enough.
> 
> No, EBCDIC is good enough.  Who cares about other companies.
> 
> No, the "golf-ball" only holds this many characters.  If we need more, 
> we can just get the operator to switch balls in the middle of printing.
> 
> No. 2 digit years is enough.  This world won't last till the millennium 
> anyway.
> 
> No.  2k is all the EPROM you can have.  Your code HAS to fit in it, and 
> only 1.5k RAM.
> 
> No.  640k is more than anyone could need.
> 
> No, you cannot use a punch card made on a model 26 keypunch in the same 
> deck as one made on a model 29.  Too bad, many of the codes are 
> different.  (This one cost me travel back and forth between two 
> different locations with different model keypunches)
> 
> No. 8 bits is as much as we could ever use for characters.  Who could 
> possibly need names or locations outside of this region?  Or from 
> multiple places within it?
> 
> 35 years ago I helped design a serial terminal that "spoke" Chinese, 
> using a two-byte encoding.  But a single worldwide standard didn't come 
> until much later, and I cheered Unicode when it was finally unveiled.
> 
> I've worked with many printers that could only print 70 or 80 unique 
> characters.  The laser printer, and even the matrix printer are 
> relatively recent inventions.

Wrote something up on why we should stop using ASCII:
http://blog.languager.org/2015/02/universal-unicode.html

(Yeah the world is a bit larger than a small bunch of islands off a half-continent.
But this is not that discussion!)

#86498

From	Rustom Mody <rustompmody@gmail.com>
Date	2015-02-26 05:15 -0800
Message-ID	<7294b798-17a4-4743-901b-e189d37033e5@googlegroups.com>
In reply to	#86495

On Thursday, February 26, 2015 at 6:10:25 PM UTC+5:30, Rustom Mody wrote:
> On Wednesday, February 25, 2015 at 2:12:09 AM UTC+5:30, Dave Angel wrote:
> > On 02/24/2015 02:57 PM, Laura Creighton wrote:
> > > Dave Angel
> > > are you another Native English speaker living in a world where ASCII
> > > is enough?
> > 
> > I'm a native English speaker, and 7 bits is not nearly enough.  Even if 
> > I didn't currently care, I have some history:
> > 
> > No.  CDC display code is enough. Who needs lowercase?
> > 
> > No.  Baudot code is enough.
> > 
> > No, EBCDIC is good enough.  Who cares about other companies.
> > 
> > No, the "golf-ball" only holds this many characters.  If we need more, 
> > we can just get the operator to switch balls in the middle of printing.
> > 
> > No. 2 digit years is enough.  This world won't last till the millennium 
> > anyway.
> > 
> > No.  2k is all the EPROM you can have.  Your code HAS to fit in it, and 
> > only 1.5k RAM.
> > 
> > No.  640k is more than anyone could need.
> > 
> > No, you cannot use a punch card made on a model 26 keypunch in the same 
> > deck as one made on a model 29.  Too bad, many of the codes are 
> > different.  (This one cost me travel back and forth between two 
> > different locations with different model keypunches)
> > 
> > No. 8 bits is as much as we could ever use for characters.  Who could 
> > possibly need names or locations outside of this region?  Or from 
> > multiple places within it?
> > 
> > 35 years ago I helped design a serial terminal that "spoke" Chinese, 
> > using a two-byte encoding.  But a single worldwide standard didn't come 
> > until much later, and I cheered Unicode when it was finally unveiled.
> > 
> > I've worked with many printers that could only print 70 or 80 unique 
> > characters.  The laser printer, and even the matrix printer are 
> > relatively recent inventions.
> 
> Wrote something up on why we should stop using ASCII:
> http://blog.languager.org/2015/02/universal-unicode.html

Dave's list above of instances of 'poverty is a good idea' turning out stupid and narrow-minded in hindsight is neat.  Thought I'd ack that explicitly.

#86499

From	Chris Angelico <rosuav@gmail.com>
Date	2015-02-27 00:24 +1100
Message-ID	<mailman.19255.1424957046.18130.python-list@python.org>
In reply to	#86495

On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody <rustompmody@gmail.com> wrote:
> Wrote something up on why we should stop using ASCII:
> http://blog.languager.org/2015/02/universal-unicode.html

From that post:

"""
5.1 Gibberish

When going from the original 2-byte unicode (around version 3?) to the
one having supplemental planes, the unicode consortium added blocks
such as

* Egyptian hieroglyphs
* Cuneiform
* Shavian
* Deseret
* Mahjong
* Klingon

To me (a layman) it looks unprofessional – as though they are playing
games – that billions of computing devices, each having billions of
storage words should have their storage wasted on blocks such as
these.
"""

The shift from Unicode as a 16-bit code to having multiple planes came
in with Unicode 2.0, but the various blocks were assigned separately:
* Egyptian hieroglyphs: Unicode 5.2
* Cuneiform: Unicode 5.0
* Shavian: Unicode 4.0
* Deseret: Unicode 3.1
* Mahjong Tiles: Unicode 5.1
* Klingon: Not part of any current standard

However, I don't think historians will appreciate you calling all of
these "gibberish". To adequately describe and discuss old texts
without these Unicode blocks, we'd have to either do everything with
images, or craft some kind of reversible transliteration system and
have dedicated software to render the texts on screen. Instead, what
we have is a well-known and standardized system for transliterating
all of these into numbers (code points), and rendering them becomes a
simple matter of installing an appropriate font.

Also, how does assigning meanings to codepoints "waste storage"? As
soon as Unicode 2.0 hit and 16-bit code units stopped being
sufficient, everyone needed to allocate storage - either 32 bits per
character, or some other system - and the fact that some codepoints
were unassigned had absolutely no impact on that. This is decidedly
NOT unprofessional, and it's not wasteful either.
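
To make the storage point concrete, here is a small sketch in Python 3 (encoded_sizes is a made-up helper name). It shows that the bytes a string occupies depend on the encoding and on the characters actually present, not on how many other code points happen to be assigned:

```python
def encoded_sizes(text):
    """Return the encoded length in bytes of text under three encodings."""
    # The -le variants are used so that no byte order mark is counted.
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


ascii_text = "hello"
deseret = "\U00010400"  # DESERET CAPITAL LETTER LONG I, outside the BMP

print(encoded_sizes(ascii_text))
# {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
print(encoded_sizes(deseret))
# {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

The ASCII string costs the same number of bytes whether or not Deseret exists in the standard; the supplementary-plane character only costs anything when it is actually used.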

ChrisA

#86514

From	Sam Raker <sam.raker@gmail.com>
Date	2015-02-26 08:45 -0800
Message-ID	<0c1f4147-2d5d-4fa6-afb9-2275d878e2c1@googlegroups.com>
In reply to	#86499

I'm 100% in favor of expanding Unicode until the sun goes dark. Doing so helps solve the problems affecting speakers of "underserved" languages--access and language preservation. Speakers of Mongolian, Cherokee, Georgian, etc. all deserve to be able to interact with technology in their native languages as much as we speakers of ASCII-friendly languages do. Unicode support also makes writing papers on, dictionaries of, and new texts in such languages much easier, which helps the fight against language extinction, which is a sadly pressing issue.

Also, like, computers are big. Get an external drive for your high-resolution PDF collection of Medieval manuscripts if you feel like you're running out of space. A few extra codepoints aren't going to be the straw that breaks the camel's back.


On Thursday, February 26, 2015 at 8:24:34 AM UTC-5, Chris Angelico wrote:
> On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody <rustompmody@gmail.com> wrote:
> > Wrote something up on why we should stop using ASCII:
> > http://blog.languager.org/2015/02/universal-unicode.html
> 
> From that post:
> 
> """
> 5.1 Gibberish
> 
> When going from the original 2-byte unicode (around version 3?) to the
> one having supplemental planes, the unicode consortium added blocks
> such as
> 
> * Egyptian hieroglyphs
> * Cuneiform
> * Shavian
> * Deseret
> * Mahjong
> * Klingon
> 
> To me (a layman) it looks unprofessional - as though they are playing
> games - that billions of computing devices, each having billions of
> storage words should have their storage wasted on blocks such as
> these.
> """
> 
> The shift from Unicode as a 16-bit code to having multiple planes came
> in with Unicode 2.0, but the various blocks were assigned separately:
> * Egyptian hieroglyphs: Unicode 5.2
> * Cuneiform: Unicode 5.0
> * Shavian: Unicode 4.0
> * Deseret: Unicode 3.1
> * Mahjong Tiles: Unicode 5.1
> * Klingon: Not part of any current standard
> 
> However, I don't think historians will appreciate you calling all of
> these "gibberish". To adequately describe and discuss old texts
> without these Unicode blocks, we'd have to either do everything with
> images, or craft some kind of reversible transliteration system and
> have dedicated software to render the texts on screen. Instead, what
> we have is a well-known and standardized system for transliterating
> all of these into numbers (code points), and rendering them becomes a
> simple matter of installing an appropriate font.
> 
> Also, how does assigning meanings to codepoints "waste storage"? As
> soon as Unicode 2.0 hit and 16-bit code units stopped being
> sufficient, everyone needed to allocate storage - either 32 bits per
> character, or some other system - and the fact that some codepoints
> were unassigned had absolutely no impact on that. This is decidedly
> NOT unprofessional, and it's not wasteful either.
> 
> ChrisA


#86520

From	Rustom Mody <rustompmody@gmail.com>
Date	2015-02-26 09:08 -0800
Message-ID	<be5cfdba-14d6-4387-b959-4474f60d06c5@googlegroups.com>
In reply to	#86514

On Thursday, February 26, 2015 at 10:16:11 PM UTC+5:30, Sam Raker wrote:
> I'm 100% in favor of expanding Unicode until the sun goes dark. Doing so helps solve the problems affecting speakers of "underserved" languages--access and language preservation. Speakers of Mongolian, Cherokee, Georgian, etc. all deserve to be able to interact with technology in their native languages as much as we speakers of ASCII-friendly languages do. Unicode support also makes writing papers on, dictionaries of, and new texts in such languages much easier, which helps the fight against language extinction, which is a sadly pressing issue.

Agreed -- Correcting the inequities caused by ASCII-bias is a good thing.

In fact the whole point of my post was to say just that: by carving out and
focusing on a 'universal' subset of unicode that is considerably larger than
ASCII but smaller than unicode, we stand to reduce ASCII-bias.

As do other posts like
http://blog.languager.org/2014/04/unicoded-python.html
http://blog.languager.org/2014/05/unicode-in-haskell-source.html

However my example listed

> > * Egyptian hieroglyphs
> > * Cuneiform
> > * Shavian
> > * Deseret
> > * Mahjong
> > * Klingon

Ok, Chris has corrected me re. Klingon-in-unicode. So let's drop that.
Of the others, which do you think is in the 'underserved' category?

More generally which of http://en.wikipedia.org/wiki/Plane_%28Unicode%29#Supplementary_Multilingual_Plane
are underserved?


#86519

From	Terry Reedy <tjreedy@udel.edu>
Date	2015-02-26 12:02 -0500
Message-ID	<mailman.19274.1424970167.18130.python-list@python.org>
In reply to	#86495

On 2/26/2015 8:24 AM, Chris Angelico wrote:
> On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody <rustompmody@gmail.com> wrote:
>> Wrote something up on why we should stop using ASCII:
>> http://blog.languager.org/2015/02/universal-unicode.html

I think that the main point of the post, that many Unicode chars are 
truly planetary rather than just national/regional, is excellent.

>  From that post:
>
> """
> 5.1 Gibberish
>
> When going from the original 2-byte unicode (around version 3?) to the
> one having supplemental planes, the unicode consortium added blocks
> such as
>
> * Egyptian hieroglyphs
> * Cuneiform
> * Shavian
> * Deseret
> * Mahjong
> * Klingon
>
> To me (a layman) it looks unprofessional – as though they are playing
> games – that billions of computing devices, each having billions of
> storage words should have their storage wasted on blocks such as
> these.
> """
>
> The shift from Unicode as a 16-bit code to having multiple planes came
> in with Unicode 2.0, but the various blocks were assigned separately:
> * Egyptian hieroglyphs: Unicode 5.2
> * Cuneiform: Unicode 5.0
> * Shavian: Unicode 4.0
> * Deseret: Unicode 3.1
> * Mahjong Tiles: Unicode 5.1
> * Klingon: Not part of any current standard

You should add emoticons, but not call them or the above 'gibberish'.
I think that this part of your post is more 'unprofessional' than the 
character blocks.  It is very jarring and seems contrary to your main point.

> However, I don't think historians will appreciate you calling all of
> these "gibberish". To adequately describe and discuss old texts
> without these Unicode blocks, we'd have to either do everything with
> images, or craft some kind of reversible transliteration system and
> have dedicated software to render the texts on screen. Instead, what
> we have is a well-known and standardized system for transliterating
> all of these into numbers (code points), and rendering them becomes a
> simple matter of installing an appropriate font.
>
> Also, how does assigning meanings to codepoints "waste storage"? As
> soon as Unicode 2.0 hit and 16-bit code units stopped being
> sufficient, everyone needed to allocate storage - either 32 bits per
> character, or some other system - and the fact that some codepoints
> were unassigned had absolutely no impact on that. This is decidedly
> NOT unprofessional, and it's not wasteful either.

I agree.

-- 
Terry Jan Reedy


#86526

From	Rustom Mody <rustompmody@gmail.com>
Date	2015-02-26 09:59 -0800
Message-ID	<00fbd940-52f6-44e2-bf08-b9f35c12e73f@googlegroups.com>
In reply to	#86519

On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:
> On 2/26/2015 8:24 AM, Chris Angelico wrote:
> > On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote:
> >> Wrote something up on why we should stop using ASCII:
> >> http://blog.languager.org/2015/02/universal-unicode.html
> 
> I think that the main point of the post, that many Unicode chars are 
> truly planetary rather than just national/regional, is excellent.
> 
> >  From that post:
> >
> > """
> > 5.1 Gibberish
> >
> > When going from the original 2-byte unicode (around version 3?) to the
> > one having supplemental planes, the unicode consortium added blocks
> > such as
> >
> > * Egyptian hieroglyphs
> > * Cuneiform
> > * Shavian
> > * Deseret
> > * Mahjong
> > * Klingon
> >
> > To me (a layman) it looks unprofessional – as though they are playing
> > games – that billions of computing devices, each having billions of
> > storage words should have their storage wasted on blocks such as
> > these.
> > """
> >
> > The shift from Unicode as a 16-bit code to having multiple planes came
> > in with Unicode 2.0, but the various blocks were assigned separately:
> > * Egyptian hieroglyphs: Unicode 5.2
> > * Cuneiform: Unicode 5.0
> > * Shavian: Unicode 4.0
> > * Deseret: Unicode 3.1
> > * Mahjong Tiles: Unicode 5.1
> > * Klingon: Not part of any current standard
> 
> You should add emoticons, but not call them or the above 'gibberish'.


Emoticons (or is it emoji) seem to have some (regional?) takeup?? Dunno…
In any case I'd like to stay clear of political(izable) questions


> I think that this part of your post is more 'unprofessional' than the 
> character blocks.  It is very jarring and seems contrary to your main point.

Ok I need a word for
1. I have no need for this
2. 99.9% of the (living) on this planet also have no need for this

> 
> > However, I don't think historians will appreciate you calling all of
> > these "gibberish". To adequately describe and discuss old texts
> > without these Unicode blocks, we'd have to either do everything with
> > images, or craft some kind of reversible transliteration system and
> > have dedicated software to render the texts on screen. Instead, what
> > we have is a well-known and standardized system for transliterating
> > all of these into numbers (code points), and rendering them becomes a
> > simple matter of installing an appropriate font.
> >
> > Also, how does assigning meanings to codepoints "waste storage"? As
> > soon as Unicode 2.0 hit and 16-bit code units stopped being
> > sufficient, everyone needed to allocate storage - either 32 bits per
> > character, or some other system - and the fact that some codepoints
> > were unassigned had absolutely no impact on that. This is decidedly
> > NOT unprofessional, and it's not wasteful either.
> 
> I agree.

I clearly am more enthusiastic than knowledgeable about unicode.
But I know my basic CS well enough (as I am sure you and Chris also do)

So I don't get how 4 bytes is not more expensive than 2.
Yeah, I know you can squeeze a unicode char into 3 bytes or even 21 bits.
You could use a clever representation like UTF-8 or FSR.
But I don't see how you can get out of this that full-unicode costs more than
exclusive BMP.

eg consider the case of 32 vs 64 bit executables.
The 64 bit executable is generally larger than the 32 bit one
Now consider the case of a machine that has say 2GB RAM and a 64-bit processor.
You could -- I think -- make a reasonable case that all those all-zero hi-address-words are 'waste'.

And you've got the general sense best so far:
> I think that the main point of the post, that many Unicode chars are
> truly planetary rather than just national/regional, 

And the general tone/tenor of what I have written is probably not getting
across because of some words (like 'gibberish'?), so I'll try to reword.

However let me try and clarify that the whole of section 5 is 'iffy', with 5.1
being only more extreme.  I've not written these in because the point of that
post is not to criticise unicode but to highlight the universal(isable) parts.

Still if I were to expand on the criticisms here are some examples:

Math-Greek: Consider the math-alpha block
http://en.wikipedia.org/wiki/Mathematical_operators_and_symbols_in_Unicode#Mathematical_Alphanumeric_Symbols_block

Now imagine a beginning student not getting the difference between font, glyph,
character.  To me this block represents this same error cast into concrete and
dignified by the (supposed) authority of the unicode consortium.

There are probably dozens of other such stupidities like distinguishing kelvin K from latin K as if that is the business of the unicode consortium

My real reservations about unicode come from their work in areas that I happen to know something about

Music: To put music simply as a few mostly-meaningless 'dingbats' like ♩ ♪ ♫ is perhaps ok
However all this stuff http://xahlee.info/comp/unicode_music_symbols.html
makes no sense (to me) given that music (ie standard western music written in staff notation) is inherently 2 dimensional --  multi-voiced, multi-staff, chordal

Sanskrit/Devanagari:
Contains bogus letters that don't exist in devanagari
The letter ऄ (0904) is found here http://unicode.org/charts/PDF/U0900.pdf
But not here http://en.wikipedia.org/wiki/Devanagari#Vowels
So I call it bogus-devanagari

Contrariwise an important letter in vedic pronunciation the double-udatta is missing
http://list.indology.info/pipermail/indology_list.indology.info/2000-April/021070.html

All of which adds up to the impression that the unicode consortium occasionally fails to do due diligence

In any case all of the above is contrary to / irrelevant to my post, which is about
identifying the more universal parts of unicode


#86542

From	wxjmfauth@gmail.com
Date	2015-02-26 12:20 -0800
Message-ID	<5b002d84-3ad8-4a4a-8852-69ed93b45ff3@googlegroups.com>
In reply to	#86526

Le jeudi 26 février 2015 18:59:24 UTC+1, Rustom Mody a écrit :
> 
> ...To me this block represents this same error cast into concrete and
> dignified by the (supposed) authority of the unicode consortium.
> 

Unicode does not prescribe, it registers.

E.g. the inclusion of
U+1E9E, 'LATIN CAPITAL LETTER SHARP S',
was officially proposed by the "German
Federal Government".
(I have a pdf copy somewhere).


#86551

From	Chris Angelico <rosuav@gmail.com>
Date	2015-02-27 09:13 +1100
Message-ID	<mailman.19293.1424988840.18130.python-list@python.org>
In reply to	#86526

On Fri, Feb 27, 2015 at 4:59 AM, Rustom Mody <rustompmody@gmail.com> wrote:
> On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:
>> I think that this part of your post is more 'unprofessional' than the
>> character blocks.  It is very jarring and seems contrary to your main point.
>
> Ok I need a word for
> 1. I have no need for this
> 2. 99.9% of the (living) on this planet also have no need for this

So what, seven million people need it? Sounds pretty useful to me. And
your figure is an exaggeration; a lot more people than that use
emoji/emoticons.

>> > Also, how does assigning meanings to codepoints "waste storage"? As
>> > soon as Unicode 2.0 hit and 16-bit code units stopped being
>> > sufficient, everyone needed to allocate storage - either 32 bits per
>> > character, or some other system - and the fact that some codepoints
>> > were unassigned had absolutely no impact on that. This is decidedly
>> > NOT unprofessional, and it's not wasteful either.
>>
>> I agree.
>
> I clearly am more enthusiastic than knowledgeable about unicode.
> But I know my basic CS well enough (as I am sure you and Chris also do)
>
> So I dont get how 4 bytes is not more expensive than 2.
> Yeah I know you can squeeze a unicode char into 3 bytes or even 21 bits
> You could use a clever representation like UTF-8 or FSR.
> But I dont see how you can get out of this that full-unicode costs more than
> exclusive BMP.

Sure, UCS-2 is cheaper than the current Unicode spec. But Unicode 2.0
was when that changed, and the change was because 65536 characters
clearly wouldn't be enough - and that was due to the number of
characters needed for other things than those you're complaining
about. Every spec since then has not changed anything that affects
storage. There are still, today, quite a lot of unallocated blocks of
characters (we're really using only about two planes' worth so far,
maybe three), but even if Unicode specified just two planes of 64K
characters each, you wouldn't be able to save much on transmission
(UTF-8 is already flexible and uses only what you need; if a future
Unicode spec allows 64K planes, UTF-8 transmission will cost exactly
the same for all existing characters), and on an eight-bit-byte
system, the very best you'll be able to do is three bytes - which you
can do today, too; you already know 21 bits will do. So since the BMP
was proven insufficient (back in 1996), no subsequent changes have had
any costs in storage.
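
To put rough numbers on this, here is a small sketch, assuming Python 3 and its standard codecs. The three sample characters are my own choice for illustration. The little-endian codec names are used so that no byte order mark is counted:

```python
# One ASCII character, one BMP character, one supplementary-plane character.
SAMPLES = {
    "LATIN SMALL LETTER A (U+0061)": "\u0061",
    "DEVANAGARI LETTER A (U+0905)": "\u0905",
    "Egyptian hieroglyph (U+13000)": "\U00013000",
}

ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(char):
    """Return the number of bytes the character takes in each encoding."""
    return {enc: len(char.encode(enc)) for enc in ENCODINGS}


for label, char in SAMPLES.items():
    print(label, encoded_sizes(char))
```

The variable-width encodings only charge extra for the characters that actually need it, so text that stays in the BMP costs the same whether or not the other planes are populated.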

> Still if I were to expand on the criticisms here are some examples:
>
> Math-Greek: Consider the math-alpha block
> http://en.wikipedia.org/wiki/Mathematical_operators_and_symbols_in_Unicode#Mathematical_Alphanumeric_Symbols_block
>
> Now imagine a beginning student not getting the difference between font, glyph,
> character.  To me this block represents this same error cast into concrete and
> dignified by the (supposed) authority of the unicode consortium.
>
> There are probably dozens of other such stupidities like distinguishing kelvin K from latin K as if that is the business of the unicode consortium

A lot of these kinds of characters come from a need to unambiguously
transliterate text stored in other encodings. I don't personally
profess to understand the reasoning behind the various
indistinguishable characters, but I'm aware that there are a lot of
tricky questions to be decided; and once the Consortium decides to
allocate a character, that character must remain forever allocated.

> My real reservations about unicode come from their work in areas that I happen to know something about
>
> Music: To put music simply as a few mostly-meaningless 'dingbats' like ♩ ♪ ♫ is perhaps ok
> However all this stuff http://xahlee.info/comp/unicode_music_symbols.html
> makes no sense (to me) given that music (ie standard western music written in staff notation) is inherently 2 dimensional --  multi-voiced, multi-staff, chordal

The placement on the page is up to the display library. You can
produce a PDF that places the note symbols at their correct positions,
and requires no images to render sheet music.

> Sanskrit/Devanagari:
> Consists of bogus letters that dont exist in devanagari
> The letter ऄ (0904) is found here http://unicode.org/charts/PDF/U0900.pdf
> But not here http://en.wikipedia.org/wiki/Devanagari#Vowels
> So I call it bogus-devanagari
>
> Contrariwise an important letter in vedic pronunciation the double-udatta is missing
> http://list.indology.info/pipermail/indology_list.indology.info/2000-April/021070.html
>
> All of which adds up to the impression that the unicode consortium occasionally fails to do due diligence

Which proves that they're not perfect. Don't forget, they can always
add more characters later.

ChrisA


#86557

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-02-27 12:05 +1100
Message-ID	<54efc2c8$0$12986$c3e8da3$5496439d@news.astraweb.com>
In reply to	#86526

Rustom Mody wrote:

> Emoticons (or is it emoji) seems to have some (regional?) takeup?? Dunno…
> In any case I'd like to stay clear of political(izable) questions

Emoji is the term used in Japan, gradually spreading to the rest of the
world. Emoticons, I believe, should be restricted to the practice of using
ASCII-only digraphs and trigraphs such as :-) (colon, hyphen, right-parens)
to indicate "smileys".

I believe that emoji will eventually lead to Unicode's victory. People will
want smileys and piles of poo on their mobile phones, and from there it
will gradually spread to everywhere. All they need to do to make victory
inevitable is add cartoon genitals...

>> I think that this part of your post is more 'unprofessional' than the
>> character blocks.  It is very jarring and seems contrary to your main
>> point.
> 
> Ok I need a word for
> 1. I have no need for this
> 2. 99.9% of the (living) on this planet also have no need for this

0.1% of the living is seven million people. I'll tell you what, you tell me
which seven million people should be relegated to second-class status, and
I'll tell them where you live.

:-)

[...]
> I clearly am more enthusiastic than knowledgeable about unicode.
> But I know my basic CS well enough (as I am sure you and Chris also do)
> 
> So I dont get how 4 bytes is not more expensive than 2.

Obviously it is. But it's only twice as expensive, and in computer science
terms that counts as "close enough". It's quite common for data structures
to "waste" space by using "no more than twice as much space as needed",
e.g. Python dicts and lists.

The whole Unicode range U+0000 to U+10FFFF needs only 21 bits, which fits
into three bytes. Nevertheless, there's no three-byte UTF encoding, because
on modern hardware it is more efficient to "waste" an entire extra byte per
code point and deal with an even multiple of bytes.
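
As a rough illustration of the arithmetic, assuming Python 3: `pack3` and `unpack3` below are hypothetical helpers for a made-up three-bytes-per-code-point packing. They are not a real UTF and not something any standard defines; they are only there to show that 21 bits do fit in three bytes.

```python
MAX_CODE_POINT = 0x10FFFF

# Number of bits needed for the largest code point.
print(MAX_CODE_POINT.bit_length())


def pack3(text):
    """Pack each code point into exactly three big-endian bytes (hypothetical format)."""
    return b"".join(ord(ch).to_bytes(3, "big") for ch in text)


def unpack3(data):
    """Reverse pack3: read three bytes at a time and rebuild the string."""
    return "".join(
        chr(int.from_bytes(data[i:i + 3], "big"))
        for i in range(0, len(data), 3)
    )


sample = "a\u0905\U00013000"
packed = pack3(sample)
print(len(packed), len(sample.encode("utf-32-le")))
```

The packed form is 3 bytes per code point against 4 for UTF-32, at the price of unaligned access.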

> Yeah I know you can squeeze a unicode char into 3 bytes or even 21 bits
> You could use a clever representation like UTF-8 or FSR.
> But I dont see how you can get out of this that full-unicode costs more
> than exclusive BMP.

Are you missing a word there? Costs "no more" perhaps?

> eg consider the case of 32 vs 64 bit executables.
> The 64 bit executable is generally larger than the 32 bit one
> Now consider the case of a machine that has say 2GB RAM and a 64-bit
> processor. You could -- I think -- make a reasonable case that all those
> all-zero hi-address-words are 'waste'.

Sure. The whole point of 64-bit processors is to enable the use of more than
2GB of RAM. One might as well say that using 32-bit processors is wasteful
if you only have 64K of memory. Yes it is, but the only things which use
16-bit or 8-bit processors these days are embedded devices.

[...] 
> Math-Greek: Consider the math-alpha block
> http://en.wikipedia.org/wiki/Mathematical_operators_and_symbols_in_Unicode#Mathematical_Alphanumeric_Symbols_block
> 
> Now imagine a beginning student not getting the difference between font,
> glyph,
> character.  To me this block represents this same error cast into concrete 
> and dignified by the (supposed) authority of the unicode consortium.

Not being privy to the internal deliberations of the Consortium, it is
sometimes difficult to tell why two symbols are sometimes declared to be
mere different glyphs for the same character, and other times declared to
be worthy of being separate characters.

E.g. I think we should all agree that the English "A" and the French "A"
shouldn't count as separate characters, although the Greek "Α" and
Russian "А" do.

In the case of the maths symbols, it isn't obvious to me what the deciding
factors were. I know that one of the considerations they use is to consider
whether or not users of the symbols have a tradition of treating the
symbols as mere different glyphs, i.e. stylistic variations. In this case,
I'm pretty sure that mathematicians would *not* consider:

U+2115 DOUBLE-STRUCK CAPITAL N "ℕ"
U+004E LATIN CAPITAL LETTER N "N"

as mere stylistic variations. If you defined a matrix called ℕ, you would
probably be told off for using the wrong symbol, not for using the wrong
formatting.

On the other hand, I'm not so sure about 

U+210E PLANCK CONSTANT "ℎ"

versus a mere lowercase h (possibly in italic).
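
For anyone who wants to poke at these, a small sketch using the standard `unicodedata` module (assuming Python 3). The choice of NFKC as the "fold to plain letter" step is mine for illustration, not a claim about how the Consortium reasons:

```python
import unicodedata

CHARS = ("N", "\u2115", "h", "\u210e")


def describe(ch):
    """Return (code point label, character name, NFKC-folded form) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.normalize("NFKC", ch),
    )


for ch in CHARS:
    print(describe(ch))
```

Both ℕ and ℎ are distinct code points that compatibility normalization folds back to the plain Latin letter, while canonical normalization (NFC) leaves them alone.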

> There are probably dozens of other such stupidities like distinguishing
> kelvin K from latin K as if that is the business of the unicode consortium

But it *is* the business of the Unicode consortium. They have at least two
important aims:

- to be able to represent every possible human-language character;

- to allow lossless round-trip conversion to all existing legacy encodings
  (for the subset of Unicode handled by that encoding).

The second reason is why Unicode includes code points for degree-Celsius and
degree-Fahrenheit, rather than just using °C and °F like sane people.
Because some idiot^W code-page designer back in the 1980s or 90s decided to
add single character ℃ and ℉. So now Unicode has to be able to round-trip
(say) "°C℃" without loss.

I imagine that the same applies to U+212A KELVIN SIGN K.
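
A short sketch of what that looks like from Python 3, using only the standard `unicodedata` module. The variable names are mine:

```python
import unicodedata

text = "\u00b0C\u2103"  # DEGREE SIGN + "C", followed by the single DEGREE CELSIUS character

# Canonical normalization keeps the two spellings distinct.
nfc = unicodedata.normalize("NFC", text)

# Compatibility normalization folds DEGREE CELSIUS into DEGREE SIGN + "C".
nfkc = unicodedata.normalize("NFKC", text)

print(len(text), len(nfc), len(nfkc))

# KELVIN SIGN is canonically equivalent to LATIN CAPITAL LETTER K,
# so even NFC replaces it with the ordinary letter.
kelvin = "\u212a"
print(unicodedata.name(kelvin), repr(unicodedata.normalize("NFC", kelvin)))
```

So the legacy distinction survives as long as the text is left unnormalized (or, for ℃, NFC-normalized), and anything that wants to treat the variants alike can normalize.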

> My real reservations about unicode come from their work in areas that I
> happen to know something about
> 
> Music: To put music simply as a few mostly-meaningless 'dingbats' like ♩ ♪
> ♫ is perhaps ok However all this stuff
> http://xahlee.info/comp/unicode_music_symbols.html
> makes no sense (to me) given that music (ie standard western music written
> in staff notation) is inherently 2 dimensional --  multi-voiced,
> multi-staff, chordal

(1) Text can also be two dimensional.
(2) Where you put the symbol on the page is a separate question from whether
or not the symbol exists.

> Consists of bogus letters that dont exist in devanagari
> The letter ऄ (0904) is found here http://unicode.org/charts/PDF/U0900.pdf
> But not here http://en.wikipedia.org/wiki/Devanagari#Vowels
> So I call it bogus-devanagari

Hmm, well I love Wikipedia as much as the next guy, but I think that even
Jimmy Wales would suggest that Wikipedia is not a primary source for what
counts as Devanagari vowels. What makes you think that Wikipedia is right
and Unicode is wrong?

That's not to say that Unicode hasn't made some mistakes. There are a few
deprecated code points, or code points that have been given the wrong name.
Oops. Mistakes happen.

> Contrariwise an important letter in vedic pronunciation the double-udatta
> is missing
> http://list.indology.info/pipermail/indology_list.indology.info/2000-April/021070.html

I quote:

    I do not see any need for a "double udaatta". Perhaps "double 
    ANudaatta" is meant here?

I don't know Sanskrit, but if somebody suggested that Unicode doesn't
support English because the important letter "double-oh" (as
in "moon", "spoon", "croon" etc.) was missing, I wouldn't be terribly
impressed. We have a "double-u" letter, why not "double-oh"?

Another quote:

    I should strongly recommend not to hurry with a standardization
    proposal until the text collection of Vedic texts has been finished

In other words, even the experts in Vedic texts don't yet know all the
characters which they may or may not need.

-- 
Steven


#86559

From	Dave Angel <davea@davea.name>
Date	2015-02-26 20:57 -0500
Message-ID	<mailman.19299.1425002275.18130.python-list@python.org>
In reply to	#86557

On 02/26/2015 08:05 PM, Steven D'Aprano wrote:
> Rustom Mody wrote:
>

>
>> eg consider the case of 32 vs 64 bit executables.
>> The 64 bit executable is generally larger than the 32 bit one
>> Now consider the case of a machine that has say 2GB RAM and a 64-bit
>> processor. You could -- I think -- make a reasonable case that all those
>> all-zero hi-address-words are 'waste'.
>
> Sure. The whole point of 64-bit processors is to enable the use of more than
> 2GB of RAM. One might as well say that using 32-bit processors is wasteful
> if you only have 64K of memory. Yes it is, but the only things which use
> 16-bit or 8-bit processors these days are embedded devices.

But the 2gig means electrical address lines out of the CPU are wasted, 
not address space.  A 64 bit processor and 64bit OS means you can have 
more than 4gig in a process space, even if over half of it has to be in 
the swap file.  Linear versus physical makes a big difference.

(Although I believe Seymour Cray was quoted as saying that virtual 
memory is a crock, because "you can't fake what you ain't got.")




-- 
DaveA


#86562

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-02-27 16:58 +1100
Message-ID	<54f00787$0$12979$c3e8da3$5496439d@news.astraweb.com>
In reply to	#86559

Dave Angel wrote:

> (Although I believe Seymour Cray was quoted as saying that virtual
> memory is a crock, because "you can't fake what you ain't got.")

If I recall correctly, disk access is about 10000 times slower than RAM, so
virtual memory is *at least* that much slower than real memory.

-- 
Steven


#86564

From	Dave Angel <davea@davea.name>
Date	2015-02-27 02:30 -0500
Message-ID	<mailman.19302.1425022259.18130.python-list@python.org>
In reply to	#86562

On 02/27/2015 12:58 AM, Steven D'Aprano wrote:
> Dave Angel wrote:
>
>> (Although I believe Seymour Cray was quoted as saying that virtual
>> memory is a crock, because "you can't fake what you ain't got.")
>
> If I recall correctly, disk access is about 10000 times slower than RAM, so
> virtual memory is *at least* that much slower than real memory.
>

It's so much more complicated than that, that I hardly know where to 
start.  I'll describe a generic processor/OS/memory/disk architecture; 
there will be huge differences between processor models even from a 
single manufacturer.

First, as soon as you add swapping logic to your 
processor/memory-system, you theoretically slow it down.  And in the 
days of that quote, Cray's memory was maybe 50 times as fast as the 
memory used by us mortals.  So adding swapping logic would have slowed 
it down quite substantially, even when it was not swapping.  But that 
logic is inside the CPU chip these days, and presumably thoroughly 
optimized.

Next, statistically, a program uses a small subset of its total program 
& data space in its working set, and the working set should reside in 
real memory.  But when the program greatly increases that working set, 
and it approaches the amount of physical memory, then swapping becomes 
more frenzied, and we say the program is thrashing.  Simple example, try 
sorting an array that's about the size of available physical memory.

Next, even physical memory is divided into a few levels of caching, some 
on-chip and some off.  And the caching is done in what I call strips, 
where accessing just one byte causes the whole strip to be loaded from 
non-cached memory.  I forget the current size for that, but it's maybe 
64 to 256 bytes or so.

If there are multiple processors (not multicore, but actual separate 
processors), then each one has such internal caches, and any writes on 
one processor may have to trigger flushes of all the other processors 
that happen to have the same strip loaded.

The processor not only prefetches the next few instructions, but decodes 
and tentatively executes them, subject to being discarded if a 
conditional branch doesn't go the way the processor predicted.  So some 
instructions execute in zero time, some of the time.

Every address of instruction fetch, or of data fetch or store, goes 
through a couple of layers of translation.  Segment register plus offset 
gives linear address.  Look those up in tables to get the physical address,
and if the table happens not to be in on-chip cache, swap it in.  If
physical address isn't valid, a processor exception causes the OS to 
potentially swap something out, and something else in.

Once we're paging from the swapfile, the size of the read is perhaps 4k.
And that read happens regardless of whether we're only going to use one
byte or all of it.
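
As a rough illustration, assuming Python 3: `mmap.PAGESIZE` reports the page size on the machine running the code, and `pages_touched` is a hypothetical helper that counts how many whole pages an access spans:

```python
import mmap

# Page size reported for this machine; 4096 is common but not guaranteed.
PAGE_SIZE = mmap.PAGESIZE
print(PAGE_SIZE)


def pages_touched(offset, length, page_size=PAGE_SIZE):
    """Return how many pages the byte range [offset, offset + length) spans."""
    first_page = offset // page_size
    last_page = (offset + length - 1) // page_size
    return last_page - first_page + 1


# A single byte still costs one whole page.
print(pages_touched(0, 1))
# Two bytes straddling a page boundary cost two pages.
print(pages_touched(PAGE_SIZE - 1, 2))
```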

The ratio between an access which was in the L1 cache and one which 
required a page to be swapped in from disk?  Much bigger than your 
10,000 figure.  But hopefully it doesn't happen a big percentage of the 
time.

There are many, many other variables, like the fact that RAM chips are not
directly addressable by bytes, but instead count on rows and columns.
So if you access many bytes in the same row, it can be much quicker than
random access.  So simple access time specifications don't mean as much
as they would seem;  the controller has to balance the RAM spec with the
various cache requirements.
-- 
DaveA


#86571

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-02-27 22:54 +1100
Message-ID	<54f05aff$0$12980$c3e8da3$5496439d@news.astraweb.com>
In reply to	#86564

Dave Angel wrote:

> On 02/27/2015 12:58 AM, Steven D'Aprano wrote:
>> Dave Angel wrote:
>>
>>> (Although I believe Seymour Cray was quoted as saying that virtual
>>> memory is a crock, because "you can't fake what you ain't got.")
>>
>> If I recall correctly, disk access is about 10000 times slower than RAM,
>> so virtual memory is *at least* that much slower than real memory.
>>
> 
> It's so much more complicated than that, that I hardly know where to
> start.

[snip technical details]

As interesting as they were, none of those details will make swap faster,
hence my comment that virtual memory is *at least* 10000 times slower than
RAM.




-- 
Steven


#86572

From	Dave Angel <davea@davea.name>
Date	2015-02-27 09:02 -0500
Message-ID	<mailman.19306.1425045769.18130.python-list@python.org>
In reply to	#86571

On 02/27/2015 06:54 AM, Steven D'Aprano wrote:
> Dave Angel wrote:
>
>> On 02/27/2015 12:58 AM, Steven D'Aprano wrote:
>>> Dave Angel wrote:
>>>
>>>> (Although I believe Seymour Cray was quoted as saying that virtual
>>>> memory is a crock, because "you can't fake what you ain't got.")
>>>
>>> If I recall correctly, disk access is about 10000 times slower than RAM,
>>> so virtual memory is *at least* that much slower than real memory.
>>>
>>
>> It's so much more complicated than that, that I hardly know where to
>> start.
>
> [snip technical details]
>
> As interesting as they were, none of those details will make swap faster,
> hence my comment that virtual memory is *at least* 10000 times slower than
> RAM.
>

The term "virtual memory" is used for many aspects of the modern memory 
architecture.  But I presume you're using it in the sense of "running in 
a swapfile" as opposed to running in physical RAM.

Yes, a page fault takes on the order of 10,000 times as long as an 
access to a location in L1 cache.  I suspect it's a lot smaller though 
if the swapfile is on an SSD drive.  The first byte is that slow.

But once the fault is resolved, the nearby bytes are in physical memory, 
and some of them are in L3, L2, and L1.  So you're not running in the 
swapfile any more.  And even when you run off the end of the page,
fetching the sequentially adjacent page from a hard disk is much faster.
And if the disk has well-designed buffering, faster yet.  The OS tries
pretty hard to keep the swapfile unfragmented.

The trick is to minimize the number of page faults, especially to random 
locations.  If you're getting lots of them, it's called thrashing.

There are tools to help with that.  To minimize page faults on code, 
linking with a good working-set-tuner can help, though I don't hear of 
people bothering these days.  To minimize page faults on data, choosing 
one's algorithm carefully can help.  For example, in scanning through a 
typical matrix, row order might be adjacent locations, while column 
order might be scattered.
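
To make the matrix example concrete, here is a small sketch that assumes a row-major layout (element (r, c) stored at index r * cols + c) and counts how often consecutive accesses land on a different page.  The function page_switches is invented for illustration, and a page switch is only a stand-in for locality, not a measured page fault:

```python
def page_switches(rows, cols, item_size, page_size, by_rows=True):
    """Count page changes between consecutive accesses to a row-major matrix.

    The first access counts as a switch.  This models locality only; real
    page faults also depend on how much of the matrix is already resident.
    """
    if by_rows:
        order = ((r, c) for r in range(rows) for c in range(cols))
    else:
        order = ((r, c) for c in range(cols) for r in range(rows))

    switches = 0
    prev_page = None
    for r, c in order:
        # Byte offset of element (r, c) in a row-major layout.
        page = ((r * cols + c) * item_size) // page_size
        if page != prev_page:
            switches += 1
            prev_page = page
    return switches
```

With rows that each fill exactly one page, scanning by rows changes page once per row, while scanning by columns changes page on every single access.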

Not really much different than reading a text file.  If you can arrange 
to process it a line at a time, rather than reading the whole file into 
memory, you generally minimize your round-trips to disk.  And if you 
need to randomly access it, it's quite likely more efficient to memory 
map it, in which case it temporarily becomes part of the swapfile system.
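
In Python, the two approaches might look roughly like this.  It is a sketch only: count_lines and byte_at are hypothetical helpers, and mapping an empty file raises ValueError, which is not handled here:

```python
import mmap


def count_lines(path):
    """Process a file a line at a time; only one line is held in memory."""
    count = 0
    with open(path, "rb") as f:
        for _line in f:
            count += 1
    return count


def byte_at(path, offset):
    """Random access through a memory map; the OS pages in what is touched."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Indexing a mmap object in Python 3 returns an int.
            return m[offset]
```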

-- 
DaveA


#86573

From	Chris Angelico <rosuav@gmail.com>
Date	2015-02-28 01:22 +1100
Message-ID	<mailman.19307.1425046945.18130.python-list@python.org>
In reply to	#86571

On Sat, Feb 28, 2015 at 1:02 AM, Dave Angel <davea@davea.name> wrote:
> The term "virtual memory" is used for many aspects of the modern memory
> architecture.  But I presume you're using it in the sense of "running in a
> swapfile" as opposed to running in physical RAM.

Given that this started with a quote about "you can't fake what you
ain't got", I would say that, yes, this refers to using hard disk to
provide more RAM.

If you're trying to use the pagefile/swapfile as if it's more memory
("I have 256MB of memory, but 10GB of swap space, so that's 10GB of
memory!"), then yes, these performance considerations are huge. But
suppose you need to run a program that's larger than your available
RAM. On MS-DOS, sometimes you'd need to work with program overlays (a
concept borrowed from older systems, but ones that I never worked on,
so I'm going back no further than DOS here). You get a *massive*
complexity hit the instant you start using them, whether your program
would have been able to fit into memory on some systems or not. Just
making it possible to have only part of your code in memory places
demands on your code that you, the programmer, have to think about.
With virtual memory, though, you just write your code as if it's all
in memory, and some of it may, at some times, be on disk. Less code to
debug = less time spent debugging. The performance question is largely
immaterial (you'll be using the disk either way), but the savings on
complexity are tremendous. And then when you do find yourself running
on a system with enough RAM? No code changes needed, and full
performance. That's where virtual memory shines.

It's funny how the world changes, though. Back in the 90s, virtual
memory was the key. No home computer ever had enough RAM. Today? A
home-grade PC could easily have 16GB... and chances are you don't need
all of that. So we go for the opposite optimization: disk caching.
Apart from when I rebuild my "Audio-Only Frozen" project [1] and the
caches get completely blasted through, heaps and heaps of my work can
be done inside the disk cache. Hey, Sikorsky, got any files anywhere
on the hard disk matching *Pastel*.iso case insensitively? *chug chug
chug* Nope. Okay. Sikorsky, got any files matching *Pas5*.iso case
insensitively? *zip* Yeah, here it is. I didn't tell the first search
to hold all that file system data in memory; the hard drive controller
managed it all for me, and I got the performance benefit. Same as the
above: the main benefit is that this sort of thing requires zero
application code complexity. It's all done in a perfectly generic way
at a lower level.

ChrisA


#86575

From	alister <alister.nospam.ware@ntlworld.com>
Date	2015-02-27 16:00 +0000
Message-ID	<mcq4a9$29g$1@speranza.aioe.org>
In reply to	#86573

On Sat, 28 Feb 2015 01:22:15 +1100, Chris Angelico wrote:

> 
> If you're trying to use the pagefile/swapfile as if it's more memory ("I
> have 256MB of memory, but 10GB of swap space, so that's 10GB of
> memory!"), then yes, these performance considerations are huge. But
> suppose you need to run a program that's larger than your available RAM.
> On MS-DOS, sometimes you'd need to work with program overlays (a concept
> borrowed from older systems, but ones that I never worked on, so I'm
> going back no further than DOS here). You get a *massive* complexity hit
> the instant you start using them, whether your program would have been
> able to fit into memory on some systems or not. Just making it possible
> to have only part of your code in memory places demands on your code
> that you, the programmer, have to think about. With virtual memory,
> though, you just write your code as if it's all in memory, and some of
> it may, at some times, be on disk. Less code to debug = less time spent
> debugging. The performance question is largely immaterial (you'll be
> using the disk either way), but the savings on complexity are
> tremendous. And then when you do find yourself running on a system with
> enough RAM? No code changes needed, and full performance. That's where
> virtual memory shines.
> ChrisA

I think there is a case for bringing back the overlay file, or at least
loading larger programs in sections.
Only loading the routines as they are required could speed up the start
time of many large applications.
For example, in LibreOffice I rarely need the mail merge function, the word
count and many other features that could be added into the running
application on demand rather than all at once.

Obviously, with large memory and virtual memory there is no need to
un-install them once loaded.



-- 
Ralph's Observation:
	It is a mistake to let any mechanical object realise that you
	are in a hurry.

