

Groups > comp.lang.python > #20242 > unrolled thread

Re: Python usage numbers

Started by: Chris Angelico <rosuav@gmail.com>
First post: 2012-02-12 12:28 +1100
Last post: 2012-02-15 11:56 +0200
Articles: 20 on this page of 109; 31 participants


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.


Contents

  Re: Python usage numbers Chris Angelico <rosuav@gmail.com> - 2012-02-12 12:28 +1100
    Re: Python usage numbers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-02-12 02:23 +0000
      Re: Python usage numbers Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-11 18:36 -0800
        Re: Python usage numbers Chris Angelico <rosuav@gmail.com> - 2012-02-12 15:38 +1100
          Re: Python usage numbers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-02-12 05:51 +0000
            Re: Python usage numbers Chris Angelico <rosuav@gmail.com> - 2012-02-12 17:08 +1100
            Re: Python usage numbers Roy Smith <roy@panix.com> - 2012-02-12 10:48 -0500
              Re: Python usage numbers Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-02-12 11:47 -0500
                Re: Python usage numbers Roy Smith <roy@panix.com> - 2012-02-12 12:11 -0500
                  Re: Python usage numbers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-02-12 22:49 +0000
            Re: Python usage numbers Dan Sommers <dan@tombstonezero.net> - 2012-02-12 15:55 +0000
            Re: Python usage numbers rusi <rustompmody@gmail.com> - 2012-02-12 08:50 -0800
              Re: Python usage numbers Roy Smith <roy@panix.com> - 2012-02-12 12:21 -0500
              Re: Python usage numbers Nick Dokos <nicholas.dokos@hp.com> - 2012-02-12 12:36 -0500
                entering unicode  (was Python usage numbers) rusi <rustompmody@gmail.com> - 2012-02-12 19:09 -0800
                  Re: entering unicode  (was Python usage numbers) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-02-19 03:44 +0000
                    Re: entering unicode (was Python usage numbers) rusi <rustompmody@gmail.com> - 2012-02-19 00:52 -0800
              How do you Unicode proponents type your non-ASCII characters? (was: Python usage numbers) Ben Finney <ben+python@benfinney.id.au> - 2012-02-13 09:43 +1100
              Re: Python usage numbers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-02-12 22:56 +0000
          Re: Python usage numbers Roy Smith <roy@panix.com> - 2012-02-12 10:13 -0500
            Re: Python usage numbers Terry Reedy <tjreedy@udel.edu> - 2012-02-12 17:07 -0500
              Re: Python usage numbers Roy Smith <roy@panix.com> - 2012-02-12 17:22 -0500
            Re: Python usage numbers Chris Angelico <rosuav@gmail.com> - 2012-02-13 09:14 +1100
              Re: Python usage numbers Roy Smith <roy@panix.com> - 2012-02-12 17:27 -0500
                Re: Python usage numbers Dave Angel <davea@dejaviewphoto.com> - 2012-02-12 17:40 -0500
                Re: Python usage numbers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-02-12 23:29 +0000
                  Re: Python usage numbers Roy Smith <roy@panix.com> - 2012-02-12 18:41 -0500
                  Re: Python usage numbers Dave Angel <d@davea.name> - 2012-02-12 19:03 -0500
                  Re: Python usage numbers Chris Angelico <rosuav@gmail.com> - 2012-02-13 11:59 +1100
                    Re: Python usage numbers Roy Smith <roy@panix.com> - 2012-02-12 20:11 -0500
            Re: Python usage numbers Christian Heimes <lists@cheimes.de> - 2012-02-13 01:00 +0100
            Re: Python usage numbers Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-02-12 21:37 -0500
            Re: Python usage numbers Terry Reedy <tjreedy@udel.edu> - 2012-02-12 22:09 -0500
              Re: Python usage numbers Roy Smith <roy@panix.com> - 2012-02-12 22:57 -0500
                Re: Python usage numbers Ben Finney <ben+python@benfinney.id.au> - 2012-02-13 15:19 +1100
                  Re: Python usage numbers Andrew Berg <bahamutzero8825@gmail.com> - 2012-02-13 12:26 -0600
              Re: Python usage numbers jmfauth <wxjmfauth@gmail.com> - 2012-02-14 00:00 -0800
        Re: Python usage numbers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-02-12 06:10 +0000
          Re: Python usage numbers Andrew Berg <bahamutzero8825@gmail.com> - 2012-02-12 01:05 -0600
            Re: Python usage numbers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-02-12 09:12 +0000
              Re: Python usage numbers Andrew Berg <bahamutzero8825@gmail.com> - 2012-02-12 05:11 -0600
                Re: Python usage numbers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-02-12 22:30 +0000
                  Re: Python usage numbers Dave Angel <d@davea.name> - 2012-02-12 17:50 -0500
              Re: Python usage numbers Peter Pearson <ppearson@nowhere.invalid> - 2012-02-12 17:58 +0000
          Re: Python usage numbers Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-12 20:48 -0800
            Re: Python usage numbers Chris Angelico <rosuav@gmail.com> - 2012-02-13 16:03 +1100
            OT: Entitlements [was Re: Python usage numbers] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-02-13 08:05 +0000
              Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-13 08:01 -0800
                Re: OT: Entitlements [was Re: Python usage numbers] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-02-13 16:12 +0000
                  Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-13 08:27 -0800
                Re: OT: Entitlements [was Re: Python usage numbers] Ian Kelly <ian.g.kelly@gmail.com> - 2012-02-13 11:38 -0700
                  Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-13 13:01 -0800
                    Re: OT: Entitlements [was Re: Python usage numbers] Chris Angelico <rosuav@gmail.com> - 2012-02-14 08:27 +1100
                    Re: OT: Entitlements [was Re: Python usage numbers] Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-02-13 21:46 +0000
                    Re: OT: Entitlements [was Re: Python usage numbers] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-02-14 00:19 +0000
                      Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-13 17:07 -0800
                    Re: OT: Entitlements [was Re: Python usage numbers] Ian Kelly <ian.g.kelly@gmail.com> - 2012-02-13 18:29 -0700
                      Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-17 17:13 -0800
                        Re: OT: Entitlements [was Re: Python usage numbers] Chris Angelico <rosuav@gmail.com> - 2012-02-18 13:13 +1100
                        Re: OT: Entitlements [was Re: Python usage numbers] Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-02-18 02:39 +0000
                        Re: OT: Entitlements [was Re: Python usage numbers] Ian Kelly <ian.g.kelly@gmail.com> - 2012-02-18 00:28 -0700
                          Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-18 07:02 -0800
                            Re: OT: Entitlements [was Re: Python usage numbers] Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-02-18 16:15 +0000
                              Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-18 10:34 -0800
                                Re: OT: Entitlements [was Re: Python usage numbers] random joe <pywin32@gmail.com> - 2012-02-18 10:49 -0800
                            Re: OT: Entitlements [was Re: Python usage numbers] Albert van der Horst <albert@spenarnc.xs4all.nl> - 2012-02-26 12:14 +0000
                        Re: OT: Entitlements [was Re: Python usage numbers] Terry Reedy <tjreedy@udel.edu> - 2012-02-18 04:16 -0500
                    Re: OT: Entitlements [was Re: Python usage numbers] John O'Hagan <research@johnohagan.com> - 2012-02-14 19:41 +1100
                      Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-14 16:21 -0800
                        Re: OT: Entitlements [was Re: Python usage numbers] Chris Angelico <rosuav@gmail.com> - 2012-02-15 11:44 +1100
                          Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-14 17:26 -0800
                            Re: OT: Entitlements [was Re: Python usage numbers] John O'Hagan <research@johnohagan.com> - 2012-02-15 19:56 +1100
                              Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-15 07:04 -0800
                                Re: OT: Entitlements [was Re: Python usage numbers] Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-02-15 15:18 +0000
                                  Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-15 08:27 -0800
                                    Re: OT: Entitlements [was Re: Python usage numbers] Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-02-15 17:16 +0000
                                Re: OT: Entitlements [was Re: Python usage numbers] Ian Kelly <ian.g.kelly@gmail.com> - 2012-02-15 09:46 -0700
                            Re: OT: Entitlements [was Re: Python usage numbers] Albert van der Horst <albert@spenarnc.xs4all.nl> - 2012-02-26 12:44 +0000
                              Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-26 12:35 -0800
                                Re: OT: Entitlements [was Re: Python usage numbers] Chris Angelico <rosuav@gmail.com> - 2012-02-27 07:50 +1100
                                  Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-26 14:32 -0800
                              Re: OT: Entitlements Ben Finney <ben+python@benfinney.id.au> - 2012-02-27 07:46 +1100
                Re: OT: Entitlements [was Re: Python usage numbers] Chris Angelico <rosuav@gmail.com> - 2012-02-14 07:47 +1100
                Re: OT: Entitlements [was Re: Python usage numbers] Michael Torrie <torriem@gmail.com> - 2012-02-13 14:46 -0700
                  Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-13 16:39 -0800
                    Re: OT: Entitlements [was Re: Python usage numbers] Michael Torrie <torriem@gmail.com> - 2012-02-13 18:36 -0700
                    Re: OT: Entitlements [was Re: Python usage numbers] Chris Angelico <rosuav@gmail.com> - 2012-02-14 12:37 +1100
                      Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-17 17:37 -0800
                Re: OT: Entitlements [was Re: Python usage numbers] Tim Wintle <tim.wintle@teamrubber.com> - 2012-02-13 16:41 +0000
                  Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-14 16:40 -0800
                    RE: OT: Entitlements [was Re: Python usage numbers] "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-02-17 20:09 +0000
                Re: OT: Entitlements [was Re: Python usage numbers] Duncan Booth <duncan.booth@invalid.invalid> - 2012-02-14 11:31 +0000
                  Re: OT: Entitlements [was Re: Python usage numbers] Devin Jeanpierre <jeanpierreda@gmail.com> - 2012-02-14 07:06 -0500
                  Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-14 16:48 -0800
                    Re: OT: Entitlements [was Re: Python usage numbers] Chris Angelico <rosuav@gmail.com> - 2012-02-15 12:32 +1100
                    Re: OT: Entitlements [was Re: Python usage numbers] Duncan Booth <duncan.booth@invalid.invalid> - 2012-02-15 09:47 +0000
                      Re: OT: Entitlements [was Re: Python usage numbers] Arnaud Delobelle <arnodel@gmail.com> - 2012-02-15 09:58 +0000
                        Re: OT: Entitlements [was Re: Python usage numbers] Duncan Booth <duncan.booth@invalid.invalid> - 2012-02-15 10:04 +0000
                          Kill files [was Re: OT: Entitlements [was Re: Python usage numbers]] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-02-15 10:27 +0000
                            Re: Kill files [was Re: OT: Entitlements [was Re: Python usage numbers]] Ethan Furman <ethan@stoneleaf.us> - 2012-02-15 11:29 -0800
                Re: OT: Entitlements [was Re: Python usage numbers] rusi <rustompmody@gmail.com> - 2012-02-14 04:56 -0800
                Re: OT: Entitlements [was Re: Python usage numbers] Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-02-14 09:37 -0500
      Re: Python usage numbers Matej Cepl <mcepl@redhat.com> - 2012-02-12 09:14 +0100
        Re: Python usage numbers Matej Cepl <mcepl@redhat.com> - 2012-02-12 09:26 +0100
          Re: Python usage numbers Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-02-12 12:11 +0000
            Re: Python usage numbers alister <alister.ware@ntlworld.com> - 2012-02-12 18:55 +0000
              Re: Python usage numbers jmfauth <wxjmfauth@gmail.com> - 2012-02-12 11:52 -0800
                French and IDLE on Windows  (was Re: Python usage numbers) Terry Reedy <tjreedy@udel.edu> - 2012-02-12 18:30 -0500
          Re: Python usage numbers Anssi Saari <as@sci.fi> - 2012-02-15 11:56 +0200



#20296

From: Terry Reedy <tjreedy@udel.edu>
Date: 2012-02-12 17:07 -0500
Message-ID: <mailman.5738.1329084478.27778.python-list@python.org>
In reply to: #20272

On 2/12/2012 10:13 AM, Roy Smith wrote:

> Exactly.<soapbox class="wise-old-geezer">.  ASCII was so successful
> at becoming a universal standard which lasted for decades,

I think you are overstating the universality and length. I used a 
machine in the 1970s with 60-bit words that could be interpreted as 10 
6-bit characters. IBM used EBCDIC at least into the 1980s. The UCLA 
machine I used had a translator for ascii terminals that connected by 
modems. I remember discussing the translation table with the man in 
charge of it. Dedicated wordprocessing machines of the 70s and 80s *had* 
to use something other than plain ascii, as it is inadequate for 
business text, as opposed to pure computation and labeled number tables. 
Whether they used extended ascii or something else, I have no idea.

Ascii was, however, as far as I know, the universal basis for the new 
personal computers starting about 1975, and most importantly, for the 
IBM PC. But even that actually used its version of extended ascii, as 
did each wordprocessing program.

> people who
> grew up with it don't realize there was once any other way.  Not just
> EBCDIC, but also SIXBIT, RAD-50, tilt/rotate, packed card records,
> and so on. Transcoding was a way of life, and if you didn't know what
> you were starting with and aiming for, it was hopeless.

But because of the limitation of ascii on a worldwide, as opposed to 
American basis, we ended up with 100-200 codings for almost as many 
character sets. This is because the idea of ascii was applied by each 
nation or language group individually to their local situation.

> Kind of like now where we are again with Unicode.</soapbox>

The situation before ascii is like where we ended up *before* unicode. 
Unicode aims to replace all those byte encodings and character sets with 
*one* byte encoding for *one* character set, which will be a great 
simplification. It is the idea of ascii applied on a global rather than 
local basis.

Let me repeat. Unicode and utf-8 is a solution to the mess, not the 
cause. Perhaps we should have a synonym for utf-8: escii, for Earthian 
Standard Code for Information Interchange.

-- 
Terry Jan Reedy



#20298

From: Roy Smith <roy@panix.com>
Date: 2012-02-12 17:22 -0500
Message-ID: <roy-F2390D.17223312022012@news.panix.com>
In reply to: #20296

In article <mailman.5738.1329084478.27778.python-list@python.org>,
 Terry Reedy <tjreedy@udel.edu> wrote:

> Let me repeat. Unicode and utf-8 is a solution to the mess, not the 
> cause. Perhaps we should have a synonym for utf-8: escii, for Earthian 
> Standard Code for Information Interchange.

I'm not arguing that Unicode is where we need to get to.  Just trying to 
give a little history.



#20297

From: Chris Angelico <rosuav@gmail.com>
Date: 2012-02-13 09:14 +1100
Message-ID: <mailman.5739.1329084873.27778.python-list@python.org>
In reply to: #20272

On Mon, Feb 13, 2012 at 9:07 AM, Terry Reedy <tjreedy@udel.edu> wrote:
> The situation before ascii is like where we ended up *before* unicode.
> Unicode aims to replace all those byte encodings and character sets with
> *one* byte encoding for *one* character set, which will be a great
> simplification. It is the idea of ascii applied on a global rather than
> local basis.

Unicode doesn't deal with byte encodings; UTF-8 is an encoding, but so
are UTF-16, UTF-32, and as many more as you could hope for. But
broadly yes, Unicode IS the solution.
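
The split between the character set and its encodings can be sketched in a few lines of Python 3. This is only an illustration: the sample string and the names `text` and `sizes` are arbitrary choices, not something from the thread. It encodes the same eight code points with three UTF codecs and checks that each one round-trips.

```python
# One abstract string: a sequence of 8 Unicode code points ("naive" with a
# diaeresis, a space, a euro sign and a digit), written with escapes so the
# source file's own encoding does not matter.
text = "na\u00efve \u20ac5"

# Three different byte serializations of the very same code points.
sizes = {}
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    data = text.encode(codec)
    # Decoding with the matching codec always recovers the original string.
    assert data.decode(codec) == text
    sizes[codec] = len(data)
    print(f"{codec:>9}: {len(data):2d} bytes  {data.hex()}")
```

The code points never change; only the number and layout of the bytes does.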

ChrisA



#20299

From: Roy Smith <roy@panix.com>
Date: 2012-02-12 17:27 -0500
Message-ID: <roy-E2270B.17273412022012@news.panix.com>
In reply to: #20297

In article <mailman.5739.1329084873.27778.python-list@python.org>,
 Chris Angelico <rosuav@gmail.com> wrote:

> On Mon, Feb 13, 2012 at 9:07 AM, Terry Reedy <tjreedy@udel.edu> wrote:
> > The situation before ascii is like where we ended up *before* unicode.
> > Unicode aims to replace all those byte encodings and character sets with
> > *one* byte encoding for *one* character set, which will be a great
> > simplification. It is the idea of ascii applied on a global rather than
> > local basis.
> 
> Unicode doesn't deal with byte encodings; UTF-8 is an encoding, but so
> are UTF-16, UTF-32, and as many more as you could hope for. But
> broadly yes, Unicode IS the solution.

I could hope for one and only one, but I know I'm just going to be 
disappointed.  The last project I worked on used UTF-8 in most places, 
but also used some C and Java libraries which were only available for 
UTF-16.  So it was transcoding hell all over the place.

Hopefully, we will eventually reach the point where storage is so cheap 
that nobody minds how inefficient UTF-32 is and we all just start using 
that.  Life will be a lot simpler then.  No more transcoding, a string 
will be just as many bytes as it is characters, and everybody will be happy 
again.



#20302

From: Dave Angel <davea@dejaviewphoto.com>
Date: 2012-02-12 17:40 -0500
Message-ID: <mailman.5740.1329086473.27778.python-list@python.org>
In reply to: #20299

On 02/12/2012 05:27 PM, Roy Smith wrote:
> In article <mailman.5739.1329084873.27778.python-list@python.org>,
>   Chris Angelico <rosuav@gmail.com> wrote:
>
>> On Mon, Feb 13, 2012 at 9:07 AM, Terry Reedy <tjreedy@udel.edu> wrote:
>>> The situation before ascii is like where we ended up *before* unicode.
>>> Unicode aims to replace all those byte encodings and character sets with
>>> *one* byte encoding for *one* character set, which will be a great
>>> simplification. It is the idea of ascii applied on a global rather than
>>> local basis.
>> Unicode doesn't deal with byte encodings; UTF-8 is an encoding, but so
>> are UTF-16, UTF-32, and as many more as you could hope for. But
>> broadly yes, Unicode IS the solution.
> I could hope for one and only one, but I know I'm just going to be
> disappointed.  The last project I worked on used UTF-8 in most places,
> but also used some C and Java libraries which were only available for
> UTF-16.  So it was transcoding hell all over the place.
>
> Hopefully, we will eventually reach the point where storage is so cheap
> that nobody minds how inefficient UTF-32 is and we all just start using
> that.  Life will be a lot simpler then.  No more transcoding, a string
> will be just as many bytes as it is characters, and everybody will be happy
> again.

Keep your in-memory character strings as Unicode, and only 
serialize (encode) them when they go to/from a device, or to/from 
anachronistic code.  Then the cost is realized at the point of the 
problem.  No different than when deciding how to serialize any other 
data type.  Do it only at the point of entry/exit of your program.

But as long as devices are addressed as bytes, or as anything smaller 
than 32-bit thingies, you will have encoding issues when writing to the 
device, and decoding issues when reading.  At the very least, you have 
big-endian/little-endian ways to encode that UCS-4 code point.
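
A rough Python 3 sketch of that advice follows. The function `shout` and its default encoding are hypothetical, chosen only for illustration. It decodes bytes once at the point of entry, works on `str` in between, and encodes once at the point of exit. The last two lines show that even a fixed-width UTF-32 code unit still has two possible byte orders.

```python
def shout(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Hypothetical boundary function: bytes in, bytes out, str in between."""
    text = raw.decode(encoding)      # bytes -> str at the point of entry
    result = text.upper()            # all real work happens on Unicode strings
    return result.encode(encoding)   # str -> bytes at the point of exit


# Fixed width does not remove the need for an encoding: the same code point
# U+0041 serializes differently depending on byte order.
big = "A".encode("utf-32-be")        # most significant byte first
little = "A".encode("utf-32-le")     # least significant byte first
```

Calling `shout("caf\u00e9".encode("utf-8"))` returns the UTF-8 bytes of the upper-cased word; nothing inside the function ever has to reason about byte sequences.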









#20310

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2012-02-12 23:29 +0000
Message-ID: <4f384b6e$0$29986$c3e8da3$5496439d@news.astraweb.com>
In reply to: #20299

On Sun, 12 Feb 2012 17:27:34 -0500, Roy Smith wrote:

> In article <mailman.5739.1329084873.27778.python-list@python.org>,
>  Chris Angelico <rosuav@gmail.com> wrote:
> 
>> On Mon, Feb 13, 2012 at 9:07 AM, Terry Reedy <tjreedy@udel.edu> wrote:
>> > The situation before ascii is like where we ended up *before*
>> > unicode. Unicode aims to replace all those byte encodings and
>> > character sets with *one* byte encoding for *one* character set,
>> > which will be a great simplification. It is the idea of ascii applied
>> > on a global rather than local basis.
>> 
>> Unicode doesn't deal with byte encodings; UTF-8 is an encoding, but so
>> are UTF-16, UTF-32, and as many more as you could hope for. But broadly
>> yes, Unicode IS the solution.
> 
> I could hope for one and only one, but I know I'm just going to be
> disappointed.  The last project I worked on used UTF-8 in most places,
> but also used some C and Java libraries which were only available for
> UTF-16.  So it was transcoding hell all over the place.

Um, surely the solution to that is to always call a simple wrapper 
function to the UTF-16 code to handle the transcoding? What do the Design 
Patterns people call it, a facade? No, an adapter. (I never remember the 
names...)

Instead of calling library.foo() which only outputs UTF-16, write a 
wrapper myfoo() which calls foo, captures its output and transcodes it to 
UTF-8. You have to do that once (per function), but now it works from 
everywhere, so long as you remember to always call myfoo instead of foo.
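
A minimal sketch of such an adapter in Python 3 might look like this. `foo` here is only a stand-in for the unnamed library call and is assumed to return UTF-16 bytes with a byte-order mark; the real library's interface is not given in the thread.

```python
def foo() -> bytes:
    """Stand-in for library.foo(): returns UTF-16 bytes, BOM included."""
    return "gr\u00fc\u00df dich".encode("utf-16")


def myfoo() -> bytes:
    """Adapter: call foo(), then transcode its output from UTF-16 to UTF-8."""
    utf16_bytes = foo()
    text = utf16_bytes.decode("utf-16")   # the BOM tells the codec the byte order
    return text.encode("utf-8")
```

The transcoding lives in exactly one place per wrapped function, at the price of having to maintain one wrapper per library call.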


> Hopefully, we will eventually reach the point where storage is so cheap
> that nobody minds how inefficient UTF-32 is and we all just start using
> that.  Life will be a lot simpler then.  No more transcoding, a string
> will be just as many bytes as it is characters, and everybody will be happy
> again.

I think you mean 4 times as many bytes as characters. Unless you have 32 
bit bytes :)


-- 
Steven



#20313

From: Roy Smith <roy@panix.com>
Date: 2012-02-12 18:41 -0500
Message-ID: <roy-F23C47.18412012022012@news.panix.com>
In reply to: #20310

In article <4f384b6e$0$29986$c3e8da3$5496439d@news.astraweb.com>,
 Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:

> > I could hope for one and only one, but I know I'm just going to be
> > disappointed.  The last project I worked on used UTF-8 in most places,
> > but also used some C and Java libraries which were only available for
> > UTF-16.  So it was transcoding hell all over the place.
> 
> Um, surely the solution to that is to always call a simple wrapper 
> function to the UTF-16 code to handle the transcoding? What do the Design 
> Patterns people call it, a facade? No, an adapter. (I never remember the 
> names...)

I am familiar with the concept.  It was ICU.  A very big library.  Lots 
of calls.  I don't remember the details, but I'm sure we wrote wrappers.  It 
was still a mess.

> > Hopefully, we will eventually reach the point where storage is so cheap
> > that nobody minds how inefficient UTF-32 is and we all just start using
> > that.  Life will be a lot simpler then.  No more transcoding, a string
> > will be just as many bytes as it is characters, and everybody will be happy
> > again.
> 
> I think you mean 4 times as many bytes as characters. Unless you have 32 
> bit bytes :)

Yes, exactly.



#20317

From: Dave Angel <d@davea.name>
Date: 2012-02-12 19:03 -0500
Message-ID: <mailman.5747.1329091472.27778.python-list@python.org>
In reply to: #20310

On 02/12/2012 06:29 PM, Steven D'Aprano wrote:
> On Sun, 12 Feb 2012 17:27:34 -0500, Roy Smith wrote:
>
>> <SNIP>
>> Hopefully, we will eventually reach the point where storage is so cheap
>> that nobody minds how inefficient UTF-32 is and we all just start using
>> that.  Life will be a lot simpler then.  No more transcoding, a string
>> will be just as many bytes as it is characters, and everybody will be happy
>> again.
> I think you mean 4 times as many bytes as characters. Unless you have 32
> bit bytes :)
>
>
Until you have 32 bit bytes, you'll continue to have encodings, even if 
only a couple of them.




-- 

DaveA



#20319

From: Chris Angelico <rosuav@gmail.com>
Date: 2012-02-13 11:59 +1100
Message-ID: <mailman.5750.1329094801.27778.python-list@python.org>
In reply to: #20310

On Mon, Feb 13, 2012 at 11:03 AM, Dave Angel <d@davea.name> wrote:
> On 02/12/2012 06:29 PM, Steven D'Aprano wrote:
>> I think you mean 4 times as many bytes as characters. Unless you have 32
>> bit bytes :)
>>
>>
> Until you have 32 bit bytes, you'll continue to have encodings, even if only
> a couple of them.

The advantage, though, is that you can always know how many bytes to
read for X characters. In ASCII, you allocate 80 bytes of storage and
you can store 80 characters. In UTF-8, if you want an 80-character
buffer, you can probably get away with allocating 240 bytes...
but maybe not. In UTF-32, it's easy - just allocate 320 bytes and you
know you can store them. Also, you know exactly where the 17th
character is; in UTF-8, you have to count. That's a huge advantage for
in-memory strings; but is it useful on disk, where (as likely as not)
you're actually looking for lines, which you still have to scan for?
I'm thinking not, so it makes sense to use a smaller disk image than
UTF-32 - fewer total bytes means fewer sectors to read/write, which
translates fairly directly into performance.
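
The buffer arithmetic can be checked with a short Python 3 sketch. It treats "character" as "code point" and ignores combining sequences; the helper `utf8_index` and the sample lines are made up for illustration. UTF-8 needs at most 4 bytes per code point, so 240 bytes holds any 80 code points from the Basic Multilingual Plane but not 80 code points from beyond it.

```python
def utf8_index(data: bytes, n: int) -> int:
    """Return the byte offset of the n-th (0-based) code point in UTF-8 data."""
    seen = -1
    for offset, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:    # not a continuation byte: a code point starts here
            seen += 1
            if seen == n:
                return offset
    raise IndexError(n)


# 80 code points mixing 1-, 2-, 3- and 4-byte UTF-8 sequences.
line = "a\u00e9\u20ac\U0001F600" * 20
utf8 = line.encode("utf-8")
utf32 = line.encode("utf-32-le")

# Worst case for an 80-code-point buffer: every code point outside the BMP.
worst = ("\U0001F600" * 80).encode("utf-8")

# The 17th code point (index 16): a scan in UTF-8, plain arithmetic in UTF-32.
offset_utf8 = utf8_index(utf8, 16)
offset_utf32 = 16 * 4
```

The UTF-32 offset is a multiplication; the UTF-8 offset requires walking the bytes from the start of the buffer.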

ChrisA



#20320

From: Roy Smith <roy@panix.com>
Date: 2012-02-12 20:11 -0500
Message-ID: <roy-F2E8F4.20110312022012@news.panix.com>
In reply to: #20319

In article <mailman.5750.1329094801.27778.python-list@python.org>,
 Chris Angelico <rosuav@gmail.com> wrote:

> The advantage, though, is that you can always know how many bytes to
> read for X characters. In ASCII, you allocate 80 bytes of storage and
> you can store 80 characters. In UTF-8, if you want an 80-character
> buffer, you can probably get away with allocating 240 bytes...
> but maybe not. In UTF-32, it's easy - just allocate 320 bytes and you
> know you can store them. Also, you know exactly where the 17th
> character is; in UTF-8, you have to count. That's a huge advantage for
> in-memory strings; but is it useful on disk, where (as likely as not)
> you're actually looking for lines, which you still have to scan for?
> I'm thinking not, so it makes sense to use a smaller disk image than
> UTF-32 - fewer total bytes means fewer sectors to read/write, which
> translates fairly directly into performance.

You might just write files compressed.  My guess is that a typical 
gzipped UTF-32 text file will be smaller than the same data stored as 
uncompressed UTF-8.
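
That guess is easy to test with the standard `gzip` module. The sketch below uses a deliberately repetitive ASCII sample, which flatters compression, so it shows how to measure rather than proving the general claim; real text should be tried before drawing conclusions. The names `sample` and `sizes` are arbitrary.

```python
import gzip

# Highly repetitive sample text; real-world text will compress less well.
sample = "The quick brown fox jumps over the lazy dog.\n" * 200

utf8_raw = sample.encode("utf-8")
utf32_raw = sample.encode("utf-32-le")

sizes = {
    "utf-8 raw": len(utf8_raw),
    "utf-8 gzip": len(gzip.compress(utf8_raw)),
    "utf-32 raw": len(utf32_raw),
    "utf-32 gzip": len(gzip.compress(utf32_raw)),
}
for label, size in sizes.items():
    print(f"{label:>12}: {size} bytes")
```

It is worth comparing the gzipped UTF-8 figure as well, since compression is available to either encoding.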



#20315

From: Christian Heimes <lists@cheimes.de>
Date: 2012-02-13 01:00 +0100
Message-ID: <mailman.5746.1329091227.27778.python-list@python.org>
In reply to: #20272

On 12.02.2012 23:07, Terry Reedy wrote:
> But because of the limitation of ascii on a worldwide, as opposed to
> American basis, we ended up with 100-200 codings for almost as many
> character sets. This is because the idea of ascii was applied by each
> nation or language group individually to their local situation.

You really learn to appreciate unicode when you have to deal with mixed
languages in texts and old databases from the '70s and '80s.

I'm working with books that contain medieval German, old German, modern
German, English, French, Latin, Hebrew, Arabic, ancient and modern
Greek, Rhaeto-Romanic, East European and more languages. Sometimes three
or four languages are used in a single book. Some books are more than
700 years old and contain glyphs that aren't covered by unicode yet.
Without unicode it would be virtually impossible to deal with it.

Metadata for these books come from old and proprietary databases and are
stored in a format that is optimized for magnetic tape. Most people will
never have heard about ISO-5426 or ANSEL encoding or about file formats
like MAB2, MARC or PICA. It took me quite some time to develop codecs to
encode and decode old and partly undocumented variable-length multibyte
encodings that predate UTF-8 by about a decade. Of course every system
interprets the undocumented parts slightly differently ...

Unicode and XML are bliss for metadata exchange and long term storage!



#20321

From: Dennis Lee Bieber <wlfraed@ix.netcom.com>
Date: 2012-02-12 21:37 -0500
Message-ID: <mailman.5751.1329100699.27778.python-list@python.org>
In reply to: #20272

On Sun, 12 Feb 2012 17:07:44 -0500, Terry Reedy <tjreedy@udel.edu>
wrote:

>I think you are overstating the universality and length. I used a 
>machine in the 1970s with 60-bit words that could be interpreted as 10 
>6-bit characters. IBM used EBCDIC at least into the 1980s. The UCLA 

	The Xerox Sigma series also used EBCDIC (probably not a surprise --
I believe the precursor company, Scientific Data Systems, was founded by
ex-IBM folk). One nice thing about EBCDIC was that, in hex, the
characters could be mapped quite easily with Hollerith cards -- the
lower nybble mapped to the card 0-9 rows, and the high nybble correlated
to the top card rows.

	Of course, the Sigma was a weird machine all by itself, what with
over 200 discrete hardware interrupt vectors, four-bank interleaved
memory, etc.
-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
        wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/



#20323

From: Terry Reedy <tjreedy@udel.edu>
Date: 2012-02-12 22:09 -0500
Message-ID: <mailman.5752.1329102603.27778.python-list@python.org>
In reply to: #20272

On 2/12/2012 5:14 PM, Chris Angelico wrote:
> On Mon, Feb 13, 2012 at 9:07 AM, Terry Reedy <tjreedy@udel.edu> wrote:
>> The situation before ascii is like where we ended up *before* unicode.
>> Unicode aims to replace all those byte encodings and character sets with
>> *one* byte encoding for *one* character set, which will be a great
>> simplification. It is the idea of ascii applied on a global rather than
>> local basis.
>
> Unicode doesn't deal with byte encodings; UTF-8 is an encoding,

The Unicode Standard specifies 3 UTF storage formats* and 8 UTF 
byte-oriented transmission formats. UTF-8 is the most common of all 
encodings for web pages. (And ascii pages are utf-8 also.) It is the 
only one of the 8 that most of us need to bother with much. Look here for the 
list
http://www.unicode.org/glossary/#U
and for details look in various places in
http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf

> but so are UTF-16, UTF-32.
> and as many more as you could hope for.

All the non-UTF 'as many more as you could hope for' encodings are not 
part of Unicode.

* The new internal unicode scheme for 3.3 is pretty much a mixture of 
the 3 storage formats (I am, of course, skipping some details) by using 
the widest one needed for each string. The advantage is avoiding 
problems with each of the three. The disadvantage is greater internal 
complexity, but that should be hidden from users. They will not need to 
care about the internals. They will be able to forget about 'narrow' 
versus 'wide' builds and the possible requirement to code differently 
for each. There will only be one scheme that works the same on all 
platforms. Most apps should require less space and about the same time.
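
A rough sketch of what "the widest one needed for each string" looks like from Python code. It assumes CPython 3.3 or later; `sys.getsizeof` reports implementation details, so the numbers are illustrative rather than guaranteed, and `bytes_per_char` is a hypothetical helper.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character by comparing two string lengths.

    Subtracting the sizes cancels the fixed object header, leaving
    only the per-character cost for strings made of `ch`.
    """
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n

samples = {
    "ASCII": "a",
    "Latin-1": "\u00e9",        # e with acute accent
    "BMP": "\u0416",            # Cyrillic ZHE
    "astral": "\U0001F40D",     # outside the Basic Multilingual Plane
}

for label, ch in samples.items():
    # len() is 1 for every sample, with no narrow/wide build distinction
    print(label, len(ch), bytes_per_char(ch))
```

On such a build I would expect the per-character cost to come out as 1, 1, 2 and 4 bytes respectively, with `len()` reporting 1 for each sample.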

-- 
Terry Jan Reedy



#20326

From: Roy Smith <roy@panix.com>
Date: 2012-02-12 22:57 -0500
Message-ID: <roy-753AB0.22570112022012@news.panix.com>
In reply to: #20323

In article <mailman.5752.1329102603.27778.python-list@python.org>,
 Terry Reedy <tjreedy@udel.edu> wrote:

> On 2/12/2012 5:14 PM, Chris Angelico wrote:
> > On Mon, Feb 13, 2012 at 9:07 AM, Terry Reedy<tjreedy@udel.edu>  wrote:
> >> The situation before ascii is like where we ended up *before* unicode.
> >> Unicode aims to replace all those byte encoding and character sets with
> >> *one* byte encoding for *one* character set, which will be a great
> >> simplification. It is the idea of ascii applied on a global rather than
> >> local basis.
> >
> > Unicode doesn't deal with byte encodings; UTF-8 is an encoding,
> 
> The Unicode Standard specifies 3 UTF storage formats* and 8 UTF 
> byte-oriented transmission formats. UTF-8 is the most common of all 
> encodings for web pages. (And ascii pages are utf-8 also.) It is the 
> only one of the 8 most of us need to bother much with. Look here for the 
> list
> http://www.unicode.org/glossary/#U
> and for details look in various places in
> http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
> 
> > but so are UTF-16, UTF-32.
> > and as many more as you could hope for.
> 
> All the non-UTF 'as many more as you could hope for' encodings are not 
> part of Unicode.
> 
> * The new internal unicode scheme for 3.3 is pretty much a mixture of 
> the 3 storage formats (I am, of course, skipping some details) by using 
> the widest one needed for each string. The advantage is avoiding 
> problems with each of the three. The disadvantage is greater internal 
> complexity, but that should be hidden from users. They will not need to 
> care about the internals. They will be able to forget about 'narrow' 
> versus 'wide' builds and the possible requirement to code differently 
> for each. There will only be one scheme that works the same on all 
> platforms. Most apps should require less space and about the same time.

All that is just fine, but what the heck are we going to do about ascii 
art, that's what I want to know.  Python just won't be the same in UTF-8.



                    /^\/^\
                  _|__|  O|
         \/     /~     \_/ \
          \____|__________/  \
                 \_______      \
                         `\     \                 \
                           |     |                  \
                          /      /                    \
                         /     /                       \\
                       /      /                         \ \
                      /     /                            \  \
                    /     /             _----_            \   \
                   /     /           _-~      ~-_         |   |
                  (      (        _-~    _--_    ~-_     _/   |
                   \      ~-____-~    _-~    ~-_    ~-_-~    /
                     ~-_           _-~          ~-_       _-~   - jurcy -
                        ~--______-~                ~-___-~



#20329

From: Ben Finney <ben+python@benfinney.id.au>
Date: 2012-02-13 15:19 +1100
Message-ID: <87ty2v1i2m.fsf@benfinney.id.au>
In reply to: #20326

Roy Smith <roy@panix.com> writes:

> All that is just fine, but what the heck are we going to do about ascii 
> art, that's what I want to know.  Python just won't be the same in
> UTF-8.

If it helps, ASCII art *is* UTF-8 art. So it will be the same in UTF-8.

Or maybe you already knew that, and your sarcasm was lost with the high
bit.
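
A small sketch of that point, using a fragment of the art above as sample data: text made only of ASCII characters produces byte-for-byte identical output under the `ascii` and `utf-8` codecs.

```python
# ASCII-only text encodes to the same bytes in ASCII and UTF-8.
art_lines = [
    "   /^\\/^\\",
    " _|__|  O|",
]
art = "\n".join(art_lines)

ascii_bytes = art.encode("ascii")
utf8_bytes = art.encode("utf-8")

print(ascii_bytes == utf8_bytes)
# Every byte has the high bit clear, so nothing is "lost" with it.
print(all(b < 0x80 for b in utf8_bytes))
```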

-- 
 \     “We are all agreed that your theory is crazy. The question that |
  `\      divides us is whether it is crazy enough to have a chance of |
_o__)            being correct.” —Niels Bohr (to Wolfgang Pauli), 1958 |
Ben Finney



#20356

From: Andrew Berg <bahamutzero8825@gmail.com>
Date: 2012-02-13 12:26 -0600
Message-ID: <mailman.5768.1329157620.27778.python-list@python.org>
In reply to: #20329

On 2/12/2012 10:19 PM, Ben Finney wrote:
> If it helps, ASCII art *is* UTF-8 art. So it will be the same in UTF-8.
As will non-ASCII text art:

   /l、
 ゙(゚、 。 7
  l、゙ ~ヽ
  じしf_, )ノ

-- 
CPython 3.2.2 | Windows NT 6.1.7601.17640



#20389

From: jmfauth <wxjmfauth@gmail.com>
Date: 2012-02-14 00:00 -0800
Message-ID: <2076e822-6225-449d-8d93-bf8d1627e77a@l14g2000vbe.googlegroups.com>
In reply to: #20323

On 13 Feb, 04:09, Terry Reedy <tjre...@udel.edu> wrote:
>
>
> * The new internal unicode scheme for 3.3 is pretty much a mixture of
> the 3 storage formats (I am, of course, skipping some details) by using
> the widest one needed for each string. The advantage is avoiding
> problems with each of the three. The disadvantage is greater internal
> complexity, but that should be hidden from users. They will not need to
> care about the internals. They will be able to forget about 'narrow'
> versus 'wide' builds and the possible requirement to code differently
> for each. There will only be one scheme that works the same on all
> platforms. Most apps should require less space and about the same time.
>
> --


Python 2 was built for ascii users. Now, Python 3(.3) is
*optimized* for the ascii users.

And the rest of the crowd? Not so sure. French users
(among others) who cannot write their texts with
iso-8859-1/latin1 will be very happy.

No doubt it will work. Is this, however, the correct
approach?

jmf



#20252

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2012-02-12 06:10 +0000
Message-ID: <4f3757cc$0$29986$c3e8da3$5496439d@news.astraweb.com>
In reply to: #20245

On Sat, 11 Feb 2012 18:36:52 -0800, Rick Johnson wrote:

>> "I have a file containing text. I can open it in an editor and see it's
>> nearly all ASCII text, except for a few weird and bizarre characters
>> like £ © ± or ö. In Python 2, I can read that file fine. In Python 3 I
>> get an error. What should I do that requires no thought?"
>>
>> Obvious answers:
> 
> the most obvious answer would be to read the file WITHOUT worrying about
> asinine encoding.

Your mad leet reading comprehension skillz leave me in awe Rick.

If you try to read a file containing non-ASCII characters encoded using 
UTF8 on Windows without explicitly specifying either UTF8 as the 
encoding, or an error handler, you will get an exception.

It's not just UTF8 either, but nearly all encodings. You can't even 
expect to avoid problems if you stick to nothing but Windows, because 
Windows' default encoding is localised: a file generated in (say) Israel 
or Japan or Germany will use a different code page (encoding) by default 
than one generated in (say) the US, Canada or UK.
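
A hedged sketch of the failure modes, using a temporary file rather than a real Windows setup. The sample text and the choice of `cp1252` and `ascii` as the "wrong" codecs are assumptions for illustration; which codec a given machine picks by default depends on its locale and Python version.

```python
import os
import tempfile

# Mostly ASCII text with a few of the characters mentioned above.
text = "Price: \u00a3 5, \u00a9 2012, \u00b1 1, sch\u00f6n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Naming the right encoding round-trips cleanly.
    with open(path, encoding="utf-8") as f:
        right = f.read()

    # A wrong single-byte code page does not fail, it yields mojibake.
    with open(path, encoding="cp1252") as f:
        mojibake = f.read()

    # A strict codec that cannot represent the bytes raises instead.
    try:
        with open(path, encoding="ascii") as f:
            f.read()
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

    # An error handler trades the exception for replacement characters.
    with open(path, encoding="ascii", errors="replace") as f:
        lossy = f.read()

print(right == text, mojibake == text, strict_failed)
print(lossy)
```

Depending on the codec, guessing wrong either raises or silently corrupts the text, so passing `encoding=` explicitly is the safer habit.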



-- 
Steven



#20254

From: Andrew Berg <bahamutzero8825@gmail.com>
Date: 2012-02-12 01:05 -0600
Message-ID: <mailman.5718.1329030346.27778.python-list@python.org>
In reply to: #20252

On 2/12/2012 12:10 AM, Steven D'Aprano wrote:
> It's not just UTF8 either, but nearly all encodings. You can't even 
> expect to avoid problems if you stick to nothing but Windows, because 
> Windows' default encoding is localised: a file generated in (say) Israel 
> or Japan or Germany will use a different code page (encoding) by default 
> than one generated in (say) the US, Canada or UK.
Generated by what? Windows will store a locale value for programs to
use, but programs use Unicode internally by default (i.e., API calls are
Unicode unless they were built for old versions of Windows), and the
default filesystem (NTFS) uses Unicode for file names. AFAIK, only the
terminal has a localized code page by default.
Perhaps Notepad will write text files with the localized code page by
default, but that's an application choice...

-- 
CPython 3.2.2 | Windows NT 6.1.7601.17640



#20259

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2012-02-12 09:12 +0000
Message-ID: <4f378298$0$29986$c3e8da3$5496439d@news.astraweb.com>
In reply to: #20254

On Sun, 12 Feb 2012 01:05:35 -0600, Andrew Berg wrote:

> On 2/12/2012 12:10 AM, Steven D'Aprano wrote:
>> It's not just UTF8 either, but nearly all encodings. You can't even
>> expect to avoid problems if you stick to nothing but Windows, because
>> Windows' default encoding is localised: a file generated in (say)
>> Israel or Japan or Germany will use a different code page (encoding) by
>> default than one generated in (say) the US, Canada or UK.
> Generated by what? Windows will store a locale value for programs to
> use, but programs use Unicode internally by default

Which programs? And we're not talking about what they use internally, but 
what they write to files.


> (i.e., API calls are
> Unicode unless they were built for old versions of Windows), and the
> default filesystem (NTFS) uses Unicode for file names. 

No. File systems do not use Unicode for file names. Unicode is an 
abstract mapping between code points and characters. File systems are 
written using bytes.

Suppose you're a fan of Russian punk band Наӥв and you have a directory 
of their music. The file system doesn't store the Unicode code points 
1053 1072 1253 1074, it has to be encoded to a sequence of bytes first.

NTFS by default uses the UTF-16 encoding, which means the actual bytes 
written to disk are \x1d\x040\x04\xe5\x042\x04 (possibly with a leading 
byte-order mark \xff\xfe).

Windows has two separate APIs, one for "wide" characters, the other for 
single bytes. Depending on which one you use, the directory will appear 
to be called Наӥв or 0å2.
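
The arithmetic above can be reproduced in a few lines of Python. This only illustrates the encoding steps; it does not inspect NTFS or call any Windows API, and treating the bytes as Latin-1 is an assumption about what a single-byte view might display.

```python
# The band name from the post, written with escapes to avoid
# depending on the encoding of this source text.
name = "\u041d\u0430\u04e5\u0432"

# Abstract code points, as decimal integers.
code_points = [ord(c) for c in name]

# Little-endian UTF-16 without a byte-order mark.
utf16_le = name.encode("utf-16-le")

# Misread the same bytes one byte per character (assumed: Latin-1),
# then drop the control characters a display would not show.
as_latin1 = utf16_le.decode("latin-1")
visible = "".join(c for c in as_latin1 if c.isprintable())

print(code_points)
print(utf16_le)
print(repr(visible))
```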

But in any case, we're not talking about the file name encoding. We're 
talking about the contents of files. 


> AFAIK, only the
> terminal has a localized code page by default. Perhaps Notepad will
> write text files with the localized code page by default, but that's an
> application choice...

Exactly. And unless you know what encoding the application chooses, you 
will likely get an exception trying to read the file.


-- 
Steven


