Groups > comp.lang.python > #20242 > unrolled thread
| Started by | Chris Angelico <rosuav@gmail.com> |
|---|---|
| First post | 2012-02-12 12:28 +1100 |
| Last post | 2012-02-15 11:56 +0200 |
| Articles | 20 on this page of 109 — 31 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: Python usage numbers Chris Angelico <rosuav@gmail.com> - 2012-02-12 12:28 +1100
Re: Python usage numbers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-02-12 02:23 +0000
Re: Python usage numbers Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-11 18:36 -0800
Re: Python usage numbers Chris Angelico <rosuav@gmail.com> - 2012-02-12 15:38 +1100
Re: Python usage numbers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-02-12 05:51 +0000
Re: Python usage numbers Chris Angelico <rosuav@gmail.com> - 2012-02-12 17:08 +1100
Re: Python usage numbers Roy Smith <roy@panix.com> - 2012-02-12 10:48 -0500
Re: Python usage numbers Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-02-12 11:47 -0500
Re: Python usage numbers Roy Smith <roy@panix.com> - 2012-02-12 12:11 -0500
Re: Python usage numbers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-02-12 22:49 +0000
Re: Python usage numbers Dan Sommers <dan@tombstonezero.net> - 2012-02-12 15:55 +0000
Re: Python usage numbers rusi <rustompmody@gmail.com> - 2012-02-12 08:50 -0800
Re: Python usage numbers Roy Smith <roy@panix.com> - 2012-02-12 12:21 -0500
Re: Python usage numbers Nick Dokos <nicholas.dokos@hp.com> - 2012-02-12 12:36 -0500
entering unicode (was Python usage numbers) rusi <rustompmody@gmail.com> - 2012-02-12 19:09 -0800
Re: entering unicode (was Python usage numbers) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-02-19 03:44 +0000
Re: entering unicode (was Python usage numbers) rusi <rustompmody@gmail.com> - 2012-02-19 00:52 -0800
How do you Unicode proponents type your non-ASCII characters? (was: Python usage numbers) Ben Finney <ben+python@benfinney.id.au> - 2012-02-13 09:43 +1100
Re: Python usage numbers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-02-12 22:56 +0000
Re: Python usage numbers Roy Smith <roy@panix.com> - 2012-02-12 10:13 -0500
Re: Python usage numbers Terry Reedy <tjreedy@udel.edu> - 2012-02-12 17:07 -0500
Re: Python usage numbers Roy Smith <roy@panix.com> - 2012-02-12 17:22 -0500
Re: Python usage numbers Chris Angelico <rosuav@gmail.com> - 2012-02-13 09:14 +1100
Re: Python usage numbers Roy Smith <roy@panix.com> - 2012-02-12 17:27 -0500
Re: Python usage numbers Dave Angel <davea@dejaviewphoto.com> - 2012-02-12 17:40 -0500
Re: Python usage numbers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-02-12 23:29 +0000
Re: Python usage numbers Roy Smith <roy@panix.com> - 2012-02-12 18:41 -0500
Re: Python usage numbers Dave Angel <d@davea.name> - 2012-02-12 19:03 -0500
Re: Python usage numbers Chris Angelico <rosuav@gmail.com> - 2012-02-13 11:59 +1100
Re: Python usage numbers Roy Smith <roy@panix.com> - 2012-02-12 20:11 -0500
Re: Python usage numbers Christian Heimes <lists@cheimes.de> - 2012-02-13 01:00 +0100
Re: Python usage numbers Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-02-12 21:37 -0500
Re: Python usage numbers Terry Reedy <tjreedy@udel.edu> - 2012-02-12 22:09 -0500
Re: Python usage numbers Roy Smith <roy@panix.com> - 2012-02-12 22:57 -0500
Re: Python usage numbers Ben Finney <ben+python@benfinney.id.au> - 2012-02-13 15:19 +1100
Re: Python usage numbers Andrew Berg <bahamutzero8825@gmail.com> - 2012-02-13 12:26 -0600
Re: Python usage numbers jmfauth <wxjmfauth@gmail.com> - 2012-02-14 00:00 -0800
Re: Python usage numbers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-02-12 06:10 +0000
Re: Python usage numbers Andrew Berg <bahamutzero8825@gmail.com> - 2012-02-12 01:05 -0600
Re: Python usage numbers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-02-12 09:12 +0000
Re: Python usage numbers Andrew Berg <bahamutzero8825@gmail.com> - 2012-02-12 05:11 -0600
Re: Python usage numbers Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-02-12 22:30 +0000
Re: Python usage numbers Dave Angel <d@davea.name> - 2012-02-12 17:50 -0500
Re: Python usage numbers Peter Pearson <ppearson@nowhere.invalid> - 2012-02-12 17:58 +0000
Re: Python usage numbers Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-12 20:48 -0800
Re: Python usage numbers Chris Angelico <rosuav@gmail.com> - 2012-02-13 16:03 +1100
OT: Entitlements [was Re: Python usage numbers] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-02-13 08:05 +0000
Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-13 08:01 -0800
Re: OT: Entitlements [was Re: Python usage numbers] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-02-13 16:12 +0000
Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-13 08:27 -0800
Re: OT: Entitlements [was Re: Python usage numbers] Ian Kelly <ian.g.kelly@gmail.com> - 2012-02-13 11:38 -0700
Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-13 13:01 -0800
Re: OT: Entitlements [was Re: Python usage numbers] Chris Angelico <rosuav@gmail.com> - 2012-02-14 08:27 +1100
Re: OT: Entitlements [was Re: Python usage numbers] Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-02-13 21:46 +0000
Re: OT: Entitlements [was Re: Python usage numbers] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-02-14 00:19 +0000
Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-13 17:07 -0800
Re: OT: Entitlements [was Re: Python usage numbers] Ian Kelly <ian.g.kelly@gmail.com> - 2012-02-13 18:29 -0700
Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-17 17:13 -0800
Re: OT: Entitlements [was Re: Python usage numbers] Chris Angelico <rosuav@gmail.com> - 2012-02-18 13:13 +1100
Re: OT: Entitlements [was Re: Python usage numbers] Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-02-18 02:39 +0000
Re: OT: Entitlements [was Re: Python usage numbers] Ian Kelly <ian.g.kelly@gmail.com> - 2012-02-18 00:28 -0700
Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-18 07:02 -0800
Re: OT: Entitlements [was Re: Python usage numbers] Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-02-18 16:15 +0000
Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-18 10:34 -0800
Re: OT: Entitlements [was Re: Python usage numbers] random joe <pywin32@gmail.com> - 2012-02-18 10:49 -0800
Re: OT: Entitlements [was Re: Python usage numbers] Albert van der Horst <albert@spenarnc.xs4all.nl> - 2012-02-26 12:14 +0000
Re: OT: Entitlements [was Re: Python usage numbers] Terry Reedy <tjreedy@udel.edu> - 2012-02-18 04:16 -0500
Re: OT: Entitlements [was Re: Python usage numbers] John O'Hagan <research@johnohagan.com> - 2012-02-14 19:41 +1100
Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-14 16:21 -0800
Re: OT: Entitlements [was Re: Python usage numbers] Chris Angelico <rosuav@gmail.com> - 2012-02-15 11:44 +1100
Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-14 17:26 -0800
Re: OT: Entitlements [was Re: Python usage numbers] John O'Hagan <research@johnohagan.com> - 2012-02-15 19:56 +1100
Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-15 07:04 -0800
Re: OT: Entitlements [was Re: Python usage numbers] Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-02-15 15:18 +0000
Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-15 08:27 -0800
Re: OT: Entitlements [was Re: Python usage numbers] Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-02-15 17:16 +0000
Re: OT: Entitlements [was Re: Python usage numbers] Ian Kelly <ian.g.kelly@gmail.com> - 2012-02-15 09:46 -0700
Re: OT: Entitlements [was Re: Python usage numbers] Albert van der Horst <albert@spenarnc.xs4all.nl> - 2012-02-26 12:44 +0000
Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-26 12:35 -0800
Re: OT: Entitlements [was Re: Python usage numbers] Chris Angelico <rosuav@gmail.com> - 2012-02-27 07:50 +1100
Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-26 14:32 -0800
Re: OT: Entitlements Ben Finney <ben+python@benfinney.id.au> - 2012-02-27 07:46 +1100
Re: OT: Entitlements [was Re: Python usage numbers] Chris Angelico <rosuav@gmail.com> - 2012-02-14 07:47 +1100
Re: OT: Entitlements [was Re: Python usage numbers] Michael Torrie <torriem@gmail.com> - 2012-02-13 14:46 -0700
Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-13 16:39 -0800
Re: OT: Entitlements [was Re: Python usage numbers] Michael Torrie <torriem@gmail.com> - 2012-02-13 18:36 -0700
Re: OT: Entitlements [was Re: Python usage numbers] Chris Angelico <rosuav@gmail.com> - 2012-02-14 12:37 +1100
Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-17 17:37 -0800
Re: OT: Entitlements [was Re: Python usage numbers] Tim Wintle <tim.wintle@teamrubber.com> - 2012-02-13 16:41 +0000
Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-14 16:40 -0800
RE: OT: Entitlements [was Re: Python usage numbers] "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-02-17 20:09 +0000
Re: OT: Entitlements [was Re: Python usage numbers] Duncan Booth <duncan.booth@invalid.invalid> - 2012-02-14 11:31 +0000
Re: OT: Entitlements [was Re: Python usage numbers] Devin Jeanpierre <jeanpierreda@gmail.com> - 2012-02-14 07:06 -0500
Re: OT: Entitlements [was Re: Python usage numbers] Rick Johnson <rantingrickjohnson@gmail.com> - 2012-02-14 16:48 -0800
Re: OT: Entitlements [was Re: Python usage numbers] Chris Angelico <rosuav@gmail.com> - 2012-02-15 12:32 +1100
Re: OT: Entitlements [was Re: Python usage numbers] Duncan Booth <duncan.booth@invalid.invalid> - 2012-02-15 09:47 +0000
Re: OT: Entitlements [was Re: Python usage numbers] Arnaud Delobelle <arnodel@gmail.com> - 2012-02-15 09:58 +0000
Re: OT: Entitlements [was Re: Python usage numbers] Duncan Booth <duncan.booth@invalid.invalid> - 2012-02-15 10:04 +0000
Kill files [was Re: OT: Entitlements [was Re: Python usage numbers]] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-02-15 10:27 +0000
Re: Kill files [was Re: OT: Entitlements [was Re: Python usage numbers]] Ethan Furman <ethan@stoneleaf.us> - 2012-02-15 11:29 -0800
Re: OT: Entitlements [was Re: Python usage numbers] rusi <rustompmody@gmail.com> - 2012-02-14 04:56 -0800
Re: OT: Entitlements [was Re: Python usage numbers] Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-02-14 09:37 -0500
Re: Python usage numbers Matej Cepl <mcepl@redhat.com> - 2012-02-12 09:14 +0100
Re: Python usage numbers Matej Cepl <mcepl@redhat.com> - 2012-02-12 09:26 +0100
Re: Python usage numbers Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-02-12 12:11 +0000
Re: Python usage numbers alister <alister.ware@ntlworld.com> - 2012-02-12 18:55 +0000
Re: Python usage numbers jmfauth <wxjmfauth@gmail.com> - 2012-02-12 11:52 -0800
French and IDLE on Windows (was Re: Python usage numbers) Terry Reedy <tjreedy@udel.edu> - 2012-02-12 18:30 -0500
Re: Python usage numbers Anssi Saari <as@sci.fi> - 2012-02-15 11:56 +0200
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2012-02-12 17:07 -0500 |
| Message-ID | <mailman.5738.1329084478.27778.python-list@python.org> |
| In reply to | #20272 |
On 2/12/2012 10:13 AM, Roy Smith wrote:
> Exactly.<soapbox class="wise-old-geezer">. ASCII was so successful
> at becoming a universal standard which lasted for decades,

I think you are overstating the universality and length. I used a machine in the 1970s with 60-bit words that could be interpreted as 10 6-bit characters. IBM used EBCDIC at least into the 1980s. The UCLA machine I used had a translator for ascii terminals that connected by modems. I remember discussing the translation table with the man in charge of it.

Dedicated wordprocessing machines of the 70s and 80s *had* to use something other than plain ascii, as it is inadequate for business text, as opposed to pure computation and labeled number tables. Whether they used extended ascii or something else, I have no idea.

Ascii was, however, as far as I know, the universal basis for the new personal computers starting about 1975, and most importantly, for the IBM PC. But even that actually used its version of extended ascii, as did each wordprocessing program.

> people who
> grew up with it don't realize there was once any other way. Not just
> EBCDIC, but also SIXBIT, RAD-50, tilt/rotate, packed card records,
> and so on. Transcoding was a way of life, and if you didn't know what
> you were starting with and aiming for, it was hopeless.

But because of the limitation of ascii on a worldwide, as opposed to American, basis, we ended up with 100-200 codings for almost as many character sets. This is because the idea of ascii was applied by each nation or language group individually to their local situation.

> Kind of like now where we are again with Unicode.</soapbox>

The situation before ascii is like where we ended up *before* unicode. Unicode aims to replace all those byte encodings and character sets with *one* byte encoding for *one* character set, which will be a great simplification. It is the idea of ascii applied on a global rather than local basis.

Let me repeat. Unicode and utf-8 is a solution to the mess, not the cause. Perhaps we should have a synonym for utf-8: escii, for Earthian Standard Code for Information Interchange.

--
Terry Jan Reedy
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2012-02-12 17:22 -0500 |
| Message-ID | <roy-F2390D.17223312022012@news.panix.com> |
| In reply to | #20296 |
In article <mailman.5738.1329084478.27778.python-list@python.org>,
Terry Reedy <tjreedy@udel.edu> wrote:
> Let me repeat. Unicode and utf-8 is a solution to the mess, not the
> cause. Perhaps we should have a synonym for utf-8: escii, for Earthian
> Standard Code for Information Interchange.

I'm not arguing that Unicode is where we need to get to. Just trying to give a little history.
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-02-13 09:14 +1100 |
| Message-ID | <mailman.5739.1329084873.27778.python-list@python.org> |
| In reply to | #20272 |
On Mon, Feb 13, 2012 at 9:07 AM, Terry Reedy <tjreedy@udel.edu> wrote:
> The situation before ascii is like where we ended up *before* unicode.
> Unicode aims to replace all those byte encodings and character sets with
> *one* byte encoding for *one* character set, which will be a great
> simplification. It is the idea of ascii applied on a global rather than
> local basis.

Unicode doesn't deal with byte encodings; UTF-8 is an encoding, but so are UTF-16, UTF-32, and as many more as you could hope for. But broadly yes, Unicode IS the solution.

ChrisA
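The distinction between the character set and its encodings can be seen directly in Python 3. This is only a small sketch; the sample string is made up for illustration, and the `-le` codec names are chosen so that no byte-order mark is added.

```python
# Same abstract text, three different byte serializations.
text = "na\u00efve caf\u00e9"          # 10 code points, 2 of them non-ASCII

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")      # byte order given explicitly, so no BOM
utf32 = text.encode("utf-32-le")

print(len(text), len(utf8), len(utf16), len(utf32))

# Each byte string decodes back to the identical str.
assert utf8.decode("utf-8") == text
assert utf16.decode("utf-16-le") == text
assert utf32.decode("utf-32-le") == text
```

The string itself has no encoding; only the three byte strings do.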
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2012-02-12 17:27 -0500 |
| Message-ID | <roy-E2270B.17273412022012@news.panix.com> |
| In reply to | #20297 |
In article <mailman.5739.1329084873.27778.python-list@python.org>,
Chris Angelico <rosuav@gmail.com> wrote:
> On Mon, Feb 13, 2012 at 9:07 AM, Terry Reedy <tjreedy@udel.edu> wrote:
> > The situation before ascii is like where we ended up *before* unicode.
> > Unicode aims to replace all those byte encodings and character sets with
> > *one* byte encoding for *one* character set, which will be a great
> > simplification. It is the idea of ascii applied on a global rather than
> > local basis.
>
> Unicode doesn't deal with byte encodings; UTF-8 is an encoding, but so
> are UTF-16, UTF-32, and as many more as you could hope for. But
> broadly yes, Unicode IS the solution.

I could hope for one and only one, but I know I'm just going to be disappointed. The last project I worked on used UTF-8 in most places, but also used some C and Java libraries which were only available for UTF-16. So it was transcoding hell all over the place.

Hopefully, we will eventually reach the point where storage is so cheap that nobody minds how inefficient UTF-32 is and we all just start using that. Life will be a lot simpler then. No more transcoding, a string will be just as many bytes as it is characters, and everybody will be happy again.
| From | Dave Angel <davea@dejaviewphoto.com> |
|---|---|
| Date | 2012-02-12 17:40 -0500 |
| Message-ID | <mailman.5740.1329086473.27778.python-list@python.org> |
| In reply to | #20299 |
On 02/12/2012 05:27 PM, Roy Smith wrote:
> In article<mailman.5739.1329084873.27778.python-list@python.org>,
> Chris Angelico<rosuav@gmail.com> wrote:
>
>> On Mon, Feb 13, 2012 at 9:07 AM, Terry Reedy<tjreedy@udel.edu> wrote:
>>> The situation before ascii is like where we ended up *before* unicode.
>>> Unicode aims to replace all those byte encodings and character sets with
>>> *one* byte encoding for *one* character set, which will be a great
>>> simplification. It is the idea of ascii applied on a global rather than
>>> local basis.
>> Unicode doesn't deal with byte encodings; UTF-8 is an encoding, but so
>> are UTF-16, UTF-32, and as many more as you could hope for. But
>> broadly yes, Unicode IS the solution.
> I could hope for one and only one, but I know I'm just going to be
> disappointed. The last project I worked on used UTF-8 in most places,
> but also used some C and Java libraries which were only available for
> UTF-16. So it was transcoding hell all over the place.
>
> Hopefully, we will eventually reach the point where storage is so cheap
> that nobody minds how inefficient UTF-32 is and we all just start using
> that. Life will be a lot simpler then. No more transcoding, a string
> will be just as many bytes as it is characters, and everybody will be happy
> again.

Keep your in-memory character strings as Unicode, and only serialize (encode) them when they go to/from a device, or to/from anachronistic code. Then the cost is realized at the point of the problem. No different than when deciding how to serialize any other data type. Do it only at the point of entry/exit of your program.

But as long as devices are addressed as bytes, or as anything smaller than 32-bit thingies, you will have encoding issues when writing to the device, and decoding issues when reading. At the very least, you have big-endian/little-endian ways to encode that UCS-4 code point.
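A rough sketch of that advice in Python 3, assuming byte streams at the edges of the program. The helper names `read_text` and `write_text` are hypothetical, invented here for illustration, and `io.BytesIO` stands in for a real device.

```python
import io

def read_text(stream, encoding="utf-8"):
    """Decode bytes to str once, at the point of entry."""
    return stream.read().decode(encoding)

def write_text(stream, text, encoding="utf-8"):
    """Encode str to bytes once, at the point of exit."""
    stream.write(text.encode(encoding))

incoming = io.BytesIO("caf\u00e9".encode("utf-8"))
message = read_text(incoming)      # a str from here on
shouted = message.upper()          # pure text processing, no bytes involved

outgoing = io.BytesIO()
write_text(outgoing, shouted)

# Even a fixed-width form needs a byte order once it is serialized.
big = "A".encode("utf-32-be")
little = "A".encode("utf-32-le")
print(big, little)
```

Everything between the two helpers works on `str` only, so the encoding choice stays confined to the two places where bytes actually cross the boundary.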
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-02-12 23:29 +0000 |
| Message-ID | <4f384b6e$0$29986$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #20299 |
On Sun, 12 Feb 2012 17:27:34 -0500, Roy Smith wrote:
> In article <mailman.5739.1329084873.27778.python-list@python.org>,
> Chris Angelico <rosuav@gmail.com> wrote:
>
>> On Mon, Feb 13, 2012 at 9:07 AM, Terry Reedy <tjreedy@udel.edu> wrote:
>> > The situation before ascii is like where we ended up *before*
>> > unicode. Unicode aims to replace all those byte encodings and
>> > character sets with *one* byte encoding for *one* character set,
>> > which will be a great simplification. It is the idea of ascii applied
>> > on a global rather than local basis.
>>
>> Unicode doesn't deal with byte encodings; UTF-8 is an encoding, but so
>> are UTF-16, UTF-32, and as many more as you could hope for. But broadly
>> yes, Unicode IS the solution.
>
> I could hope for one and only one, but I know I'm just going to be
> disappointed. The last project I worked on used UTF-8 in most places,
> but also used some C and Java libraries which were only available for
> UTF-16. So it was transcoding hell all over the place.

Um, surely the solution to that is to always call a simple wrapper function to the UTF-16 code to handle the transcoding? What do the Design Patterns people call it, a facade? No, an adapter. (I never remember the names...)

Instead of calling library.foo() which only outputs UTF-16, write a wrapper myfoo() which calls foo, captures its output and transcodes it to UTF-8. You have to do that once (per function), but now it works from everywhere, so long as you remember to always call myfoo instead of foo.

> Hopefully, we will eventually reach the point where storage is so cheap
> that nobody minds how inefficient UTF-32 is and we all just start using
> that. Life will be a lot simpler then. No more transcoding, a string
> will be just as many bytes as it is characters, and everybody will be happy
> again.

I think you mean 4 times as many bytes as characters. Unless you have 32 bit bytes :)

--
Steven
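The wrapper idea might look roughly like this in Python 3. Here `foo` is only a hypothetical stand-in for a library call that returns UTF-16 bytes (little-endian is assumed); it is not a real API, and a real library would need one such wrapper per function.

```python
def foo():
    """Hypothetical stand-in for a library call that only returns UTF-16 bytes."""
    return "r\u00e9sum\u00e9".encode("utf-16-le")

def myfoo():
    """Adapter: call foo() and hand back UTF-8 bytes instead."""
    return foo().decode("utf-16-le").encode("utf-8")

print(myfoo())
```

Callers that use `myfoo` never see UTF-16 at all, which is the whole point of the adapter.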
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2012-02-12 18:41 -0500 |
| Message-ID | <roy-F23C47.18412012022012@news.panix.com> |
| In reply to | #20310 |
In article <4f384b6e$0$29986$c3e8da3$5496439d@news.astraweb.com>,
Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> > I could hope for one and only one, but I know I'm just going to be
> > disappointed. The last project I worked on used UTF-8 in most places,
> > but also used some C and Java libraries which were only available for
> > UTF-16. So it was transcoding hell all over the place.
>
> Um, surely the solution to that is to always call a simple wrapper
> function to the UTF-16 code to handle the transcoding? What do the Design
> Patterns people call it, a facade? No, an adapter. (I never remember the
> names...)

I am familiar with the concept. It was ICU. A very big library. Lots of calls. I don't remember the details, I'm sure we wrote wrappers. It was still a mess.

> > Hopefully, we will eventually reach the point where storage is so cheap
> > that nobody minds how inefficient UTF-32 is and we all just start using
> > that. Life will be a lot simpler then. No more transcoding, a string
> > will be just as many bytes as it is characters, and everybody will be happy
> > again.
>
> I think you mean 4 times as many bytes as characters. Unless you have 32
> bit bytes :)

Yes, exactly.
| From | Dave Angel <d@davea.name> |
|---|---|
| Date | 2012-02-12 19:03 -0500 |
| Message-ID | <mailman.5747.1329091472.27778.python-list@python.org> |
| In reply to | #20310 |
On 02/12/2012 06:29 PM, Steven D'Aprano wrote:
> On Sun, 12 Feb 2012 17:27:34 -0500, Roy Smith wrote:
>
>> <SNIP>
>> Hopefully, we will eventually reach the point where storage is so cheap
>> that nobody minds how inefficient UTF-32 is and we all just start using
>> that. Life will be a lot simpler then. No more transcoding, a string
>> will be just as many bytes as it is characters, and everybody will be happy
>> again.
> I think you mean 4 times as many bytes as characters. Unless you have 32
> bit bytes :)
>
>

Until you have 32 bit bytes, you'll continue to have encodings, even if only a couple of them.

--
DaveA
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-02-13 11:59 +1100 |
| Message-ID | <mailman.5750.1329094801.27778.python-list@python.org> |
| In reply to | #20310 |
On Mon, Feb 13, 2012 at 11:03 AM, Dave Angel <d@davea.name> wrote:
> On 02/12/2012 06:29 PM, Steven D'Aprano wrote:
>> I think you mean 4 times as many bytes as characters. Unless you have 32
>> bit bytes :)
>>
>>
> Until you have 32 bit bytes, you'll continue to have encodings, even if only
> a couple of them.

The advantage, though, is that you can always know how many bytes to read for X characters. In ASCII, you allocate 80 bytes of storage and you can store 80 characters. In UTF-8, if you want an 80-character buffer, you can probably get away with allocating 240 bytes... but maybe not. In UTF-32, it's easy - just allocate 320 bytes and you know you can store them. Also, you know exactly where the 17th character is; in UTF-8, you have to count.

That's a huge advantage for in-memory strings; but is it useful on disk, where (as likely as not) you're actually looking for lines, which you still have to scan for? I'm thinking not, so it makes sense to use a smaller disk image than UTF-32 - fewer total bytes means fewer sectors to read/write, which translates fairly directly into performance.

ChrisA
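The buffer arithmetic can be checked with a short Python 3 sketch. The three 80-character sample lines are invented for illustration; the third uses a character outside the Basic Multilingual Plane, which is the "but maybe not" case for a 240-byte UTF-8 buffer.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

def encoded_sizes(text):
    """Map each encoding name to the number of bytes text occupies in it."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}

samples = {
    "ascii": "x" * 80,
    "cjk": "\u6f22" * 80,           # 3 bytes per character in UTF-8
    "emoji": "\U0001F40D" * 80,     # outside the BMP: 4 bytes in UTF-8
}
for name, line in samples.items():
    print(name, encoded_sizes(line))

# Fixed width makes indexing arithmetic: the 17th character of a UTF-32
# buffer always starts at byte offset 16 * 4.
line = "abcdefghijklmnopqrstuvwxyz"
buf = line.encode("utf-32-le")
assert buf[16 * 4:17 * 4].decode("utf-32-le") == line[16]
```

Only the UTF-32 column is the same for all three lines, which is the predictability being described.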
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2012-02-12 20:11 -0500 |
| Message-ID | <roy-F2E8F4.20110312022012@news.panix.com> |
| In reply to | #20319 |
In article <mailman.5750.1329094801.27778.python-list@python.org>,
Chris Angelico <rosuav@gmail.com> wrote:
> The advantage, though, is that you can always know how many bytes to
> read for X characters. In ASCII, you allocate 80 bytes of storage and
> you can store 80 characters. In UTF-8, if you want an 80-character
> buffer, you can probably get away with allocating 240 bytes...
> but maybe not. In UTF-32, it's easy - just allocate 320 bytes and you
> know you can store them. Also, you know exactly where the 17th
> character is; in UTF-8, you have to count. That's a huge advantage for
> in-memory strings; but is it useful on disk, where (as likely as not)
> you're actually looking for lines, which you still have to scan for?
> I'm thinking not, so it makes sense to use a smaller disk image than
> UTF-32 - fewer total bytes means fewer sectors to read/write, which
> translates fairly directly into performance.

You might just write files compressed. My guess is that a typical gzipped UTF-32 text file will be smaller than the same data stored as uncompressed UTF-8.
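That guess is easy to test on your own data. The sketch below uses the standard `gzip` module; the repeated sample sentence is an assumption chosen for brevity and compresses far better than typical prose, so the numbers it prints say little about real files.

```python
import gzip

def compare_sizes(text):
    """Return raw and gzipped byte counts for UTF-8 and UTF-32 forms of text."""
    raw_utf8 = text.encode("utf-8")
    raw_utf32 = text.encode("utf-32-le")
    return {
        "utf-8": len(raw_utf8),
        "utf-8 gzipped": len(gzip.compress(raw_utf8)),
        "utf-32": len(raw_utf32),
        "utf-32 gzipped": len(gzip.compress(raw_utf32)),
    }

sample = "The quick brown fox jumps over the lazy dog. " * 200
for label, size in compare_sizes(sample).items():
    print(label, size)
```

Passing the contents of a real text file to `compare_sizes` gives a fairer comparison than the synthetic sample.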
| From | Christian Heimes <lists@cheimes.de> |
|---|---|
| Date | 2012-02-13 01:00 +0100 |
| Message-ID | <mailman.5746.1329091227.27778.python-list@python.org> |
| In reply to | #20272 |
On 12.02.2012 23:07, Terry Reedy wrote:
> But because of the limitation of ascii on a worldwide, as opposed to
> American basis, we ended up with 100-200 codings for almost as many
> character sets. This is because the idea of ascii was applied by each
> nation or language group individually to their local situation.

You really learn to appreciate unicode when you have to deal with mixed languages in texts and old databases from the 70s and 80s.

I'm working with books that contain medieval German, old German, modern German, English, French, Latin, Hebrew, Arabic, ancient and modern Greek, Rhaeto-Romanic, East European and more languages. Sometimes three or four languages are used in a single book. Some books are more than 700 years old and contain glyphs that aren't covered by unicode yet. Without unicode it would be virtually impossible to deal with it.

Metadata for these books come from old and proprietary databases and are stored in a format that is optimized for magnetic tape. Most people will never have heard about ISO-5426 or ANSEL encoding or about file formats like MAB2, MARC or PICA. It took me quite some time to develop codecs to encode and decode an old and partly undocumented variable-width multibyte encoding that predates UTF-8 by about a decade. Of course every system interprets the undocumented parts slightly differently ...

Unicode and XML are bliss for metadata exchange and long term storage!
| From | Dennis Lee Bieber <wlfraed@ix.netcom.com> |
|---|---|
| Date | 2012-02-12 21:37 -0500 |
| Message-ID | <mailman.5751.1329100699.27778.python-list@python.org> |
| In reply to | #20272 |
On Sun, 12 Feb 2012 17:07:44 -0500, Terry Reedy <tjreedy@udel.edu>
wrote:
>I think you are overstating the universality and length. I used a
>machine in the 1970s with 60-bit words that could be interpreted as 10
>6-bit characters. IBM used EBCDIC at least into the 1980s. The UCLA
The Xerox Sigma series also used EBCDIC (probably not a surprise --
I believe the precursor company, Scientific Data Systems, was founded by
ex-IBM folk). One nice thing about EBCDIC was that, in hex, the
characters could be mapped quite easily with Hollerith cards -- the
lower nybble mapped to the card 0-9 rows, and the high nybble correlated
to the top card rows.
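The nybble structure can be inspected with Python's built-in `cp037` codec. That codec is one common EBCDIC code page and is used here only as a stand-in; which EBCDIC variant the Sigma used is not stated above. The helper name `nybbles` is made up for this sketch.

```python
def nybbles(ch, codec="cp037"):
    """Return the (high, low) nybbles of the single byte ch encodes to."""
    (byte,) = ch.encode(codec)
    return byte >> 4, byte & 0x0F

# Letters fall into three groups by high nybble (C, D, E) and digits use F;
# the low nybble counts within the group.
for ch in "AJS5":
    high, low = nybbles(ch)
    print(ch, format(high, "X"), low)
```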
Of course, the Sigma was a weird machine all by itself, what with
over 200 discrete hardware interrupt vectors, four-bank interleaved
memory, etc.
--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com HTTP://wlfraed.home.netcom.com/
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2012-02-12 22:09 -0500 |
| Message-ID | <mailman.5752.1329102603.27778.python-list@python.org> |
| In reply to | #20272 |
On 2/12/2012 5:14 PM, Chris Angelico wrote:
> On Mon, Feb 13, 2012 at 9:07 AM, Terry Reedy<tjreedy@udel.edu> wrote:
>> The situation before ascii is like where we ended up *before* unicode.
>> Unicode aims to replace all those byte encodings and character sets with
>> *one* byte encoding for *one* character set, which will be a great
>> simplification. It is the idea of ascii applied on a global rather than
>> local basis.
>
> Unicode doesn't deal with byte encodings; UTF-8 is an encoding,

The Unicode Standard specifies 3 UTF storage formats* and 8 UTF byte-oriented transmission formats. UTF-8 is the most common of all encodings for web pages. (And ascii pages are utf-8 also.) It is the only one of the 8 most of us need to bother much with. Look here for the list
http://www.unicode.org/glossary/#U
and for details look in various places in
http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf

> but so are UTF-16, UTF-32,
> and as many more as you could hope for.

All the non-UTF 'as many more as you could hope for' encodings are not part of Unicode.

* The new internal unicode scheme for 3.3 is pretty much a mixture of the 3 storage formats (I am of course, skipping some details) by using the widest one needed for each string. The advantage is avoiding problems with each of the three. The disadvantage is greater internal complexity, but that should be hidden from users. They will not need to care about the internals. They will be able to forget about 'narrow' versus 'wide' builds and the possible requirement to code differently for each. There will only be one scheme that works the same on all platforms. Most apps should require less space and about the same time.

--
Terry Jan Reedy
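On a CPython 3.3 or later interpreter (an assumption; other implementations may store strings differently), the per-string width can be observed indirectly with `sys.getsizeof`. The helper `bytes_per_char` is hypothetical and only estimates storage by comparing two string lengths; it is a sketch, not a description of the internals.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate per-character storage by comparing two string lengths."""
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n

for label, ch in [("ascii", "a"), ("euro sign", "\u20ac"), ("snake emoji", "\U0001F40D")]:
    print(label, bytes_per_char(ch))
```

Strings that need only narrow characters cost less per character than strings containing wider ones, and none of this is visible in ordinary indexing or `len()`.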
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2012-02-12 22:57 -0500 |
| Message-ID | <roy-753AB0.22570112022012@news.panix.com> |
| In reply to | #20323 |
In article <mailman.5752.1329102603.27778.python-list@python.org>,
Terry Reedy <tjreedy@udel.edu> wrote:
> On 2/12/2012 5:14 PM, Chris Angelico wrote:
> > On Mon, Feb 13, 2012 at 9:07 AM, Terry Reedy<tjreedy@udel.edu> wrote:
> >> The situation before ascii is like where we ended up *before* unicode.
> >> Unicode aims to replace all those byte encodings and character sets with
> >> *one* byte encoding for *one* character set, which will be a great
> >> simplification. It is the idea of ascii applied on a global rather than
> >> local basis.
> >
> > Unicode doesn't deal with byte encodings; UTF-8 is an encoding,
>
> The Unicode Standard specifies 3 UTF storage formats* and 8 UTF
> byte-oriented transmission formats. UTF-8 is the most common of all
> encodings for web pages. (And ascii pages are utf-8 also.) It is the
> only one of the 8 most of us need to much bother with. Look here for the
> list
> http://www.unicode.org/glossary/#U
> and for details look in various places in
> http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
>
> > but so are UTF-16, UTF-32.
> > and as many more as you could hope for.
>
> All the non-UTF 'as many more as you could hope for' encodings are not
> part of Unicode.
>
> * The new internal unicode scheme for 3.3 is pretty much a mixture of
> the 3 storage formats (I am of course, skipping some details) by using
> the widest one needed for each string. The advantage is avoiding
> problems with each of the three. The disadvantage is greater internal
> complexity, but that should be hidden from users. They will not need to
> care about the internals. They will be able to forget about 'narrow'
> versus 'wide' builds and the possible requirement to code differently
> for each. There will only be one scheme that works the same on all
> platforms. Most apps should require less space and about the same time.
All that is just fine, but what the heck are we going to do about ascii
art, that's what I want to know. Python just won't be the same in UTF-8.
/^\/^\
_|__| O|
\/ /~ \_/ \
\____|__________/ \
\_______ \
`\ \ \
| | \
/ / \
/ / \\
/ / \ \
/ / \ \
/ / _----_ \ \
/ / _-~ ~-_ | |
( ( _-~ _--_ ~-_ _/ |
\ ~-____-~ _-~ ~-_ ~-_-~ /
~-_ _-~ ~-_ _-~ - jurcy -
~--______-~ ~-___-~
| From | Ben Finney <ben+python@benfinney.id.au> |
|---|---|
| Date | 2012-02-13 15:19 +1100 |
| Message-ID | <87ty2v1i2m.fsf@benfinney.id.au> |
| In reply to | #20326 |
Roy Smith <roy@panix.com> writes:

> All that is just fine, but what the heck are we going to do about ascii
> art, that's what I want to know. Python just won't be the same in
> UTF-8.

If it helps, ASCII art *is* UTF-8 art. So it will be the same in UTF-8.

Or maybe you already knew that, and your sarcasm was lost with the high
bit.

--
 \     “We are all agreed that your theory is crazy. The question that |
  `\  divides us is whether it is crazy enough to have a chance of |
_o__)  being correct.” —Niels Bohr (to Wolfgang Pauli), 1958 |
Ben Finney
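The claim that ASCII art is already UTF-8 art can be checked directly. The following is a small sketch, assuming Python 3; the sample strings are stand-ins chosen for illustration.

```python
# Illustrative only: a short piece of pure-ASCII "art" (stand-in sample).
ascii_art = "/^\\/^\\ _|__| O|"

ascii_bytes = ascii_art.encode("ascii")
utf8_bytes = ascii_art.encode("utf-8")

# For pure-ASCII text the two encodings produce identical bytes,
# and every byte has its high bit clear.
same_bytes = ascii_bytes == utf8_bytes
high_bit_clear = all(b < 0x80 for b in utf8_bytes)

# Non-ASCII text cannot be encoded as ASCII at all, but UTF-8 handles it.
non_ascii = "\u3058\u3057"  # two hiragana characters
try:
    non_ascii.encode("ascii")
    ascii_rejected = False
except UnicodeEncodeError:
    ascii_rejected = True
utf8_length = len(non_ascii.encode("utf-8"))

print(same_bytes, high_bit_clear, ascii_rejected, utf8_length)
```

Each of the two hiragana characters takes three bytes in UTF-8, so the encoded length is 6.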
| From | Andrew Berg <bahamutzero8825@gmail.com> |
|---|---|
| Date | 2012-02-13 12:26 -0600 |
| Message-ID | <mailman.5768.1329157620.27778.python-list@python.org> |
| In reply to | #20329 |
On 2/12/2012 10:19 PM, Ben Finney wrote:
> If it helps, ASCII art *is* UTF-8 art. So it will be the same in UTF-8.

As will non-ASCII text art:

/l、
゙(゚、 。 7
l、゙ ~ヽ
じしf_, )ノ

--
CPython 3.2.2 | Windows NT 6.1.7601.17640
| From | jmfauth <wxjmfauth@gmail.com> |
|---|---|
| Date | 2012-02-14 00:00 -0800 |
| Message-ID | <2076e822-6225-449d-8d93-bf8d1627e77a@l14g2000vbe.googlegroups.com> |
| In reply to | #20323 |
On 13 Feb, 04:09, Terry Reedy <tjre...@udel.edu> wrote:
>
> * The new internal unicode scheme for 3.3 is pretty much a mixture of
> the 3 storage formats (I am of course, skipping some details) by using
> the widest one needed for each string. The advantage is avoiding
> problems with each of the three. The disadvantage is greater internal
> complexity, but that should be hidden from users. They will not need to
> care about the internals. They will be able to forget about 'narrow'
> versus 'wide' builds and the possible requirement to code differently
> for each. There will only be one scheme that works the same on all
> platforms. Most apps should require less space and about the same time.
>
> --

Python 2 was built for ascii users. Now, Python 3(.3) is
*optimized* for the ascii users.

And the rest of the crowd? Not so sure. French users
(among others) who cannot write their texts with
iso-8859-1/latin1 will be very happy.

No doubt, it will work. Is this, however, the correct
approach?

jmf
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-02-12 06:10 +0000 |
| Message-ID | <4f3757cc$0$29986$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #20245 |
On Sat, 11 Feb 2012 18:36:52 -0800, Rick Johnson wrote:

>> "I have a file containing text. I can open it in an editor and see it's
>> nearly all ASCII text, except for a few weird and bizarre characters
>> like £ © ± or ö. In Python 2, I can read that file fine. In Python 3 I
>> get an error. What should I do that requires no thought?"
>>
>> Obvious answers:
>
> the most obvious answer would be to read the file WITHOUT worrying about
> asinine encoding.

Your mad leet reading comprehension skillz leave me in awe Rick.

If you try to read a file containing non-ASCII characters encoded using
UTF8 on Windows without explicitly specifying either UTF8 as the
encoding, or an error handler, you will get an exception.

It's not just UTF8 either, but nearly all encodings. You can't even
expect to avoid problems if you stick to nothing but Windows, because
Windows' default encoding is localised: a file generated in (say) Israel
or Japan or Germany will use a different code page (encoding) by default
than one generated in (say) the US, Canada or UK.

--
Steven
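The failure mode described above can be sketched as follows, assuming Python 3. The `ascii` codec is used here only as a deterministic stand-in for a default encoding that does not match the file; an actual Windows code page may raise on some byte sequences and silently produce wrong characters on others. The sample text and the temporary file are made up for illustration.

```python
import os
import tempfile

# Sample text: mostly ASCII plus pound, copyright, plus-minus and o-umlaut.
text = "mostly ASCII, plus \u00a3 \u00a9 \u00b1 \u00f6"

# Write the text to a temporary file as UTF-8 bytes.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as f:
    f.write(text.encode("utf-8"))
    path = f.name

try:
    # 1. Decoding with a codec that does not match the file fails in strict mode.
    try:
        with open(path, encoding="ascii") as f:
            f.read()
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

    # 2. Naming the real encoding recovers the text exactly.
    with open(path, encoding="utf-8") as f:
        decoded = f.read()

    # 3. An error handler avoids the exception but loses information:
    #    undecodable bytes become U+FFFD replacement characters.
    with open(path, encoding="ascii", errors="replace") as f:
        lossy = f.read()
finally:
    os.remove(path)

print(strict_failed, decoded == text, "\ufffd" in lossy)
```

Passing `encoding=` explicitly is the only one of the three options that returns the original text. The error handler merely trades the exception for data loss.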
| From | Andrew Berg <bahamutzero8825@gmail.com> |
|---|---|
| Date | 2012-02-12 01:05 -0600 |
| Message-ID | <mailman.5718.1329030346.27778.python-list@python.org> |
| In reply to | #20252 |
On 2/12/2012 12:10 AM, Steven D'Aprano wrote:
> It's not just UTF8 either, but nearly all encodings. You can't even
> expect to avoid problems if you stick to nothing but Windows, because
> Windows' default encoding is localised: a file generated in (say) Israel
> or Japan or Germany will use a different code page (encoding) by default
> than one generated in (say) the US, Canada or UK.

Generated by what? Windows will store a locale value for programs to
use, but programs use Unicode internally by default (i.e., API calls are
Unicode unless they were built for old versions of Windows), and the
default filesystem (NTFS) uses Unicode for file names. AFAIK, only the
terminal has a localized code page by default.

Perhaps Notepad will write text files with the localized code page by
default, but that's an application choice...

--
CPython 3.2.2 | Windows NT 6.1.7601.17640
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-02-12 09:12 +0000 |
| Message-ID | <4f378298$0$29986$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #20254 |
On Sun, 12 Feb 2012 01:05:35 -0600, Andrew Berg wrote:

> On 2/12/2012 12:10 AM, Steven D'Aprano wrote:
>> It's not just UTF8 either, but nearly all encodings. You can't even
>> expect to avoid problems if you stick to nothing but Windows, because
>> Windows' default encoding is localised: a file generated in (say)
>> Israel or Japan or Germany will use a different code page (encoding) by
>> default than one generated in (say) the US, Canada or UK.
> Generated by what? Windows will store a locale value for programs to
> use, but programs use Unicode internally by default

Which programs? And we're not talking about what they use internally, but
what they write to files.

> (i.e., API calls are
> Unicode unless they were built for old versions of Windows), and the
> default filesystem (NTFS) uses Unicode for file names.

No. File systems do not use Unicode for file names. Unicode is an abstract
mapping between code points and characters. File systems are written
using bytes.

Suppose you're a fan of Russian punk band Наӥв and you have a directory of
their music. The file system doesn't store the Unicode code points 1053
1072 1253 1074, it has to be encoded to a sequence of bytes first.

NTFS by default uses the UTF-16 encoding, which means the actual bytes
written to disk are \x1d\x040\x04\xe5\x042\x04 (possibly with a leading
byte-order mark \xff\xfe).

Windows has two separate APIs, one for "wide" characters, the other for
single bytes. Depending on which one you use, the directory will appear to
be called Наӥв or 0å2.

But in any case, we're not talking about the file name encoding. We're
talking about the contents of files.

> AFAIK, only the
> terminal has a localized code page by default. Perhaps Notepad will
> write text files with the localized code page by default, but that's an
> application choice...

Exactly. And unless you know what encoding the application chooses, you
will likely get an exception trying to read the file.

--
Steven
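The code points and bytes quoted above can be reproduced with a short sketch, assuming Python 3. It only demonstrates the string-to-bytes arithmetic; it makes no claim about how NTFS or the Windows APIs behave. Latin-1 is used as an assumed single-byte interpretation to show where "0å2" comes from.

```python
import codecs

# The band name, spelled with escapes so the source file stays ASCII.
name = "\u041d\u0430\u04e5\u0432"

# Abstract code points, as decimal integers.
code_points = [ord(ch) for ch in name]

# Little-endian UTF-16 without a byte-order mark.
utf16 = name.encode("utf-16-le")

# The same bytes with the little-endian BOM (FF FE) prepended.
with_bom = codecs.BOM_UTF16_LE + utf16

# Reading those bytes one at a time as Latin-1 yields control characters
# interleaved with a few printable ones.
as_latin1 = utf16.decode("latin-1")
visible = "".join(ch for ch in as_latin1 if ch.isprintable())

print(code_points)
print(utf16)
print(with_bom[:2])
print(ascii(visible))
```

Only three of the eight bytes map to printable Latin-1 characters. The rest are control characters that most displays drop or render as boxes.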