Groups > comp.lang.python > #50110 > unrolled thread
| Started by | blatt <ferdy.blatsco@gmail.com> |
|---|---|
| First post | 2013-07-07 17:22 -0700 |
| Last post | 2013-07-13 04:51 +0000 |
| Articles | 20 on this page of 49 — 15 participants |
hex dump w/ or w/out utf-8 chars blatt <ferdy.blatsco@gmail.com> - 2013-07-07 17:22 -0700
Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-08 11:17 +1000
Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-08 05:48 +0000
Re: hex dump w/ or w/out utf-8 chars ferdy.blatsco@gmail.com - 2013-07-08 10:31 -0700
Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-09 03:52 +1000
Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-11 06:18 -0700
Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-11 23:32 +1000
Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-11 11:42 -0700
Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-11 11:44 -0700
Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-12 03:18 +0000
Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-12 14:42 -0700
Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-12 12:16 +1000
Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-13 00:56 -0700
Re: hex dump w/ or w/out utf-8 chars Lele Gaifax <lele@metapensiero.it> - 2013-07-13 10:24 +0200
Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-13 09:36 +0000
Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-13 19:46 +1000
Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-13 09:49 +0000
Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-13 20:09 +1000
Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-13 07:37 -0700
Re: hex dump w/ or w/out utf-8 chars Dave Angel <davea@davea.name> - 2013-07-13 15:02 -0400
Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-14 01:20 -0700
Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-14 10:44 +0000
Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-14 06:44 -0700
Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-24 06:28 -0700
Re: hex dump w/ or w/out utf-8 chars Neil Hodgson <nhodgson@iinet.net.au> - 2013-07-14 09:17 +1000
Re: hex dump w/ or w/out utf-8 chars ferdy.blatsco@gmail.com - 2013-07-08 10:53 -0700
Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-09 04:07 +1000
Re: hex dump w/ or w/out utf-8 chars Dave Angel <davea@davea.name> - 2013-07-08 16:56 -0400
Re: hex dump w/ or w/out utf-8 chars Neil Cerutti <neilc@norwich.edu> - 2013-07-09 12:22 +0000
Re: hex dump w/ or w/out utf-8 chars Dave Angel <davea@davea.name> - 2013-07-09 08:54 -0400
Re: hex dump w/ or w/out utf-8 chars Neil Cerutti <neilc@norwich.edu> - 2013-07-09 13:00 +0000
Re: hex dump w/ or w/out utf-8 chars Skip Montanaro <skip@pobox.com> - 2013-07-09 08:18 -0500
Re: hex dump w/ or w/out utf-8 chars Dave Angel <davea@davea.name> - 2013-07-09 09:23 -0400
Re: hex dump w/ or w/out utf-8 chars MRAB <python@mrabarnett.plus.com> - 2013-07-08 22:38 +0100
Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-09 07:49 +1000
Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-09 06:53 +0000
Re: hex dump w/ or w/out utf-8 chars Joshua Landau <joshua.landau.ws@gmail.com> - 2013-07-08 23:02 +0100
Re: hex dump w/ or w/out utf-8 chars Dave Angel <davea@davea.name> - 2013-07-08 18:45 -0400
Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-09 08:51 +1000
Re: hex dump w/ or w/out utf-8 chars MRAB <python@mrabarnett.plus.com> - 2013-07-09 00:32 +0100
Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-09 06:46 +0000
Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-09 07:00 +0000
Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-09 02:34 -0700
Re: hex dump w/ or w/out utf-8 chars Chris “Kwpolska” Warrick <kwpolska@gmail.com> - 2013-07-09 12:15 +0200
Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-09 16:32 +0000
Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-10 01:52 -0700
Re: hex dump w/ or w/out utf-8 chars Joshua Landau <joshua@landau.ws> - 2013-07-12 23:01 +0100
Re: hex dump w/ or w/out utf-8 chars Tim Roberts <timr@probo.com> - 2013-07-12 20:42 -0700
Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-13 04:51 +0000
Page 2 of 3 — ← Prev page 1 [2] 3 Next page →
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-07-14 01:20 -0700 |
| Message-ID | <69df4d48-4cb8-4102-b80c-247f8fd07f65@googlegroups.com> |
| In reply to | #50611 |
On Saturday, 13 July 2013 21:02:24 UTC+2, Dave Angel wrote:
> On 07/13/2013 10:37 AM, wxjmfauth@gmail.com wrote:
>
> Fortunately for us, Python (in version 3.3 and later) and Pike did it
> right. Some day the others may decide to do similarly.

-----------
Possible, but I doubt it. For a very simple reason, the latin-1 block:
considered and accepted today as being a Unicode design mistake.

jmf
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-07-14 10:44 +0000 |
| Message-ID | <51e280fc$0$9505$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #50634 |
On Sun, 14 Jul 2013 01:20:33 -0700, wxjmfauth wrote:
> For a very simple reason, the latin-1 block: considered and accepted
> today as being a Unicode design mistake.

Latin-1 (also known as ISO-8859-1) was based on DEC's "Multinational
Character Set", which goes back to 1983. ISO-8859-1 was first published
in 1985, and was in use on Commodore computers the same year.

The concept of Unicode wasn't even started until 1987, and the first
draft wasn't published until the end of 1990. Unicode wasn't considered
ready for production use until 1991, six years after Latin-1 was already
in use in people's computers.

--
Steven
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-07-14 06:44 -0700 |
| Message-ID | <ea260eab-4361-4378-b61b-d33224d2ff5d@googlegroups.com> |
| In reply to | #50638 |
On Sunday, 14 July 2013 12:44:12 UTC+2, Steven D'Aprano wrote:
> On Sun, 14 Jul 2013 01:20:33 -0700, wxjmfauth wrote:
>
> > For a very simple reason, the latin-1 block: considered and accepted
> > today as being a Unicode design mistake.
>
> Latin-1 (also known as ISO-8859-1) was based on DEC's "Multinational
> Character Set", which goes back to 1983. ISO-8859-1 was first published
> in 1985, and was in use on Commodore computers the same year.
>
> The concept of Unicode wasn't even started until 1987, and the first
> draft wasn't published until the end of 1990. Unicode wasn't considered
> ready for production use until 1991, six years after Latin-1 was already
> in use in people's computers.
>
> --
> Steven

------
"Unicode" (in fact iso-14xxx) was not created in one
night (Deus ex machina).
What counts today is this:
>>> timeit.repeat("a = 'hundred'; 'x' in a")
[0.11785943134991479, 0.09850454944486256, 0.09761604599423179]
>>> timeit.repeat("a = 'hundreœ'; 'x' in a")
[0.23955250303158593, 0.2195812612416752, 0.22133896997401692]
>>>
>>>
>>> sys.getsizeof('d')
26
>>> sys.getsizeof('œ')
40
>>> sys.version
'3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)]'
jmf
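
The two `sys.getsizeof` figures above compare one-character strings, so most of the 14-byte gap is likely fixed object header rather than per-character storage. A minimal sketch, assuming CPython 3.3 or later (where PEP 393's flexible string representation applies), that cancels the header out; `bytes_per_char` is a hypothetical helper name, not something from the thread:

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character for strings made only of ch.

    Subtracting the size of the 1-character string cancels the fixed
    object header, leaving only the per-character cost (CPython-specific).
    """
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n

# One sample each: ASCII, Latin-1, rest of the BMP, outside the BMP.
for ch in ("d", "\u00e9", "\u0153", "\U0001F600"):
    print("U+%04X" % ord(ch), bytes_per_char(ch), "byte(s) per character")
```

On such an interpreter the four samples should come out at 1, 1, 2 and 4 bytes per character; absolute `getsizeof` values differ between 32-bit and 64-bit builds.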
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-07-24 06:28 -0700 |
| Message-ID | <696caa4f-142a-4e46-88fc-090da94ced2e@googlegroups.com> |
| In reply to | #50640 |
I cannot find the thread where a Python core dev spoke
about French, so I'm putting this here.
This stupid Flexible String Representation splits Unicode
into chunks, and one of these chunks is latin-1 (iso-8859-1).
If we consider that latin-1 is unusable for 17 (seventeen)
European languages based on the latin alphabet, one cannot
say Python is really well prepared.
Most of the problems come from the extensive usage of
diacritics in these languages. Thanks to the FSR again,
working with normalized forms does not work very well. At
least, there is some consistency.
Now, if we consider that most of the new characters will
be part of the BMP ("daily" used chars), it is hard to
present Python as a modern language. It sticks more
to the past and is not really prepared for the future,
the acceptance of new chars like ẞ or the new Turkish lira
sign (U+20BA).
>>> sys.getsizeof('š')
40
>>> sys.getsizeof('0')
26
14 bytes to encode a non-latin-1 char is not so bad.
jmf
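
The terms in this post (Latin-1 coverage, normalized forms, BMP membership) can each be checked from the standard library. A minimal sketch, with `fits_latin1` as a hypothetical helper name; it only inspects code points and makes no claim about speed:

```python
import unicodedata

def fits_latin1(text):
    """Return True if every character of text can be encoded as ISO-8859-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

s_caron = "\u0161"                                   # LATIN SMALL LETTER S WITH CARON
decomposed = unicodedata.normalize("NFD", s_caron)   # 's' followed by U+030C
recomposed = unicodedata.normalize("NFC", decomposed)

print(fits_latin1("\u00e9"), fits_latin1(s_caron))
print([hex(ord(c)) for c in decomposed], recomposed == s_caron)

# Capital sharp s (U+1E9E) and the lira sign (U+20BA) both sit in the BMP.
print(all(ord(c) <= 0xFFFF for c in "\u1e9e\u20ba"))
```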
| From | Neil Hodgson <nhodgson@iinet.net.au> |
|---|---|
| Date | 2013-07-14 09:17 +1000 |
| Message-ID | <KP6dnXNvYYEDfXzMnZ2dnUVZ_qCdnZ2d@westnet.com.au> |
| In reply to | #50596 |
wxjmfauth@gmail.com:
> The FSR is naive and badly working. I can not force people
> to understand the coding of the characters [*].
You could at least *try*.
If there really was a problem with the FSR and you truly understood
this problem then surely you would be able to communicate the problem to
at least one person on the list.
Neil
| From | ferdy.blatsco@gmail.com |
|---|---|
| Date | 2013-07-08 10:53 -0700 |
| Message-ID | <7b6fc645-8bf3-4681-821c-38fb1fa1d191@googlegroups.com> |
| In reply to | #50110 |
Hi Steven,

thank you for your reply... I really needed another python guru which
is also an English teacher! Sorry if English is not my mother tongue...
"uncorrect" instead of "incorrect" (I misapplied the "similarity
principle" like "unpleasant...>...uncorrect").

Apart from these trifles, you said:
>> All characters are UTF-8, characters. "a" is a UTF-8 character. So is "ă".
Not using python 3, for me (a programmer which was present at the beginning of
computer science, badly interacting with many languages from assembler to
Fortran and from c to Pascal and so on) it was an hard job to arrange the
abrupt transition from characters only equal to bytes to some special
characters defined with 2, 3 bytes and even more.
I should have preferred another solution... but i'm not Guido....!

I said:
> in the first version the utf-8 conversion to hex was shown horizontally
And you replied:
>> Oh! We're supposed to read the output *downwards*!
You are correct, but I was only referring to "special characters"...
My main concern was compactness of output and besides that every group of
bytes used for defining "special characters" is well represented with high
nibble in the range outside ascii 0-127.
Your following observations are connected more or less to the above point
and sorry if the interpretation of output... sucks!
I think that, for the interested user, all the question is of minor
importance.

Only another point is relevant for me:
>> The loop variable just gets reset once it reaches the top of the loop
>> again.
Apart your kind observation (... "hideously ugly to read") referring to my
code snippet incrementing the loop variable... you are correct.
I will never make the same mistake!

Bye, Blatt.
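
The loop-variable point quoted above can be shown in a few lines. This is a sketch with made-up function names, not the code from the original post: a `for` loop rebinds its variable from the iterator on every pass, so incrementing it inside the body skips nothing, whereas a `while` loop keeps whatever value the body leaves behind.

```python
def visited_with_for():
    """Collect the indices a for loop visits despite an in-body increment."""
    seen = []
    for i in range(6):
        seen.append(i)
        i += 2          # discarded: the next pass rebinds i from range()
    return seen

def visited_with_while():
    """Collect the indices a while loop visits when the body advances by 3."""
    seen = []
    i = 0
    while i < 6:
        seen.append(i)
        i += 3          # honoured: nothing resets i between passes
    return seen

print(visited_with_for())
print(visited_with_while())
```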
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-07-09 04:07 +1000 |
| Message-ID | <mailman.4393.1373306845.3114.python-list@python.org> |
| In reply to | #50165 |
On Tue, Jul 9, 2013 at 3:53 AM, <ferdy.blatsco@gmail.com> wrote:
>>> All characters are UTF-8, characters. "a" is a UTF-8 character. So is "ă".
> Not using python 3, for me (a programmer which was present at the beginning of
> computer science, badly interacting with many languages from assembler to
> Fortran and from c to Pascal and so on) it was an hard job to arrange the
> abrupt transition from characters only equal to bytes to some special
> characters defined with 2, 3 bytes and even more.

Even back then, bytes and characters were different. 'A' is a
character, 0x41 is a byte. And they correspond 1:1 if and only if you
know that your characters are represented in ASCII. Other encodings
(eg EBCDIC) mapped things differently. The only difference now is that
more people are becoming aware that there are more than 256 characters
in the world.

Like Magic 2014 and its treatment of Slivers, at some point you're
going to have to master the difference between bytes and characters,
or else be eternally hacking around stuff in your code, so now is as
good a time as any.

ChrisA
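
The ASCII versus EBCDIC remark can be tried directly in Python 3. A small sketch, assuming `cp037` (one of several EBCDIC code pages shipped with Python) as the stand-in for "EBCDIC":

```python
letter = "A"
ascii_bytes = letter.encode("ascii")    # ASCII stores 'A' as the byte 0x41
ebcdic_bytes = letter.encode("cp037")   # this EBCDIC code page uses 0xC1

print(ascii_bytes.hex(), ebcdic_bytes.hex())

# The same byte means different characters under different encodings.
same_byte = b"\xc1"
print(repr(same_byte.decode("cp037")), repr(same_byte.decode("latin-1")))
```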
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2013-07-08 16:56 -0400 |
| Message-ID | <mailman.4397.1373317033.3114.python-list@python.org> |
| In reply to | #50165 |
On 07/08/2013 01:53 PM, ferdy.blatsco@gmail.com wrote:
> Hi Steven,
>
> thank you for your reply... I really needed another python guru which
> is also an English teacher! Sorry if English is not my mother tongue...
> "uncorrect" instead of "incorrect" (I misapplied the "similarity
> principle" like "unpleasant...>...uncorrect").
>
> Apart from these trifles, you said:
>>> All characters are UTF-8, characters. "a" is a UTF-8 character. So is "ă".
> Not using python 3, for me (a programmer which was present at the beginning of
> computer science, badly interacting with many languages from assembler to
> Fortran and from c to Pascal and so on) it was an hard job to arrange the
> abrupt transition from characters only equal to bytes to some special
> characters defined with 2, 3 bytes and even more.

Characters do not have a width. They are Unicode code points, an
abstraction. It's only when you encode them in byte strings that a code
point takes on any specific width. And some encodings go to one-byte
strings (and get errors for characters that don't match), some go to
two-bytes each, some variable, etc.

> I should have preferred another solution... but i'm not Guido....!

But Unicode has nothing to do with Guido, and it has existed for about
25 years (if I recall correctly). It's only that Python 3 is finally
embracing it, and making it the default type for characters, as it
should be. As far as I'm concerned, the only reason it shouldn't have
been done long ago was that programs were trying to fit on 640k DOS
machines. Even before Unicode, there were multi-byte encodings around
(eg. Microsoft's MBCS), and each was thoroughly incompatible with all
the others. And the problem with one-byte encodings is that if you need
to use a Greek currency symbol in a document that's mostly Norwegian (or
some such combination of characters), there might not be ANY valid way
to do it within a single "character set."

Python 2 supports all the same Unicode features as 3; it's just that it
defaults to byte strings. So it's HARDER to get it right.

Except for special purpose programs like a file dumper, it's usually
unnecessary for a Python 3 programmer to deal with individual bytes from
a byte string. Text files are a bunch of bytes, and somebody has to
interpret them as characters. If you let open() handle it, and if you
give it the correct encoding, it just works. Internally, all strings
are Unicode, and you don't care where they came from, or what human
language they may have characters from. You can combine strings from
multiple places, without much worry that they might interfere.

Windows NT/2000/XP/Vista/7 has used Unicode for its file system (NTFS)
from the beginning (approx 1992), and has had Unicode versions of each
of its API's for nearly as long.

I appreciate you've been around a long time, and worked in a lot of
languages. I've programmed professionally in at least 35 languages
since 1967. But we've come a long way from the 6bit characters I used
in 1968. At that time, we packed them 10 characters to each word.

--
DaveA
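
Since the thread began with a hex dumper, here is a minimal Python 3 sketch of the point that a code point only gains a width once it is encoded. `dump_chars` is a hypothetical helper, not the original poster's program:

```python
def dump_chars(text, encoding="utf-8"):
    """Return (character, hex bytes) pairs, one per code point in text."""
    rows = []
    for ch in text:
        raw = ch.encode(encoding)   # raises UnicodeEncodeError if unsupported
        rows.append((ch, " ".join(f"{b:02x}" for b in raw)))
    return rows

# The same two code points take different widths under different encodings.
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(enc, dump_chars("a\u0103", enc))
```

With a one-byte encoding such as `latin-1`, the call fails for U+0103 with `UnicodeEncodeError`, which matches the "get errors for characters that don't match" case described above.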
| From | Neil Cerutti <neilc@norwich.edu> |
|---|---|
| Date | 2013-07-09 12:22 +0000 |
| Message-ID | <b42dk9F56csU3@mid.individual.net> |
| In reply to | #50171 |
On 2013-07-08, Dave Angel <davea@davea.name> wrote:
> I appreciate you've been around a long time, and worked in a
> lot of languages. I've programmed professionally in at least
> 35 languages since 1967. But we've come a long way from the
> 6bit characters I used in 1968. At that time, we packed them
> 10 characters to each word.

One of the first Python projects I undertook was a program to dump
the ZSCII strings from Infocom game files. They are mostly packed
one character per 5 bits, with escapes to (I had to recheck the
Z-machine spec) latin-1. Oh, those clever implementors: thwarting
hexdumping cheaters and cramming their games onto microcomputers
with one blow.

--
Neil Cerutti
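
A rough sketch of the 5-bit unpacking described above. The layout is an assumption taken from the community-written Z-machine standard rather than from this post: three 5-bit codes per big-endian 16-bit word, with the top bit marking the final word. Only the lower-case alphabet is mapped; shifts, abbreviations and the escape sequences are deliberately left out, and both function names are made up.

```python
A0 = "abcdefghijklmnopqrstuvwxyz"   # codes 6..31 in the lower-case alphabet

def unpack_zchars(data):
    """Split big-endian 16-bit words into three 5-bit codes each."""
    codes = []
    for i in range(0, len(data) - 1, 2):
        word = (data[i] << 8) | data[i + 1]
        codes.extend([(word >> 10) & 0x1F, (word >> 5) & 0x1F, word & 0x1F])
        if word & 0x8000:           # top bit set: last word of the string
            break
    return codes

def decode_simple(data):
    """Map codes to text using only the lower-case alphabet."""
    out = []
    for code in unpack_zchars(data):
        if code == 0:
            out.append(" ")
        elif code >= 6:
            out.append(A0[code - 6])
        else:
            out.append("?")         # shift/abbreviation codes are not handled
    return "".join(out)

print(repr(decode_simple(bytes([0x35, 0x51, 0xC6, 0x80]))))
```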
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2013-07-09 08:54 -0400 |
| Message-ID | <mailman.4447.1373374482.3114.python-list@python.org> |
| In reply to | #50237 |
On 07/09/2013 08:22 AM, Neil Cerutti wrote:
> On 2013-07-08, Dave Angel <davea@davea.name> wrote:
>> I appreciate you've been around a long time, and worked in a
>> lot of languages. I've programmed professionally in at least
>> 35 languages since 1967. But we've come a long way from the
>> 6bit characters I used in 1968. At that time, we packed them
>> 10 characters to each word.
>
> One of the first Python projects I undertook was a program to dump
> the ZSCII strings from Infocom game files. They are mostly packed
> one character per 5 bits, with escapes to (I had to recheck the
> Z-machine spec) latin-1. Oh, those clever implementors: thwarting
> hexdumping cheaters and cramming their games onto microcomputers
> with one blow.
>

In 1973 I played with encoding some data that came over the public
airwaves (I never learned the specific radio technology, probably used
sidebands of FM stations). The data was encoded, with most characters
taking 5 bits, and the decoded stream was like a ticker-tape. With some
hardware and the right software, you could track Wall Street in real
time. (Or maybe it had the usual 15 minute delay).

Obviously, they didn't publish the spec any place. But some others had
the beginnings of a decoder, and I expanded on that. We never did
anything with it, it was just an interesting challenge.

--
DaveA
| From | Neil Cerutti <neilc@norwich.edu> |
|---|---|
| Date | 2013-07-09 13:00 +0000 |
| Message-ID | <b42frvF5qscU1@mid.individual.net> |
| In reply to | #50238 |
On 2013-07-09, Dave Angel <davea@davea.name> wrote:
>> One of the first Python projects I undertook was a program to
>> dump the ZSCII strings from Infocom game files. They are
>> mostly packed one character per 5 bits, with escapes to (I had
>> to recheck the Z-machine spec) latin-1. Oh, those clever
>> implementors: thwarting hexdumping cheaters and cramming their
>> games onto microcomputers with one blow.
>
> In 1973 I played with encoding some data that came over the
> public airwaves (I never learned the specific radio technology,
> probably used sidebands of FM stations). The data was encoded,
> with most characters taking 5 bits, and the decoded stream was
> like a ticker-tape. With some hardware and the right software,
> you could track Wall Street in real time. (Or maybe it had the
> usual 15 minute delay).
>
> Obviously, they didn't publish the spec any place. But some
> others had the beginnings of a decoder, and I expanded on that.
> We never did anything with it, it was just an interesting
> challenge.

Interestingly similar scheme. I wonder if 5-bit chars was a
common compression scheme. The Z-machine spec was never
officially published either. I believe a "task force" reverse
engineered it sometime in the 90's.

--
Neil Cerutti
| From | Skip Montanaro <skip@pobox.com> |
|---|---|
| Date | 2013-07-09 08:18 -0500 |
| Message-ID | <mailman.4448.1373375918.3114.python-list@python.org> |
| In reply to | #50239 |
> I wonder if 5-bit chars was a
> common compression scheme.

http://en.wikipedia.org/wiki/List_of_binary_codes

Baudot was pretty common, as I recall, though ASCII and EBCDIC ruled by
the time I started punching cards.

Skip
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2013-07-09 09:23 -0400 |
| Message-ID | <mailman.4449.1373376257.3114.python-list@python.org> |
| In reply to | #50239 |
On 07/09/2013 09:00 AM, Neil Cerutti wrote:
<SNIP>
> Interestingly similar scheme. I wonder if 5-bit chars was a
> common compression scheme. The Z-machine spec was never
> officially published either. I believe a "task force" reverse
> engineered it sometime in the 90's.
>
Baudot was 5 bits. It used shift-codes to get upper case and digits, if
I recall.
And ASCII was 7 bits so there could be one more for parity.
--
DaveA
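
The "one more for parity" remark can be sketched as follows. Even parity is an assumption chosen for illustration (real links used even, odd, mark or space parity), and `with_even_parity` is a made-up name:

```python
def with_even_parity(code):
    """Return an 8-bit value: the 7-bit code with a parity bit in bit 7.

    The parity bit is set so that the total number of 1 bits is even.
    """
    if not 0 <= code < 128:
        raise ValueError("expected a 7-bit code")
    ones = bin(code).count("1")
    return code | ((ones & 1) << 7)

for ch in "AC":
    print(ch, format(with_even_parity(ord(ch)), "08b"))
```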
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2013-07-08 22:38 +0100 |
| Message-ID | <mailman.4404.1373319468.3114.python-list@python.org> |
| In reply to | #50165 |
On 08/07/2013 21:56, Dave Angel wrote:
> On 07/08/2013 01:53 PM, ferdy.blatsco@gmail.com wrote:
>> Hi Steven,
>>
>> thank you for your reply... I really needed another python guru which
>> is also an English teacher! Sorry if English is not my mother tongue...
>> "uncorrect" instead of "incorrect" (I misapplied the "similarity
>> principle" like "unpleasant...>...uncorrect").
>>
>> Apart from these trifles, you said:
>>>> All characters are UTF-8, characters. "a" is a UTF-8 character. So is "ă".
>> Not using python 3, for me (a programmer which was present at the beginning of
>> computer science, badly interacting with many languages from assembler to
>> Fortran and from c to Pascal and so on) it was an hard job to arrange the
>> abrupt transition from characters only equal to bytes to some special
>> characters defined with 2, 3 bytes and even more.
>
> Characters do not have a width.
[snip]
It depends what you mean by "width"! :-)
Try this (Python 3):
>>> print("A\N{FULLWIDTH LATIN CAPITAL LETTER A}")
AＡ
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-07-09 07:49 +1000 |
| Message-ID | <mailman.4405.1373320188.3114.python-list@python.org> |
| In reply to | #50165 |
On Tue, Jul 9, 2013 at 6:56 AM, Dave Angel <davea@davea.name> wrote:
> But Unicode has nothing to do with Guido, and it has existed for about 25
> years (if I recall correctly).

Depends how you measure. According to [1], the work kinda began back
then (25 years ago being 1988), but it wasn't till 1991/92 that the
spec was published. Also, the full Unicode range with multiple planes
came about in 1996, with Unicode 2.0, so that could also be considered
the beginning of Unicode. But that still means it's nearly old enough
to drink, so programmers ought to be aware of it.

[1] http://en.wikipedia.org/wiki/Unicode#History

ChrisA
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-07-09 06:53 +0000 |
| Message-ID | <51dbb372$0$6512$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #50179 |
On Tue, 09 Jul 2013 07:49:45 +1000, Chris Angelico wrote:
> On Tue, Jul 9, 2013 at 6:56 AM, Dave Angel <davea@davea.name> wrote:
>> But Unicode has nothing to do with Guido, and it has existed for about
>> 25 years (if I recall correctly).
>
> Depends how you measure. According to [1], the work kinda began back
> then (25 years ago being 1988), but it wasn't till 1991/92 that the spec
> was published. Also, the full Unicode range with multiple planes came
> about in 1996, with Unicode 2.0, so that could also be considered the
> beginning of Unicode. But that still means it's nearly old enough to
> drink, so programmers ought to be aware of it.

Yes, yes, a thousand times yes. It's really not that hard to get the
basics of Unicode.

"When I discovered that the popular web development tool PHP has almost
complete ignorance of character encoding issues, blithely using 8 bits
for characters, making it darn near impossible to develop good
international web applications, I thought, enough is enough. So I have
an announcement to make: if you are a programmer working in 2003 and you
don't know the basics of characters, character sets, encodings, and
Unicode, and I catch you, I'm going to punish you by making you peel
onions for 6 months in a submarine. I swear I will."

http://www.joelonsoftware.com/articles/Unicode.html

Also: http://nedbatchelder.com/text/unipain.html

To start with, if you're writing code for Python 2.x, and not using u''
for strings, then you're making a rod for your own back. Do yourself a
favour and get into the habit of always using u'' strings in Python 2.

I'll-start-taking-my-own-advice-next-week-I-promise-ly yrs,

--
Steven
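
A minimal sketch of the `u''` habit recommended above. It is written so that it should behave the same on Python 2.7 and on Python 3.3 or later, where the `u` prefix is accepted again; the sample word is arbitrary.

```python
text = u"caf\xe9"              # a text string: four characters
data = text.encode("utf-8")    # a byte string: the last character takes two bytes

print(len(text), len(data))

# Decoding with the same encoding gives back the original text.
print(data.decode("utf-8") == text)
```

Keeping text as `u''` strings and encoding only at the boundary (files, sockets, terminals) makes the character count and the byte count two separate, explicit things.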
| From | Joshua Landau <joshua.landau.ws@gmail.com> |
|---|---|
| Date | 2013-07-08 23:02 +0100 |
| Message-ID | <mailman.4406.1373321026.3114.python-list@python.org> |
| In reply to | #50165 |
On 8 July 2013 22:38, MRAB <python@mrabarnett.plus.com> wrote:
> On 08/07/2013 21:56, Dave Angel wrote:
>> Characters do not have a width.
>
> [snip]
>
> It depends what you mean by "width"! :-)
>
> Try this (Python 3):
>
>>>> print("A\N{FULLWIDTH LATIN CAPITAL LETTER A}")
> AＡ
Serious question: How would one find the width of a character by that
definition?
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2013-07-08 18:45 -0400 |
| Message-ID | <mailman.4407.1373323563.3114.python-list@python.org> |
| In reply to | #50165 |
On 07/08/2013 05:49 PM, Chris Angelico wrote:
> On Tue, Jul 9, 2013 at 6:56 AM, Dave Angel <davea@davea.name> wrote:
>> But Unicode has nothing to do with Guido, and it has existed for about 25
>> years (if I recall correctly).
>
> Depends how you measure. According to [1], the work kinda began back
> then (25 years ago being 1988), but it wasn't till 1991/92 that the
> spec was published. Also, the full Unicode range with multiple planes
> came about in 1996, with Unicode 2.0, so that could also be considered
> the beginning of Unicode. But that still means it's nearly old enough
> to drink, so programmers ought to be aware of it.
>

Well, then I'm glad I stuck the qualifier on it. I remember where I was
working, and that company folded in 1992. I was working on NT long
before its official release in 1993, and it used Unicode, even if the
spec was sliding along. I'm sure I got unofficial versions of things
through Microsoft, at the time.

--
DaveA
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-07-09 08:51 +1000 |
| Message-ID | <mailman.4408.1373323885.3114.python-list@python.org> |
| In reply to | #50165 |
On Tue, Jul 9, 2013 at 8:45 AM, Dave Angel <davea@davea.name> wrote:
> On 07/08/2013 05:49 PM, Chris Angelico wrote:
>> On Tue, Jul 9, 2013 at 6:56 AM, Dave Angel <davea@davea.name> wrote:
>>> But Unicode has nothing to do with Guido, and it has existed for about 25
>>> years (if I recall correctly).
>>
>> Depends how you measure. According to [1], the work kinda began back
>> then (25 years ago being 1988), but it wasn't till 1991/92 that the
>> spec was published. Also, the full Unicode range with multiple planes
>> came about in 1996, with Unicode 2.0, so that could also be considered
>> the beginning of Unicode. But that still means it's nearly old enough
>> to drink, so programmers ought to be aware of it.
>>
>
> Well, then I'm glad I stuck the qualifier on it. I remember where I was
> working, and that company folded in 1992. I was working on NT long before
> its official release in 1993, and it used Unicode, even if the spec was
> sliding along. I'm sure I got unofficial versions of things through
> Microsoft, at the time.

No doubt! Of course, this list is good at dealing with the hard facts
and making sure the archives are accurate, but that doesn't change
your memory.

Anyway, your fundamental point isn't materially affected by whether
Unicode is 17 or 25 years old. It's been around plenty long enough by
now, we should use it. Same with IPv6, too...

ChrisA
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2013-07-09 00:32 +0100 |
| Message-ID | <mailman.4409.1373326299.3114.python-list@python.org> |
| In reply to | #50165 |
On 08/07/2013 23:02, Joshua Landau wrote:
> On 8 July 2013 22:38, MRAB <python@mrabarnett.plus.com> wrote:
>> On 08/07/2013 21:56, Dave Angel wrote:
>>> Characters do not have a width.
>>
>> [snip]
>>
>> It depends what you mean by "width"! :-)
>>
>> Try this (Python 3):
>>
>>>>> print("A\N{FULLWIDTH LATIN CAPITAL LETTER A}")
>> AＡ
>
> Serious question: How would one find the width of a character by that
> definition?
>
>>> import unicodedata
>>> unicodedata.east_asian_width("A")
'Na'
>>> unicodedata.east_asian_width("\N{FULLWIDTH LATIN CAPITAL LETTER A}")
'F'
The possible widths are:
N = Neutral
A = Ambiguous
H = Halfwidth
W = Wide
F = Fullwidth
Na = Narrow
All you then need to do is find out what those actually mean...
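
One common reading of those categories, offered as a sketch rather than a definitive rule: treat Wide and Fullwidth as two terminal columns and everything else as one. `display_width` is a made-up name; the sketch counts Ambiguous characters as one column and does not account for combining marks or control characters.

```python
import unicodedata

def display_width(text):
    """Estimate terminal columns: 'W' and 'F' count as 2, all others as 1."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )

fullwidth_a = "\N{FULLWIDTH LATIN CAPITAL LETTER A}"
print(display_width("A"), display_width(fullwidth_a), display_width("A" + fullwidth_a))
```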