
hex dump w/ or w/out utf-8 chars

Started by	blatt <ferdy.blatsco@gmail.com>
First post	2013-07-07 17:22 -0700
Last post	2013-07-13 04:51 +0000
Articles	20 on this page of 49 — 15 participants


  hex dump w/ or w/out utf-8 chars blatt <ferdy.blatsco@gmail.com> - 2013-07-07 17:22 -0700
    Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-08 11:17 +1000
    Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-08 05:48 +0000
    Re: hex dump w/ or w/out utf-8 chars ferdy.blatsco@gmail.com - 2013-07-08 10:31 -0700
      Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-09 03:52 +1000
        Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-11 06:18 -0700
          Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-11 23:32 +1000
            Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-11 11:42 -0700
              Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-11 11:44 -0700
              Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-12 03:18 +0000
                Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-12 14:42 -0700
              Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-12 12:16 +1000
                Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-13 00:56 -0700
                  Re: hex dump w/ or w/out utf-8 chars Lele Gaifax <lele@metapensiero.it> - 2013-07-13 10:24 +0200
                  Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-13 09:36 +0000
                  Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-13 19:46 +1000
                  Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-13 09:49 +0000
                    Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-13 20:09 +1000
                    Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-13 07:37 -0700
                      Re: hex dump w/ or w/out utf-8 chars Dave Angel <davea@davea.name> - 2013-07-13 15:02 -0400
                        Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-14 01:20 -0700
                          Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-14 10:44 +0000
                            Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-14 06:44 -0700
                              Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-24 06:28 -0700
                      Re: hex dump w/ or w/out utf-8 chars Neil Hodgson <nhodgson@iinet.net.au> - 2013-07-14 09:17 +1000
    Re: hex dump w/ or w/out utf-8 chars ferdy.blatsco@gmail.com - 2013-07-08 10:53 -0700
      Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-09 04:07 +1000
      Re: hex dump w/ or w/out utf-8 chars Dave Angel <davea@davea.name> - 2013-07-08 16:56 -0400
        Re: hex dump w/ or w/out utf-8 chars Neil Cerutti <neilc@norwich.edu> - 2013-07-09 12:22 +0000
          Re: hex dump w/ or w/out utf-8 chars Dave Angel <davea@davea.name> - 2013-07-09 08:54 -0400
            Re: hex dump w/ or w/out utf-8 chars Neil Cerutti <neilc@norwich.edu> - 2013-07-09 13:00 +0000
              Re: hex dump w/ or w/out utf-8 chars Skip Montanaro <skip@pobox.com> - 2013-07-09 08:18 -0500
              Re: hex dump w/ or w/out utf-8 chars Dave Angel <davea@davea.name> - 2013-07-09 09:23 -0400
      Re: hex dump w/ or w/out utf-8 chars MRAB <python@mrabarnett.plus.com> - 2013-07-08 22:38 +0100
      Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-09 07:49 +1000
        Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-09 06:53 +0000
      Re: hex dump w/ or w/out utf-8 chars Joshua Landau <joshua.landau.ws@gmail.com> - 2013-07-08 23:02 +0100
      Re: hex dump w/ or w/out utf-8 chars Dave Angel <davea@davea.name> - 2013-07-08 18:45 -0400
      Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-09 08:51 +1000
      Re: hex dump w/ or w/out utf-8 chars MRAB <python@mrabarnett.plus.com> - 2013-07-09 00:32 +0100
        Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-09 06:46 +0000
      Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-09 07:00 +0000
        Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-09 02:34 -0700
          Re: hex dump w/ or w/out utf-8 chars Chris “Kwpolska” Warrick <kwpolska@gmail.com> - 2013-07-09 12:15 +0200
            Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-09 16:32 +0000
              Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-10 01:52 -0700
          Re: hex dump w/ or w/out utf-8 chars Joshua Landau <joshua@landau.ws> - 2013-07-12 23:01 +0100
            Re: hex dump w/ or w/out utf-8 chars Tim Roberts <timr@probo.com> - 2013-07-12 20:42 -0700
            Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-13 04:51 +0000


#50634

From	wxjmfauth@gmail.com
Date	2013-07-14 01:20 -0700
Message-ID	<69df4d48-4cb8-4102-b80c-247f8fd07f65@googlegroups.com>
In reply to	#50611

On Saturday, 13 July 2013 21:02:24 UTC+2, Dave Angel wrote:
> On 07/13/2013 10:37 AM, wxjmfauth@gmail.com wrote:
>
> Fortunately for us, Python (in version 3.3 and later) and Pike did it
> right.  Some day the others may decide to do similarly.

-----------
Possible, but I doubt it.
For a very simple reason, the latin-1 block: considered
and accepted today as being a Unicode design mistake.

jmf


#50638

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-07-14 10:44 +0000
Message-ID	<51e280fc$0$9505$c3e8da3$5496439d@news.astraweb.com>
In reply to	#50634

On Sun, 14 Jul 2013 01:20:33 -0700, wxjmfauth wrote:

> For a very simple reason, the latin-1 block: considered and accepted
> today as being a Unicode design mistake.

Latin-1 (also known as ISO-8859-1) was based on DEC's "Multinational 
Character Set", which goes back to 1983. ISO-8859-1 was first published 
in 1985, and was in use on Commodore computers the same year.

The concept of Unicode wasn't even started until 1987, and the first 
draft wasn't published until the end of 1990. Unicode wasn't considered 
ready for production use until 1991, six years after Latin-1 was already 
in use in people's computers.

-- 
Steven


#50640

From	wxjmfauth@gmail.com
Date	2013-07-14 06:44 -0700
Message-ID	<ea260eab-4361-4378-b61b-d33224d2ff5d@googlegroups.com>
In reply to	#50638

On Sunday, 14 July 2013 12:44:12 UTC+2, Steven D'Aprano wrote:
> On Sun, 14 Jul 2013 01:20:33 -0700, wxjmfauth wrote:
>
> > For a very simple reason, the latin-1 block: considered and accepted
> > today as being a Unicode design mistake.
>
> Latin-1 (also known as ISO-8859-1) was based on DEC's "Multinational
> Character Set", which goes back to 1983. ISO-8859-1 was first published
> in 1985, and was in use on Commodore computers the same year.
>
> The concept of Unicode wasn't even started until 1987, and the first
> draft wasn't published until the end of 1990. Unicode wasn't considered
> ready for production use until 1991, six years after Latin-1 was already
> in use in people's computers.
>
> --
> Steven

------

"Unicode" (in fact iso-14xxx) was not created in one
night (Deus ex machina).

What counts today is this:

>>> import sys, timeit
>>> timeit.repeat("a = 'hundred'; 'x' in a")
[0.11785943134991479, 0.09850454944486256, 0.09761604599423179]
>>> timeit.repeat("a = 'hundreœ'; 'x' in a")
[0.23955250303158593, 0.2195812612416752, 0.22133896997401692]
>>> sys.getsizeof('d')
26
>>> sys.getsizeof('œ')
40
>>> sys.version
'3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)]'

jmf
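
A minimal sketch for reading the getsizeof figures above, assuming CPython 3.3 or later (the flexible string representation of PEP 393). The helper name bytes_per_char is hypothetical. It compares two long strings built from the same character, so the fixed per-object header cancels out and only the marginal cost per code point remains:

```python
import sys

def bytes_per_char(ch, n=1000):
    """Marginal storage per code point for a string made only of `ch`.

    Subtracting the sizes of two strings that differ by one character
    cancels the fixed object header (a CPython implementation detail).
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

samples = [
    ("ASCII", "d"),
    ("Latin-1", "\u00e9"),         # é
    ("BMP", "\u0153"),             # œ
    ("astral", "\U0001F600"),      # outside the BMP
]

for label, ch in samples:
    print(label, bytes_per_char(ch), "byte(s) per character")
```

On such an interpreter the single-character sizes quoted in the post are dominated by the object header, so they say little about the cost of long strings.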


#51130

From	wxjmfauth@gmail.com
Date	2013-07-24 06:28 -0700
Message-ID	<696caa4f-142a-4e46-88fc-090da94ced2e@googlegroups.com>
In reply to	#50640

I cannot find the thread where a Python core dev spoke
about French, so I'm putting this here.

This stupid Flexible String Representation splits Unicode
in chunks and one of these chunks is latin-1 (iso-8859-1).

If we consider that latin-1 is unusable for 17 (seventeen)
European languages based on the latin alphabet, one cannot
say Python is really well prepared.

Most of the problems come from the extensive usage of
diacritics in these languages. Thanks to the FSR again,
working with normalized forms does not work very well. At
least, there is some consistency.

Now, if we consider that most of the new characters will
be part of the BMP ("daily" used chars), it is hard to
present Python as a modern language. It sticks more
to the past and is not really prepared for the future,
such as the acceptance of new chars like ẞ or the new
Turkish lira sign (U+20BA).

>>> sys.getsizeof('š')
40
>>> sys.getsizeof('0')
26

14 bytes to encode a non-latin-1 char is not so bad.


jmf
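
For reference, a short sketch of what "normalized forms" means for a character such as 'š', using only the standard unicodedata module. It shows the composed (NFC) and decomposed (NFD) forms and makes no claim about their performance under the FSR:

```python
import unicodedata

composed = "\u0161"        # š as one precomposed code point
decomposed = "s\u030c"     # s followed by COMBINING CARON

# NFC merges base letter + combining mark where a precomposed form exists;
# NFD splits a precomposed character into base letter + combining mark.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(len(nfc), [hex(ord(c)) for c in nfc])
print(len(nfd), [hex(ord(c)) for c in nfd])
print(nfc == composed, nfd == decomposed)
```

Comparing strings only after normalizing both to the same form is the usual way to avoid mismatches between the two spellings.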


#50618

From	Neil Hodgson <nhodgson@iinet.net.au>
Date	2013-07-14 09:17 +1000
Message-ID	<KP6dnXNvYYEDfXzMnZ2dnUVZ_qCdnZ2d@westnet.com.au>
In reply to	#50596

wxjmfauth@gmail.com:

> The FSR is naive and badly working. I can not force people
> to understand the coding of the characters [*].

    You could at least *try*.

    If there really was a problem with the FSR and you truly understood 
this problem then surely you would be able to communicate the problem to 
at least one person on the list.

    Neil


#50165

From	ferdy.blatsco@gmail.com
Date	2013-07-08 10:53 -0700
Message-ID	<7b6fc645-8bf3-4681-821c-38fb1fa1d191@googlegroups.com>
In reply to	#50110

Hi Steven,

thank you for your reply... I really needed another python guru who
is also an English teacher! Sorry if English is not my mother tongue...
"uncorrect" instead of "incorrect" (I misapplied the "similarity
principle" like "unpleasant...>...uncorrect").

Apart from these trifles, you said:
>> All characters are UTF-8, characters. "a" is a UTF-8 character. So is "ă".
Not using python 3, for me (a programmer who was present at the beginning of
computer science, badly interacting with many languages from assembler to
Fortran and from c to Pascal and so on) it was a hard job to handle the
abrupt transition from characters only equal to bytes to some special
characters defined with 2, 3 bytes and even more.
I would have preferred another solution... but I'm not Guido....!

I said:
> in the first version the utf-8 conversion to hex was shown horizontally
And you replied:
>> Oh! We're supposed to read the output *downwards*! 
You are correct, but I was only referring to "special characters"...
My main concern was compactness of output and besides that every group of
bytes used for defining "special characters" is well represented with high
nibble in the range outside ascii 0-127.

Your following observations are connected more or less to the above point
and sorry if the interpretation of output... sucks!
I think that, for the interested user, the whole question is of minor
importance.

Only one other point is relevant for me:
>> The loop variable just gets reset once it reaches the top of the loop
>> again.
Apart from your kind observation (... "hideously ugly to read") referring to
my code snippet incrementing the loop variable... you are correct.
I will never make the same mistake again!

Bye, Blatt.


#50167

From	Chris Angelico <rosuav@gmail.com>
Date	2013-07-09 04:07 +1000
Message-ID	<mailman.4393.1373306845.3114.python-list@python.org>
In reply to	#50165

On Tue, Jul 9, 2013 at 3:53 AM,  <ferdy.blatsco@gmail.com> wrote:
>>> All characters are UTF-8, characters. "a" is a UTF-8 character. So is "ă".
> Not using python 3, for me (a programmer who was present at the beginning of
> computer science, badly interacting with many languages from assembler to
> Fortran and from c to Pascal and so on) it was a hard job to handle the
> abrupt transition from characters only equal to bytes to some special
> characters defined with 2, 3 bytes and even more.

Even back then, bytes and characters were different. 'A' is a
character, 0x41 is a byte. And they correspond 1:1 if and only if you
know that your characters are represented in ASCII. Other encodings
(eg EBCDIC) mapped things differently. The only difference now is that
more people are becoming aware that there are more than 256 characters
in the world.
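
A small sketch of that point, assuming Python 3 and its bundled cp037 codec (one of several EBCDIC code pages). The same character maps to different bytes under different encodings, and the same byte maps to different characters:

```python
ch = "A"

ascii_bytes = ch.encode("ascii")    # the byte 0x41
ebcdic_bytes = ch.encode("cp037")   # a different byte under EBCDIC

print(ascii_bytes.hex(), ebcdic_bytes.hex())

# Reading the ASCII byte as if it were EBCDIC does not give back 'A'.
misread = ascii_bytes.decode("cp037")
print(repr(misread))
```

The 1:1 correspondence between 'A' and 0x41 holds only once the encoding is agreed on.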

Like Magic 2014 and its treatment of Slivers, at some point you're
going to have to master the difference between bytes and characters,
or else be eternally hacking around stuff in your code, so now is as
good a time as any.

ChrisA


#50171

From	Dave Angel <davea@davea.name>
Date	2013-07-08 16:56 -0400
Message-ID	<mailman.4397.1373317033.3114.python-list@python.org>
In reply to	#50165

On 07/08/2013 01:53 PM, ferdy.blatsco@gmail.com wrote:
> Hi Steven,
>
> thank you for your reply... I really needed another python guru who
> is also an English teacher! Sorry if English is not my mother tongue...
> "uncorrect" instead of "incorrect" (I misapplied the "similarity
> principle" like "unpleasant...>...uncorrect").
>
> Apart from these trifles, you said:
>>> All characters are UTF-8, characters. "a" is a UTF-8 character. So is "ă".
> Not using python 3, for me (a programmer who was present at the beginning of
> computer science, badly interacting with many languages from assembler to
> Fortran and from c to Pascal and so on) it was a hard job to handle the
> abrupt transition from characters only equal to bytes to some special
> characters defined with 2, 3 bytes and even more.

Characters do not have a width.  They are Unicode code points, an 
abstraction.  It's only when you encode them in byte strings that a code 
point takes on any specific width.  And some encodings go to one-byte 
strings (and get errors for characters that don't match), some go to 
two-bytes each, some variable, etc.
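
A minimal sketch of this, assuming Python 3. The helper name encoded_widths is hypothetical. It encodes one code point under several encodings and reports how many bytes each produces, with None where the encoding cannot represent the character:

```python
def encoded_widths(ch, encodings=("latin-1", "utf-8", "utf-16-le", "utf-32-le")):
    """Return {encoding: byte count} for one character, None if unencodable."""
    widths = {}
    for name in encodings:
        try:
            widths[name] = len(ch.encode(name))
        except UnicodeEncodeError:
            # This encoding has no byte sequence for the character.
            widths[name] = None
    return widths

print(encoded_widths("a"))
print(encoded_widths("\u0103"))   # ă
```

The width belongs to the pair (code point, encoding), not to the code point alone.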

> I would have preferred another solution... but I'm not Guido....!

But Unicode has nothing to do with Guido, and it has existed for about 
25 years (if I recall correctly).  It's only that Python 3 is finally 
embracing it, and making it the default type for characters, as it 
should be.  As far as I'm concerned, the only reason it shouldn't have 
been done long ago was that programs were trying to fit on 640k DOS 
machines.  Even before Unicode, there were multi-byte encodings around 
(eg. Microsoft's MBCS), and each was thoroughly incompatible with all 
the others.  And the problem with one-byte encodings is that if you need 
to use a Greek currency symbol in a document that's mostly Norwegian (or 
some such combination of characters), there might not be ANY valid way 
to do it within a single "character set."

Python 2 supports all the same Unicode features as 3;  it's just that it 
defaults to byte strings.  So it's HARDER to get it right.

Except for special purpose programs like a file dumper, it's usually 
unnecessary for a Python 3 programmer to deal with individual bytes from 
a byte string.  Text files are a bunch of bytes, and somebody has to 
interpret them as characters.  If you let open() handle it, and if you 
give it the correct encoding, it just works.  Internally, all strings 
are Unicode, and you don't care where they came from, or what human 
language they may have characters from.  You can combine strings from 
multiple places, without much worry that they might interfere.
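
A sketch of letting open() do the decoding, assuming Python 3. The file name and sample text are made up for illustration. The text is written and read back with an explicit encoding, then the same file is read in binary mode to show the underlying bytes:

```python
import os
import tempfile

text = "Norwegian å, Greek δ, Romanian ă"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")   # hypothetical file name

    # Text mode: open() encodes on write and decodes on read.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()

    # Binary mode: the raw bytes that are actually on disk.
    with open(path, "rb") as f:
        raw = f.read()

print(round_trip == text)
print(len(text), "characters,", len(raw), "bytes")
```

Passing encoding explicitly avoids depending on the platform default, which differs between systems.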

Windows NT/2000/XP/Vista/7 has used Unicode for its file system (NTFS) 
from the beginning (approx 1992), and has had Unicode versions of each 
of its APIs for nearly as long.

I appreciate you've been around a long time, and worked in a lot of 
languages.  I've programmed professionally in at least 35 languages 
since 1967.  But we've come a long way from the 6bit characters I used 
in 1968.  At that time, we packed them 10 characters to each word.

-- 
DaveA


#50237

From	Neil Cerutti <neilc@norwich.edu>
Date	2013-07-09 12:22 +0000
Message-ID	<b42dk9F56csU3@mid.individual.net>
In reply to	#50171

On 2013-07-08, Dave Angel <davea@davea.name> wrote:
> I appreciate you've been around a long time, and worked in a
> lot of languages.  I've programmed professionally in at least
> 35 languages since 1967.  But we've come a long way from the
> 6bit characters I used in 1968.  At that time, we packed them
> 10 characters to each word.

One of the first Python projects I undertook was a program to dump
the ZSCII strings from Infocom game files. They are mostly packed
one character per 5 bits, with escapes to (I had to recheck the
Z-machine spec) latin-1. Oh, those clever implementors: thwarting
hexdumping cheaters and cramming their games onto microcomputers
with one blow.
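
A rough sketch of 5-bit packing in general, not a ZSCII decoder. The layout (three 5-bit codes per 16-bit word, top bit marking the last word, padding with code 5) is an assumption loosely modeled on descriptions of Z-machine text, and the names pack5 and unpack5 are hypothetical:

```python
def pack5(codes, pad=5):
    """Pack 5-bit codes three per 16-bit word; bit 15 flags the final word."""
    codes = list(codes)
    if any(not 0 <= c < 32 for c in codes):
        raise ValueError("codes must fit in 5 bits")
    # Pad to a multiple of three (the pad value is an assumption).
    while len(codes) % 3:
        codes.append(pad)
    words = []
    for i in range(0, len(codes), 3):
        a, b, c = codes[i:i + 3]
        words.append((a << 10) | (b << 5) | c)
    if words:
        words[-1] |= 0x8000   # end-of-string marker
    return words

def unpack5(words):
    """Reverse pack5, stopping after the word that has bit 15 set."""
    codes = []
    for word in words:
        codes.extend([(word >> 10) & 0x1F, (word >> 5) & 0x1F, word & 0x1F])
        if word & 0x8000:
            break
    return codes

packed = pack5([6, 7, 8, 9])
print([hex(w) for w in packed], unpack5(packed))
```

Three characters in two bytes is a one-third saving over one byte per character, and the packed text no longer shows up as readable strings in a plain hex dump.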

-- 
Neil Cerutti


#50238

From	Dave Angel <davea@davea.name>
Date	2013-07-09 08:54 -0400
Message-ID	<mailman.4447.1373374482.3114.python-list@python.org>
In reply to	#50237

On 07/09/2013 08:22 AM, Neil Cerutti wrote:
> On 2013-07-08, Dave Angel <davea@davea.name> wrote:
>> I appreciate you've been around a long time, and worked in a
>> lot of languages.  I've programmed professionally in at least
>> 35 languages since 1967.  But we've come a long way from the
>> 6bit characters I used in 1968.  At that time, we packed them
>> 10 characters to each word.
>
> One of the first Python projects I undertook was a program to dump
> the ZSCII strings from Infocom game files. They are mostly packed
> one character per 5 bits, with escapes to (I had to recheck the
> Z-machine spec) latin-1. Oh, those clever implementors: thwarting
> hexdumping cheaters and cramming their games onto microcomputers
> with one blow.
>

In 1973 I played with encoding some data that came over the public 
airwaves (I never learned the specific radio technology, probably used 
sidebands of FM stations). The data was encoded, with most characters 
taking 5 bits, and the decoded stream was like a ticker-tape.  With some 
hardware and the right software, you could track Wall Street in real 
time.  (Or maybe it had the usual 15 minute delay).

Obviously, they didn't publish the spec any place. But some others had 
the beginnings of a decoder, and I expanded on that.  We never did 
anything with it, it was just an interesting challenge.

-- 
DaveA


#50239

From	Neil Cerutti <neilc@norwich.edu>
Date	2013-07-09 13:00 +0000
Message-ID	<b42frvF5qscU1@mid.individual.net>
In reply to	#50238

On 2013-07-09, Dave Angel <davea@davea.name> wrote:
>> One of the first Python projects I undertook was a program to
>> dump the ZSCII strings from Infocom game files. They are
>> mostly packed one character per 5 bits, with escapes to (I had
>> to recheck the Z-machine spec) latin-1. Oh, those clever
>> implementors: thwarting hexdumping cheaters and cramming their
>> games onto microcomputers with one blow.
>
> In 1973 I played with encoding some data that came over the
> public airwaves (I never learned the specific radio technology,
> probably used sidebands of FM stations). The data was encoded,
> with most characters taking 5 bits, and the decoded stream was
> like a ticker-tape.  With some hardware and the right software,
> you could track Wall Street in real time.  (Or maybe it had the
> usual 15 minute delay).
>
> Obviously, they didn't publish the spec any place. But some
> others had the beginnings of a decoder, and I expanded on that.
> We never did anything with it, it was just an interesting
> challenge.

Interestingly similar scheme. I wonder if 5-bit chars was a
common compression scheme. The Z-machine spec was never
officially published either. I believe a "task force" reverse
engineered it sometime in the 90's.

-- 
Neil Cerutti


#50240

From	Skip Montanaro <skip@pobox.com>
Date	2013-07-09 08:18 -0500
Message-ID	<mailman.4448.1373375918.3114.python-list@python.org>
In reply to	#50239

> I wonder if 5-bit chars was a
> common compression scheme.

http://en.wikipedia.org/wiki/List_of_binary_codes

Baudot was pretty common, as I recall, though ASCII and EBCDIC ruled
by the time I started punching cards.

Skip


#50242

From	Dave Angel <davea@davea.name>
Date	2013-07-09 09:23 -0400
Message-ID	<mailman.4449.1373376257.3114.python-list@python.org>
In reply to	#50239

On 07/09/2013 09:00 AM, Neil Cerutti wrote:

    <SNIP>
> Interestingly similar scheme. I wonder if 5-bit chars was a
> common compression scheme. The Z-machine spec was never
> officially published either. I believe a "task force" reverse
> engineered it sometime in the 90's.
>

Baudot was 5 bits.  It used shift-codes to get upper case and digits, if 
I recall.

And ASCII was 7 bits so there could be one more for parity.
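
A minimal sketch of the parity idea, assuming even parity stored in the eighth (top) bit. Real links also used odd parity or other conventions, and the function name is hypothetical:

```python
def add_even_parity(code):
    """Return a 7-bit code with bit 7 set so the total number of 1 bits is even."""
    if not 0 <= code < 128:
        raise ValueError("expected a 7-bit value")
    ones = bin(code).count("1")
    return code | ((ones % 2) << 7)

for ch in "AC":
    print(ch, format(add_even_parity(ord(ch)), "08b"))
```

A receiver that counts an odd number of 1 bits in a byte knows that at least one bit was corrupted in transit.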

-- 
DaveA


#50178

From	MRAB <python@mrabarnett.plus.com>
Date	2013-07-08 22:38 +0100
Message-ID	<mailman.4404.1373319468.3114.python-list@python.org>
In reply to	#50165

On 08/07/2013 21:56, Dave Angel wrote:
> On 07/08/2013 01:53 PM, ferdy.blatsco@gmail.com wrote:
>> Hi Steven,
>>
>> thank you for your reply... I really needed another python guru who
>> is also an English teacher! Sorry if English is not my mother tongue...
>> "uncorrect" instead of "incorrect" (I misapplied the "similarity
>> principle" like "unpleasant...>...uncorrect").
>>
>> Apart from these trifles, you said:
>>>> All characters are UTF-8, characters. "a" is a UTF-8 character. So is "ă".
>> Not using python 3, for me (a programmer who was present at the beginning of
>> computer science, badly interacting with many languages from assembler to
>> Fortran and from c to Pascal and so on) it was a hard job to handle the
>> abrupt transition from characters only equal to bytes to some special
>> characters defined with 2, 3 bytes and even more.
>
> Characters do not have a width.
[snip]

It depends what you mean by "width"! :-)

Try this (Python 3):

 >>> print("A\N{FULLWIDTH LATIN CAPITAL LETTER A}")
AＡ


#50179

From	Chris Angelico <rosuav@gmail.com>
Date	2013-07-09 07:49 +1000
Message-ID	<mailman.4405.1373320188.3114.python-list@python.org>
In reply to	#50165

On Tue, Jul 9, 2013 at 6:56 AM, Dave Angel <davea@davea.name> wrote:
> But Unicode has nothing to do with Guido, and it has existed for about 25
> years (if I recall correctly).

Depends how you measure. According to [1], the work kinda began back
then (25 years ago being 1988), but it wasn't till 1991/92 that the
spec was published. Also, the full Unicode range with multiple planes
came about in 1996, with Unicode 2.0, so that could also be considered
the beginning of Unicode. But that still means it's nearly old enough
to drink, so programmers ought to be aware of it.

[1] http://en.wikipedia.org/wiki/Unicode#History

ChrisA


#50213

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-07-09 06:53 +0000
Message-ID	<51dbb372$0$6512$c3e8da3$5496439d@news.astraweb.com>
In reply to	#50179

On Tue, 09 Jul 2013 07:49:45 +1000, Chris Angelico wrote:

> On Tue, Jul 9, 2013 at 6:56 AM, Dave Angel <davea@davea.name> wrote:
>> But Unicode has nothing to do with Guido, and it has existed for about
>> 25 years (if I recall correctly).
> 
> Depends how you measure. According to [1], the work kinda began back
> then (25 years ago being 1988), but it wasn't till 1991/92 that the spec
> was published. Also, the full Unicode range with multiple planes came
> about in 1996, with Unicode 2.0, so that could also be considered the
> beginning of Unicode. But that still means it's nearly old enough to
> drink, so programmers ought to be aware of it.

Yes, yes, a thousand times yes. It's really not that hard to get the 
basics of Unicode.

"When I discovered that the popular web development tool PHP has almost 
complete ignorance of character encoding issues, blithely using 8 bits 
for characters, making it darn near impossible to develop good 
international web applications, I thought, enough is enough.

So I have an announcement to make: if you are a programmer working in 
2003 and you don't know the basics of characters, character sets, 
encodings, and Unicode, and I catch you, I'm going to punish you by 
making you peel onions for 6 months in a submarine. I swear I will."

http://www.joelonsoftware.com/articles/Unicode.html

Also: http://nedbatchelder.com/text/unipain.html

To start with, if you're writing code for Python 2.x, and not using u'' 
for strings, then you're making a rod for your own back. Do yourself a 
favour and get into the habit of always using u'' strings in Python 2.

I'll-start-taking-my-own-advice-next-week-I-promise-ly yrs,

-- 
Steven


#50180

From	Joshua Landau <joshua.landau.ws@gmail.com>
Date	2013-07-08 23:02 +0100
Message-ID	<mailman.4406.1373321026.3114.python-list@python.org>
In reply to	#50165

On 8 July 2013 22:38, MRAB <python@mrabarnett.plus.com> wrote:
> On 08/07/2013 21:56, Dave Angel wrote:
>> Characters do not have a width.
>
> [snip]
>
> It depends what you mean by "width"! :-)
>
> Try this (Python 3):
>
>>>> print("A\N{FULLWIDTH LATIN CAPITAL LETTER A}")
> AＡ

Serious question: How would one find the width of a character by that
definition?


#50182

From	Dave Angel <davea@davea.name>
Date	2013-07-08 18:45 -0400
Message-ID	<mailman.4407.1373323563.3114.python-list@python.org>
In reply to	#50165

On 07/08/2013 05:49 PM, Chris Angelico wrote:
> On Tue, Jul 9, 2013 at 6:56 AM, Dave Angel <davea@davea.name> wrote:
>> But Unicode has nothing to do with Guido, and it has existed for about 25
>> years (if I recall correctly).
>
> Depends how you measure. According to [1], the work kinda began back
> then (25 years ago being 1988), but it wasn't till 1991/92 that the
> spec was published. Also, the full Unicode range with multiple planes
> came about in 1996, with Unicode 2.0, so that could also be considered
> the beginning of Unicode. But that still means it's nearly old enough
> to drink, so programmers ought to be aware of it.
>

Well, then I'm glad I stuck the qualifier on it.  I remember where I was 
working, and that company folded in 1992.  I was working on NT long 
before its official release in 1993, and it used Unicode, even if the 
spec was sliding along.  I'm sure I got unofficial versions of things 
through Microsoft, at the time.

-- 
DaveA


#50183

From	Chris Angelico <rosuav@gmail.com>
Date	2013-07-09 08:51 +1000
Message-ID	<mailman.4408.1373323885.3114.python-list@python.org>
In reply to	#50165

On Tue, Jul 9, 2013 at 8:45 AM, Dave Angel <davea@davea.name> wrote:
> On 07/08/2013 05:49 PM, Chris Angelico wrote:
>>
>> On Tue, Jul 9, 2013 at 6:56 AM, Dave Angel <davea@davea.name> wrote:
>>>
>>> But Unicode has nothing to do with Guido, and it has existed for about 25
>>> years (if I recall correctly).
>>
>>
>> Depends how you measure. According to [1], the work kinda began back
>> then (25 years ago being 1988), but it wasn't till 1991/92 that the
>> spec was published. Also, the full Unicode range with multiple planes
>> came about in 1996, with Unicode 2.0, so that could also be considered
>> the beginning of Unicode. But that still means it's nearly old enough
>> to drink, so programmers ought to be aware of it.
>>
>
> Well, then I'm glad I stuck the qualifier on it.  I remember where I was
> working, and that company folded in 1992.  I was working on NT long before
> its official release in 1993, and it used Unicode, even if the spec was
> sliding along.  I'm sure I got unofficial versions of things through
> Microsoft, at the time.

No doubt! Of course, this list is good at dealing with the hard facts
and making sure the archives are accurate, but that doesn't change
your memory.

Anyway, your fundamental point isn't materially affected by whether
Unicode is 17 or 25 years old. It's been around plenty long enough by
now, we should use it. Same with IPv6, too...

ChrisA


#50184

From	MRAB <python@mrabarnett.plus.com>
Date	2013-07-09 00:32 +0100
Message-ID	<mailman.4409.1373326299.3114.python-list@python.org>
In reply to	#50165

On 08/07/2013 23:02, Joshua Landau wrote:
> On 8 July 2013 22:38, MRAB <python@mrabarnett.plus.com> wrote:
>> On 08/07/2013 21:56, Dave Angel wrote:
>>> Characters do not have a width.
>>
>> [snip]
>>
>> It depends what you mean by "width"! :-)
>>
>> Try this (Python 3):
>>
>>>>> print("A\N{FULLWIDTH LATIN CAPITAL LETTER A}")
>> AＡ
>
> Serious question: How would one find the width of a character by that
> definition?
>
 >>> import unicodedata
 >>> unicodedata.east_asian_width("A")
'Na'
 >>> unicodedata.east_asian_width("\N{FULLWIDTH LATIN CAPITAL LETTER A}")
'F'

The possible widths are:

     N  = Neutral
     A  = Ambiguous
     H  = Halfwidth
     W  = Wide
     F  = Fullwidth
     Na = Narrow

All you then need to do is find out what those actually mean...
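
One common way to use those categories, sketched under the assumption that Wide and Fullwidth characters take two terminal columns and everything else takes one. The function name is hypothetical, and combining marks, Ambiguous characters and terminal quirks are ignored:

```python
import unicodedata

def display_width(text):
    """Approximate column count: 2 for Wide/Fullwidth characters, else 1."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )

sample = "A\N{FULLWIDTH LATIN CAPITAL LETTER A}"
print(len(sample), display_width(sample))
```

This is only an approximation; the actual rendering depends on the font and the terminal.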
