
hex dump w/ or w/out utf-8 chars

Started by: blatt <ferdy.blatsco@gmail.com>
First post: 2013-07-07 17:22 -0700
Last post: 2013-07-13 04:51 +0000
Articles: 20 on this page of 49, 15 participants



Contents

  hex dump w/ or w/out utf-8 chars blatt <ferdy.blatsco@gmail.com> - 2013-07-07 17:22 -0700
    Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-08 11:17 +1000
    Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-08 05:48 +0000
    Re: hex dump w/ or w/out utf-8 chars ferdy.blatsco@gmail.com - 2013-07-08 10:31 -0700
      Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-09 03:52 +1000
        Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-11 06:18 -0700
          Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-11 23:32 +1000
            Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-11 11:42 -0700
              Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-11 11:44 -0700
              Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-12 03:18 +0000
                Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-12 14:42 -0700
              Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-12 12:16 +1000
                Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-13 00:56 -0700
                  Re: hex dump w/ or w/out utf-8 chars Lele Gaifax <lele@metapensiero.it> - 2013-07-13 10:24 +0200
                  Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-13 09:36 +0000
                  Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-13 19:46 +1000
                  Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-13 09:49 +0000
                    Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-13 20:09 +1000
                    Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-13 07:37 -0700
                      Re: hex dump w/ or w/out utf-8 chars Dave Angel <davea@davea.name> - 2013-07-13 15:02 -0400
                        Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-14 01:20 -0700
                          Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-14 10:44 +0000
                            Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-14 06:44 -0700
                              Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-24 06:28 -0700
                      Re: hex dump w/ or w/out utf-8 chars Neil Hodgson <nhodgson@iinet.net.au> - 2013-07-14 09:17 +1000
    Re: hex dump w/ or w/out utf-8 chars ferdy.blatsco@gmail.com - 2013-07-08 10:53 -0700
      Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-09 04:07 +1000
      Re: hex dump w/ or w/out utf-8 chars Dave Angel <davea@davea.name> - 2013-07-08 16:56 -0400
        Re: hex dump w/ or w/out utf-8 chars Neil Cerutti <neilc@norwich.edu> - 2013-07-09 12:22 +0000
          Re: hex dump w/ or w/out utf-8 chars Dave Angel <davea@davea.name> - 2013-07-09 08:54 -0400
            Re: hex dump w/ or w/out utf-8 chars Neil Cerutti <neilc@norwich.edu> - 2013-07-09 13:00 +0000
              Re: hex dump w/ or w/out utf-8 chars Skip Montanaro <skip@pobox.com> - 2013-07-09 08:18 -0500
              Re: hex dump w/ or w/out utf-8 chars Dave Angel <davea@davea.name> - 2013-07-09 09:23 -0400
      Re: hex dump w/ or w/out utf-8 chars MRAB <python@mrabarnett.plus.com> - 2013-07-08 22:38 +0100
      Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-09 07:49 +1000
        Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-09 06:53 +0000
      Re: hex dump w/ or w/out utf-8 chars Joshua Landau <joshua.landau.ws@gmail.com> - 2013-07-08 23:02 +0100
      Re: hex dump w/ or w/out utf-8 chars Dave Angel <davea@davea.name> - 2013-07-08 18:45 -0400
      Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-09 08:51 +1000
      Re: hex dump w/ or w/out utf-8 chars MRAB <python@mrabarnett.plus.com> - 2013-07-09 00:32 +0100
        Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-09 06:46 +0000
      Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-09 07:00 +0000
        Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-09 02:34 -0700
          Re: hex dump w/ or w/out utf-8 chars Chris “Kwpolska” Warrick <kwpolska@gmail.com> - 2013-07-09 12:15 +0200
            Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-09 16:32 +0000
              Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-10 01:52 -0700
          Re: hex dump w/ or w/out utf-8 chars Joshua Landau <joshua@landau.ws> - 2013-07-12 23:01 +0100
            Re: hex dump w/ or w/out utf-8 chars Tim Roberts <timr@probo.com> - 2013-07-12 20:42 -0700
            Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-13 04:51 +0000



#50110 — hex dump w/ or w/out utf-8 chars

From: blatt <ferdy.blatsco@gmail.com>
Date: 2013-07-07 17:22 -0700
Subject: hex dump w/ or w/out utf-8 chars
Message-ID: <a35609c1-e56f-4180-8176-4405264da0a2@googlegroups.com>

Hi all,
and a particular hello to Chris Angelico, whose criticism and
suggestions pushed me to fully revise my hex dump application for
text containing utf-8 chars.
If you are not using Python 3, the utf-8 codec can add further programming
problems, especially if you are not a guru....
The script looks very long, but only because I commented too much ... sorry.
It is very useful (at least IMHO...)
It works under Linux, but there is still a little problem which I haven't
solved (at least programmatically...).


# -*- coding: utf-8 -*-
# px.py vers. 11 (pxb.py)   # python 2.6.6
# hex-dump w/ or w/out utf-8 chars
# Using spaces as separators, this script shows
# (better than tabnanny)  uncorrect  indentations.

# to save output > python pxb.py hex.txt > px9_out_hex.txt

nLenN=3          # n. of digits for lines

# version almost thoroughly rewritten on the ground of
# the critics and modifications suggested by Chris Angelico

# in the first version the utf-8 conversion to hex was shown horizontaly:

# 005 # qwerty: non è unicode bensì ascii
#     2 7767773 666 ca 7666666 6667ca 676660
#     3 175249a efe 38 5e93f45 25e33c 13399a

# ... but I had to insert additional chars to keep the
#     synchronization between the literal and the hex part

# 005 # qwerty: non è. unicode bensì. ascii
#     2 7767773 666 ca 7666666 6667ca 676660
#     3 175249a efe 38 5e93f45 25e33c 13399a

# in the second version I followed Chris suggestion:
# "to show the hex utf-8 vertically"

# 005 # qwerty: non è unicode bensì ascii
#     2 7767773 666 c 7666666 6667c 676660
#     3 175249a efe 3 5e93f45 25e33 13399a
#                   a             a
#                   8             c

# between the two solutions, I selected the first one + syncronization,
#     which seems more compact and easier to program (... I'm lazy...)

# various run options:
# std      :             python px.py file
# bash cat : cat  file | python px.py (alias hex)
# bash echo: echo line | python px.py    "    "

# works on any n. of bytes for utf-8

# For the user: it is helpful to have in a separate file
# all special characters of interest, together with their names.

# error:

# echo '345"789"'|hex    > 345"789"              345"789"
#                          33323332  instead of  333233320
#                          3452789 a    "    "   34527892a

# ... correction: avoiding "\n at end of test-line
# echo "345'789'"|hex   >  345'789'
#                          333233320
#                          34577897a

# same error in every run option

# If someone can solve this bug...

###################


import fileinput
import sys, commands

lF=[]                           # input file as list
for line in fileinput.input():  # handles all the details of args-or-stdin
    lF.append(line)
sSpacesXLN = ' ' * (nLenN+1)


for n in xrange(len(lF)):
    sLineHexND=lF[n].encode('hex')     # ND = no delimiter (space)
    sLineHex  =lF[n].encode('hex').replace('20','  ')
    sLineHexH =sLineHex[::2]
    sLineHexL =sLineHex[1::2]

    sSynchro=''
    for k in xrange(0,len(sLineHexND),2):
        if sLineHexND[k]<'8':
            sSynchro+= sLineHexND[k]+sLineHexND[k+1]
            k+=1
        elif sLineHexND[k]=='c':
            sSynchro+='c'+sLineHexND[k+1]+sLineHexND[k+2]+sLineHexND[k+3]+'2e'
            k+=3
        elif sLineHexND[k]=='e':
            sSynchro+='e'+sLineHexND[k+1]+sLineHexND[k+2]+sLineHexND[k+3]+\
                          sLineHexND[k+4]+sLineHexND[k+5]+'2e2e'
            k+=5

    # text output (synchroinized)
    print str(n+1).zfill(nLenN)+' '+sSynchro.decode('hex'),
    print sSpacesXLN + sLineHexH
    print sSpacesXLN + sLineHexL+ '\n'
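
For readers on Python 3, here is a minimal sketch of the two-row layout the comments describe (high nibbles on one row, low nibbles on the next). It is not the px.py script above: the name `nibble_rows` is made up for illustration, and it assumes the input is text that can be encoded as UTF-8. It works byte by byte rather than on a hex string.

```python
def nibble_rows(line):
    """Return (high, low) nibble rows for the UTF-8 bytes of `line`.

    Hypothetical helper, not part of px.py. Space bytes (0x20) are shown
    as blanks in both rows so that word boundaries stay visible.
    """
    data = line.encode('utf-8')
    high = ''.join(' ' if b == 0x20 else format(b >> 4, 'x') for b in data)
    low = ''.join(' ' if b == 0x20 else format(b & 0x0F, 'x') for b in data)
    return high, low


if __name__ == '__main__':
    # The test line from the comments, including its trailing newline.
    high, low = nibble_rows('345"789"\n')
    print(high)
    print(low)
```

Because each byte is handled on its own, a digit pair that straddles two bytes is never mistaken for a space.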


If the output is hard to read, probably because of fonts, the best
thing is to open it in an editor with a monospaced font...

As I already told Chris... criticism is welcome!

Bye, Blatt.











#50113

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-07-08 11:17 +1000
Message-ID: <mailman.4362.1373246245.3114.python-list@python.org>
In reply to: #50110

On Mon, Jul 8, 2013 at 10:22 AM, blatt <ferdy.blatsco@gmail.com> wrote:
> Hi all,
> but a particular hello to Chris Angelino which with their critics and
> suggestions pushed me to make a full revision of my application on
> hex dump in presence of utf-8 chars.

Hiya! Glad to have been of assistance :)

> As I already told to Chris... critics are welcome!

No problem.

> # -*- coding: utf-8 -*-
> # px.py vers. 11 (pxb.py)   # python 2.6.6
> # hex-dump w/ or w/out utf-8 chars
> # Using spaces as separators, this script shows
> # (better than tabnanny)  uncorrect  indentations.
>
> # to save output > python pxb.py hex.txt > px9_out_hex.txt
>
> nLenN=3          # n. of digits for lines
>
> # chomp heaps and heaps of comments

Little nitpick, since you did invite criticism :) When I went to copy
and paste your code, I skipped all the comments and started at the
line of hashes... and then didn't have the nLenN definition. Posting
code to a forum like this is a huge invitation to try the code (it's
the very easiest way to know what it does), so I would recommend
having all your comments at the top, and all the code in a block
underneath. It'd be that bit easier for us to help you. Not a big
deal, though, I did figure out what was going on :)

>     sLineHex  =lF[n].encode('hex').replace('20','  ')

Here's the problem. Your hex string ends with "220a", and the
replace() method doesn't concern itself with the divisions between
bytes. It finds the second 2 of 22 and the leading 0 of 0a and
replaces them.

I think the best solution may be to avoid the .encode('hex') part,
since it's not available in Python 3 anyway. Alternatively (if Py3
migration isn't a concern), you could do something like this:

    sLineHexND=lF[n].encode('hex')     # ND = no delimiter (space)
    sLineHex  =sLineHexND # No reason to redo the encoding
    twentypos=0
    while True:
        twentypos=sLineHex.find("20",twentypos)
        if twentypos==-1: break # We've reached the end of the string
        if not twentypos%2: # It's at an even-numbered position, replace it
            sLineHex=sLineHex[:twentypos]+'  '+sLineHex[twentypos+2:]
        twentypos+=1
    # then continue on as before
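
The same idea can be sketched in Python 3, where `.encode('hex')` is not available. This is only an illustration under assumptions: `blank_space_bytes` is a made-up name, and `bytes.hex()` needs Python 3.5 or later. The hex string is cut into two-digit pairs first, so only whole `20` bytes are blanked.

```python
def blank_space_bytes(hex_string):
    """Replace each whole '20' byte with two blanks (hypothetical helper)."""
    pairs = [hex_string[i:i + 2] for i in range(0, len(hex_string), 2)]
    return ''.join('  ' if pair == '20' else pair for pair in pairs)


line = '345"789"\n'
hex_string = line.encode('utf-8').hex()

# Naive replace: also matches the "2" + "0" that straddle the bytes 22 and 0a.
naive = hex_string.replace('20', '  ')

# Byte-aligned replace: this line contains no space, so nothing changes.
aligned = blank_space_bytes(hex_string)

if __name__ == '__main__':
    print(repr(naive))
    print(repr(aligned))
```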

>     sLineHexH =sLineHex[::2]
>     sLineHexL =sLineHex[1::2]
> [ code continues ]

Hope that helps!

ChrisA



#50120

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-07-08 05:48 +0000
Message-ID: <51da52c6$0$6512$c3e8da3$5496439d@news.astraweb.com>
In reply to: #50110

On Sun, 07 Jul 2013 17:22:26 -0700, blatt wrote:

> Hi all,
> but a particular hello to Chris Angelino which with their critics and
> suggestions pushed me to make a full revision of my application on hex
> dump in presence of utf-8 chars.

I don't understand what you are trying to say. All characters are UTF-8 
characters. "a" is a UTF-8 character. So is "ă".


> If you are not using python 3, the utf-8 codec can add further
> programming problems, 

On the contrary, I find that so long as you understand what you are doing 
it solves problems, not adds them. However, if you are confused about the 
difference between characters (text strings) and bytes, or if you are 
dealing with arbitrary binary data and trying to treat it as if it were 
UTF-8 encoded text, then you can have errors. Those errors are a good 
thing.


> especially if you are not a guru.... The script
> seems very long but I commented too much ... sorry. It is very useful
> (at least IMHO...)
> It works under Linux. but there is still a little problem which I didn't
> solve (at least programmatically...).
> 
> 
> # -*- coding: utf-8 -*-
> # px.py vers. 11 (pxb.py)   # python 2.6.6
> # hex-dump w/ or w/out utf-8 chars
> # Using spaces as separators, this script shows 
> # (better than tabnanny) uncorrect  indentations.

The word you are looking for is "incorrect".


> # to save output > python pxb.py hex.txt > px9_out_hex.txt
> 
> nLenN=3          # n. of digits for lines
> 
> # version almost thoroughly rewritten on the ground of 
> # the critics and modifications suggested by Chris Angelico
> 
> # in the first version the utf-8 conversion to hex was shown horizontaly:
> 
> # 005 # qwerty: non è unicode bensì ascii 
> #     2 7767773 666 ca 7666666 6667ca 676660
> #     3 175249a efe 38 5e93f45 25e33c 13399a

Oh! We're supposed to read the output *downwards*! That's not very 
intuitive. It took me a while to work that out. You should at least say 
so.


> # ... but I had to insert additional chars to keep the
> # synchronization between the literal and the hex part
> 
> # 005 # qwerty: non è. unicode bensì. ascii 
> #     2 7767773 666 ca 7666666 6667ca 676660
> #     3 175249a efe 38 5e93f45 25e33c 13399a

Well that sucks, because now sometimes you have to read downwards 
(character 'q' -> hex 71, reading downwards) and sometimes you read both 
downwards and across (character 'è' -> hex c3a8). Sometimes a dot means a 
dot and sometimes it means filler. How is the user supposed to know when 
to read down and when across?

 
> # in the second version I followed Chris suggestion:
> # "to show the hex utf-8 vertically"

You're already showing UTF-8 characters vertically, if they happen to be 
a one-byte character. Better to be consistent and always show characters 
vertical, regardless of whether they are one, two or four bytes.


> # 005 # qwerty: non è unicode bensì ascii
> #     2 7767773 666 c 7666666 6667c 676660
> #     3 175249a efe 3 5e93f45 25e33 13399a 
> #                   a             a
> #                   8             c

Much better! Now at least you can trivially read down the column to see 
the bytes used for each character. As an alternative, you can space each 
character to show the bytes horizontally, displaying spaces and other 
invisible characters either as dots, backslash escapes, or Unicode 
control pictures, whichever you prefer. The example below uses dots for 
spaces and backslash escape for newline:

q  w  e  r  t  y  :  .  n  o  n  .  è     .  u  n  i  
71 77 65 72 74 79 3a 20 6e 6f 6e 20 c3 a8 20 75 6e 69

c  o  d  e  .  b  e  n  s  ì     .  a  s  c  i  i  \n
63 6f 64 65 20 62 65 6e 73 c3 ac 20 61 73 63 69 69 0a


There will always be some ambiguity between (e.g.) dot representing a 
dot, and it representing an invisible control character or space, but the 
reader can always tell them apart by reading the hex value, which you 
*always* read horizontally whether it is one byte, two or four. There's 
never any confusion whether you should read down or across.
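
A rough Python 3 sketch of that horizontal layout follows. The function name `hex_rows` and the choice of a dot for space and a backslash escape for newline are assumptions made for the example. It also assumes every character occupies one column on screen, which is not true for combining or wide characters.

```python
def hex_rows(text):
    """Return (chars, hexes): each character padded to sit above its UTF-8 bytes."""
    top, bottom = [], []
    for ch in text:
        hexed = ' '.join(format(b, '02x') for b in ch.encode('utf-8'))
        if ch == ' ':
            shown = '.'        # a dot stands in for a space
        elif ch == '\n':
            shown = '\\n'      # backslash escape for newline
        else:
            shown = ch
        # Pad the character to the width of its hex bytes.
        top.append(shown.ljust(len(hexed)))
        bottom.append(hexed)
    return ' '.join(top).rstrip(), ' '.join(bottom)


if __name__ == '__main__':
    chars, hexes = hex_rows('non \xe8\n')
    print(chars)
    print(hexes)
```

A multi-byte character simply gets a wider cell, so the hex row is always read left to right.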

Unfortunately, most fonts don't support the Unicode control pictures. But 
if you choose to use them, here they are, together with their Unicode 
name. You can use the form

'\N{...}'  # Python 3
u'\N{...}'  # Python 2

to get the characters, replacing ... with the name shown below:


␀ SYMBOL FOR NULL
␁ SYMBOL FOR START OF HEADING
␂ SYMBOL FOR START OF TEXT
␃ SYMBOL FOR END OF TEXT
␄ SYMBOL FOR END OF TRANSMISSION
␅ SYMBOL FOR ENQUIRY
␆ SYMBOL FOR ACKNOWLEDGE
␇ SYMBOL FOR BELL
␈ SYMBOL FOR BACKSPACE
␉ SYMBOL FOR HORIZONTAL TABULATION
␊ SYMBOL FOR LINE FEED
␋ SYMBOL FOR VERTICAL TABULATION
␌ SYMBOL FOR FORM FEED
␍ SYMBOL FOR CARRIAGE RETURN
␎ SYMBOL FOR SHIFT OUT
␏ SYMBOL FOR SHIFT IN
␐ SYMBOL FOR DATA LINK ESCAPE
␑ SYMBOL FOR DEVICE CONTROL ONE
␒ SYMBOL FOR DEVICE CONTROL TWO
␓ SYMBOL FOR DEVICE CONTROL THREE
␔ SYMBOL FOR DEVICE CONTROL FOUR
␕ SYMBOL FOR NEGATIVE ACKNOWLEDGE
␖ SYMBOL FOR SYNCHRONOUS IDLE
␗ SYMBOL FOR END OF TRANSMISSION BLOCK
␘ SYMBOL FOR CANCEL
␙ SYMBOL FOR END OF MEDIUM
␚ SYMBOL FOR SUBSTITUTE
␛ SYMBOL FOR ESCAPE
␜ SYMBOL FOR FILE SEPARATOR
␝ SYMBOL FOR GROUP SEPARATOR
␞ SYMBOL FOR RECORD SEPARATOR
␟ SYMBOL FOR UNIT SEPARATOR
␠ SYMBOL FOR SPACE
␡ SYMBOL FOR DELETE
␢ BLANK SYMBOL
␣ OPEN BOX
␤ SYMBOL FOR NEWLINE
␥ SYMBOL FOR DELETE FORM TWO
␦ SYMBOL FOR SUBSTITUTE FORM TWO


(I wish more fonts would support these characters, they are very useful.)
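
As a small Python 3 sketch of how these could be used: the control pictures for the C0 controls and space sit at U+2400 plus the code of the character they depict, with DELETE handled separately. The helper name `show_invisibles` is an assumption for the example.

```python
import unicodedata


def show_invisibles(text):
    """Swap control characters, space and DEL for Unicode control pictures."""
    out = []
    for ch in text:
        code = ord(ch)
        if code <= 0x20:
            # U+2400..U+2420 picture the characters 0x00..0x20 in order.
            out.append(chr(0x2400 + code))
        elif code == 0x7F:
            out.append('\N{SYMBOL FOR DELETE}')
        else:
            out.append(ch)
    return ''.join(out)


if __name__ == '__main__':
    shown = show_invisibles('a b\tc\n')
    print(shown)
    for ch in shown:
        print(ch, unicodedata.name(ch))
```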


[...]
> # works on any n. of bytes for utf-8
> 
> # For the user: it is helpful to have in a separate file
> # all special characters of interest, together with their names.

In Python, you can use the unicodedata module to look up characters by 
name, or given the character, find out what its name is.
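
For instance, a short Python 3 sketch using `unicodedata.name` and `unicodedata.lookup` (the wrapper `describe` is a made-up helper, and the `'<unnamed>'` fallback is only a choice for this example):

```python
import unicodedata


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    # Control characters have no name, so pass a default instead of
    # letting unicodedata.name() raise ValueError.
    return 'U+{:04X} {}'.format(ord(ch), unicodedata.name(ch, '<unnamed>'))


if __name__ == '__main__':
    print(describe('\xe8'))
    print(unicodedata.lookup('LATIN SMALL LETTER I WITH GRAVE'))
```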


[...]
> import fileinput
> import sys, commands
> 
> lF=[]                           # input file as list
> for line in fileinput.input():  # handles all the details of args-or-stdin
>     lF.append(line)


That is more easily written as:

lF = list(fileinput.input())

and better written with a meaningful variable name. Whenever you have a 
variable, and find the need to give a comment explaining what the 
variable name means, you should consider a more descriptive name.

When that name is a cryptic two letter name, that goes double.


> sSpacesXLN = ' ' * (nLenN+1)
> 
> 
> for n in xrange(len(lF)):
>     sLineHexND=lF[n].encode('hex')     # ND = no delimiter (space)

You're programming like a Pascal or C programmer. There is nearly never 
any need to write code like that in Python. Rather than iterate over the 
indexes, then extract the part you want, it is better to iterate directly 
over the parts you want:

for line in lF:
    sLineHexND = line.encode('hex')
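
If the line number is still needed for the output, `enumerate` supplies it without indexing. A minimal sketch, with `number_lines` and its `width` parameter invented for the example:

```python
def number_lines(lines, width=3):
    """Prefix each line with a zero-padded, 1-based line number."""
    numbered = []
    for number, line in enumerate(lines, 1):
        numbered.append('{} {}'.format(str(number).zfill(width), line.rstrip('\n')))
    return numbered


if __name__ == '__main__':
    for row in number_lines(['first\n', 'second\n']):
        print(row)
```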



>     sLineHex  =lF[n].encode('hex').replace('20','  ')
>     sLineHexH =sLineHex[::2]
>     sLineHexL =sLineHex[1::2]

Trying to keep code lined up in this way is a bad habit to get into. It 
just sets you up for many hours of unproductive adding and deleting 
spaces trying to keep things aligned.

Also, what on earth are all these "s" prefixes?

>     sSynchro=''
>     for k in xrange(0,len(sLineHexND),2):

Probably the best way to walk through a string, grabbing the characters 
in pairs, comes from the itertools module: see the recipe for "grouper".

http://docs.python.org/2/library/itertools.html

Here is a simplified version:

assert len(line) % 2 == 0
for pair in zip(*([iter(line)]*2)):
    ...

although understanding how it works requires a little advanced knowledge.
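
Spelled out a little more, as a sketch (the name `pairs` is an assumption): `zip` is handed the same iterator twice, so each tuple it builds consumes two consecutive characters.

```python
def pairs(hex_string):
    """Split a string of hex digits into two-character chunks."""
    if len(hex_string) % 2:
        raise ValueError('expected an even number of hex digits')
    it = iter(hex_string)
    # zip pulls from the same iterator for both positions of each tuple.
    return [high + low for high, low in zip(it, it)]


if __name__ == '__main__':
    print(pairs('6ec3a8'))
```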


>         if sLineHexND[k]<'8':
>             sSynchro+= sLineHexND[k]+sLineHexND[k+1] 
>             k+=1
>         elif sLineHexND[k]=='c':
>             sSynchro+='c'+sLineHexND[k+1]+sLineHexND[k+2]+sLineHexND[k+3]+'2e'
>             k+=3
>         elif sLineHexND[k]=='e':
>             sSynchro+='e'+sLineHexND[k+1]+sLineHexND[k+2]+sLineHexND[k+3]+\
>                           sLineHexND[k+4]+sLineHexND[k+5]+'2e2e'
>             k+=5

Apart from being hideously ugly to read, I do not believe this code works 
the way you think it works. Adding to the loop variable doesn't advance 
the loop. Try this and see for yourself:


for i in range(10):
    print(i)
    i += 5


The loop variable just gets reset once it reaches the top of the loop 
again.
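
When the step really does vary, a `while` loop with an explicit index is one way to skip ahead. The Python 3 sketch below walks UTF-8 bytes and jumps over continuation bytes. The name `utf8_char_lengths` is invented for the example, and it assumes well-formed UTF-8, so it only inspects the lead byte.

```python
def utf8_char_lengths(data):
    """Return the byte length of each character in UTF-8 encoded `data`."""
    lengths = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            n = 1                      # ASCII
        elif 0xC0 <= lead < 0xE0:
            n = 2                      # two-byte sequence
        elif 0xE0 <= lead < 0xF0:
            n = 3                      # three-byte sequence
        elif 0xF0 <= lead < 0xF8:
            n = 4                      # four-byte sequence
        else:
            raise ValueError('unexpected lead byte at offset {}'.format(i))
        lengths.append(n)
        i += n                         # this advance sticks, unlike in a for loop
    return lengths


if __name__ == '__main__':
    print(utf8_char_lengths('a\xe8\u20ac'.encode('utf-8')))
```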



-- 
Steven



#50162

From: ferdy.blatsco@gmail.com
Date: 2013-07-08 10:31 -0700
Message-ID: <7ef8c0e7-7f7c-4a22-89a9-50f62c4a8064@googlegroups.com>
In reply to: #50110

Hi Chris,
glad to have received your contribution, but I was expecting much more
criticism...
Starting from the "little nitpick" about where the comments sit in my
script... you are correct... It is a bad habit of mine to place
variables that are subject to change at the beginning of the script... and
then forget about them...
About the mistake caused by replace, you gave me a perfect explanation.
Unfortunately (as I probably told you before) I will never move to
Python 3...  Guido should not always listen only to gurus like himself...
I don't like Python as much as before... starting from OOP and ending with
codecs like utf-8. Regarding OOP, much appreciated especially by experts, he
could use Python 2 to hide the complexities of OOP (improving, as a
side effect, the hiding of object code) by moving classes and objects into
imported methods, leaving the programming style as the
well known old style: sequential programming and functions.
About utf-8... the same solution: keep utf-8 but, for non-experts, add
methods to convert to solutions which use the range 128-255 of a single
byte (I do not give a damn about Chinese and "similia"!...)
I know this is a lost battle (in Italian "una battaglia persa")!

Bye, Blatt



#50164

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-07-09 03:52 +1000
Message-ID: <mailman.4391.1373305945.3114.python-list@python.org>
In reply to: #50162

On Tue, Jul 9, 2013 at 3:31 AM,  <ferdy.blatsco@gmail.com> wrote:
> Unfortunately (as probably I told you before) I will never pass to
> Python 3...  Guido should not always listen only to gurus like him...
> I don't like Python as before...starting from OOP and ending with codecs
> like utf-8. Regarding OOP, much appreciated expecially by experts, he
> could use python 2 for hiding the complexities of OOP (improving, as an
> effect, object's code hiding) moving classes and objects to
> imported methods, leaving in this way the programming style to the
> well known old style: sequential programming and functions.
> About utf-8... the same solution: keep utf-8 but for the non experts, add
> methods to convert to solutions which use the range 128-255 of only one
> byte (I do not give a damn about chinese and "similia"!...)
> I know that is a lost battle (in italian "una battaglia persa")!

Well, there won't be a Python 2.8, so you really should consider
moving at some point. Python 3.3 is already way better than 2.7 in
many ways, 3.4 will improve on 3.3, and the future is pretty clear.
But nobody's forcing you, and 2.7.x will continue to get
bugfix/security releases for a while. (Personally, I'd be happy if
everyone moved off the 2.3/2.4 releases. It's not too hard supporting
2.6+ or 2.7+.)

The thing is, you're thinking about UTF-8, but you should be thinking
about Unicode. I recommend you read these articles:

http://www.joelonsoftware.com/articles/Unicode.html
http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/

So long as you are thinking about different groups of characters as
different, and wanting a solution that maps characters down into the
<256 range, you will never be able to cleanly internationalize. With
Python 3.3+, you can ignore the differences between ASCII, BMP, and
SMP characters; they're all just "characters". Everything works
perfectly with Unicode.

ChrisA



#50443

From: wxjmfauth@gmail.com
Date: 2013-07-11 06:18 -0700
Message-ID: <a3a4aa9b-3a5c-42cd-9a04-4c02f962b71e@googlegroups.com>
In reply to: #50164

On Monday, 8 July 2013 19:52:17 UTC+2, Chris Angelico wrote:
> On Tue, Jul 9, 2013 at 3:31 AM,  <ferdy.blatsco@gmail.com> wrote:
> > Unfortunately (as probably I told you before) I will never pass to
> > Python 3...  Guido should not always listen only to gurus like him...
> > I don't like Python as before...starting from OOP and ending with codecs
> > like utf-8. Regarding OOP, much appreciated expecially by experts, he
> > could use python 2 for hiding the complexities of OOP (improving, as an
> > effect, object's code hiding) moving classes and objects to
> > imported methods, leaving in this way the programming style to the
> > well known old style: sequential programming and functions.
> > About utf-8... the same solution: keep utf-8 but for the non experts, add
> > methods to convert to solutions which use the range 128-255 of only one
> > byte (I do not give a damn about chinese and "similia"!...)
> > I know that is a lost battle (in italian "una battaglia persa")!
>
> Well, there won't be a Python 2.8, so you really should consider
> moving at some point. Python 3.3 is already way better than 2.7 in
> many ways, 3.4 will improve on 3.3, and the future is pretty clear.
> But nobody's forcing you, and 2.7.x will continue to get
> bugfix/security releases for a while. (Personally, I'd be happy if
> everyone moved off the 2.3/2.4 releases. It's not too hard supporting
> 2.6+ or 2.7+.)
>
> The thing is, you're thinking about UTF-8, but you should be thinking
> about Unicode. I recommend you read these articles:
>
> http://www.joelonsoftware.com/articles/Unicode.html
> http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/
>
> So long as you are thinking about different groups of characters as
> different, and wanting a solution that maps characters down into the
> <256 range, you will never be able to cleanly internationalize. With
> Python 3.3+, you can ignore the differences between ASCII, BMP, and
> SMP characters; they're all just "characters". Everything works
> perfectly with Unicode.

-----------

Just to stick with this funny character ẞ, a ucs-2 char
in the Flexible String Representation nomenclature.

It seems to me that, when one needs more than ten bytes
to encode it, 

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('ẞ')
40

this is far from perfection.

BTW, for a modern language, has ucs2 not been considered
obsolete for many, many years?

jmf




#50444

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-07-11 23:32 +1000
Message-ID: <mailman.4585.1373549528.3114.python-list@python.org>
In reply to: #50443

On Thu, Jul 11, 2013 at 11:18 PM,  <wxjmfauth@gmail.com> wrote:
> Just to stick with this funny character ẞ, a ucs-2 char
> in the Flexible String Representation nomenclature.
>
> It seems to me that, when one needs more than ten bytes
> to encode it,
>
>>>> sys.getsizeof('a')
> 26
>>>> sys.getsizeof('ẞ')
> 40
>
> this is far away from the perfection.

Better comparison is to see how much space is used by one copy of it,
and how much by two copies:

>>> sys.getsizeof('aa')-sys.getsizeof('a')
1
>>> sys.getsizeof('ẞẞ')-sys.getsizeof('ẞ')
2

String objects have overhead. Big deal.
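
The same measurement can be wrapped in a small helper. This is a sketch that assumes CPython 3.3 or later: `sys.getsizeof` is implementation-specific, and the absolute sizes differ between builds, so only the difference between two lengths is compared. The name `bytes_per_char` is made up for the example.

```python
import sys


def bytes_per_char(ch, n=10):
    """Extra bytes used by one more copy of `ch` in a string of repeated `ch`."""
    # The fixed per-object overhead is the same in both strings and cancels out.
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


if __name__ == '__main__':
    for ch in ('a', '\u1e9e', '\U0001d11e'):
        print('U+{:04X}'.format(ord(ch)), bytes_per_char(ch))
```

On such an interpreter the three sample characters would be expected to cost 1, 2 and 4 bytes each, matching the storage width chosen for the whole string.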

> BTW, for a modern language, is not ucs2 considered
> as obsolete since many, many years?

Clearly. And similarly, the 16-bit integer has been completely
obsoleted, as there is no reason anyone should ever bother to use it.
Same with the float type - everyone uses double or better these days,
right?

http://www.postgresql.org/docs/current/static/datatype-numeric.html
http://www.cplusplus.com/doc/tutorial/variables/

Nope, nobody uses small integers any more, they're clearly completely obsolete.

ChrisA



#50473

From: wxjmfauth@gmail.com
Date: 2013-07-11 11:42 -0700
Message-ID: <26d5c832-eaa1-439e-af61-e2855af2cd18@googlegroups.com>
In reply to: #50444

On Thursday, 11 July 2013 15:32:00 UTC+2, Chris Angelico wrote:
> On Thu, Jul 11, 2013 at 11:18 PM,  <wxjmfauth@gmail.com> wrote:
> > Just to stick with this funny character ẞ, a ucs-2 char
> > in the Flexible String Representation nomenclature.
> >
> > It seems to me that, when one needs more than ten bytes
> > to encode it,
> >
> >>>> sys.getsizeof('a')
> > 26
> >>>> sys.getsizeof('ẞ')
> > 40
> >
> > this is far away from the perfection.
>
> Better comparison is to see how much space is used by one copy of it,
> and how much by two copies:
>
> >>> sys.getsizeof('aa')-sys.getsizeof('a')
> 1
> >>> sys.getsizeof('ẞẞ')-sys.getsizeof('ẞ')
> 2
>
> String objects have overhead. Big deal.
>
> > BTW, for a modern language, is not ucs2 considered
> > as obsolete since many, many years?
>
> Clearly. And similarly, the 16-bit integer has been completely
> obsoleted, as there is no reason anyone should ever bother to use it.
> Same with the float type - everyone uses double or better these days,
> right?
>
> http://www.postgresql.org/docs/current/static/datatype-numeric.html
> http://www.cplusplus.com/doc/tutorial/variables/
>
> Nope, nobody uses small integers any more, they're clearly completely obsolete.

Sure, there is some overhead because a str is a class.
It still remains that a "ẞ" weighs 14 bytes more than
an "a".

In "aẞ", the ẞ weighs 16 bytes.

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('aẞ')
42

and in "aẞẞ", the ẞ weighs 2 bytes

>>> sys.getsizeof('aẞẞ')

And what about this "ucs4" char/string '\U0001d11e', which
weighs 18 bytes more than an "a"?

>>> sys.getsizeof('\U0001d11e')
44

A total absurdity. How does this come about? Very simple: once you
split Unicode into subsets, not only do you have to handle these
subsets, you also have to create "markers" to differentiate them.
And not only do you produce "markers", you have to handle the
mess generated by these "markers". Hiding these markers
in the overhead of the class does not mean that they should
not be counted as part of the coding scheme. BTW, since
when does a serious coding scheme need an external marker?



>>> sys.getsizeof('aa') - sys.getsizeof('a')
1

In short, if my algebra is still correct:

(overhead + marker + 2*'a') - (overhead + marker + 'a')
= (overhead + marker + 2*'a') - overhead - marker - 'a'
= overhead - overhead + marker - marker + 2*'a' - 'a'
= 0 + 0 + 'a'
= 1

The "marker" has magically disappeared.

jmf



#50474

From: wxjmfauth@gmail.com
Date: 2013-07-11 11:44 -0700
Message-ID: <d9ebccf8-050c-4dcc-ad51-ff50868a1287@googlegroups.com>
In reply to: #50473

On Thursday, 11 July 2013 20:42:26 UTC+2, wxjm...@gmail.com wrote:
> On Thursday, 11 July 2013 15:32:00 UTC+2, Chris Angelico wrote:
> > On Thu, Jul 11, 2013 at 11:18 PM,  <wxjmfauth@gmail.com> wrote:
> > > Just to stick with this funny character ẞ, a ucs-2 char
> > > in the Flexible String Representation nomenclature.
> > >
> > > It seems to me that, when one needs more than ten bytes
> > > to encode it,
> > >
> > >>>> sys.getsizeof('a')
> > > 26
> > >>>> sys.getsizeof('ẞ')
> > > 40
> > >
> > > this is far away from the perfection.
> >
> > Better comparison is to see how much space is used by one copy of it,
> > and how much by two copies:
> >
> > >>> sys.getsizeof('aa')-sys.getsizeof('a')
> > 1
> > >>> sys.getsizeof('ẞẞ')-sys.getsizeof('ẞ')
> > 2
> >
> > String objects have overhead. Big deal.
> >
> > > BTW, for a modern language, is not ucs2 considered
> > > as obsolete since many, many years?
> >
> > Clearly. And similarly, the 16-bit integer has been completely
> > obsoleted, as there is no reason anyone should ever bother to use it.
> > Same with the float type - everyone uses double or better these days,
> > right?
> >
> > http://www.postgresql.org/docs/current/static/datatype-numeric.html
> > http://www.cplusplus.com/doc/tutorial/variables/
> >
> > Nope, nobody uses small integers any more, they're clearly completely obsolete.
>
> Sure there is some overhead because a str is a class.
> It still remain that a "ẞ" weights 14 bytes more than
> an "a".
>
> In "aẞ", the ẞ weights 6 bytes.
>
> >>> sys.getsizeof('a')
> 26
> >>> sys.getsizeof('aẞ')
> 42
>
> and in "aẞẞ", the ẞ weights 2 bytes
>
> sys.getsizeof('aẞẞ')
>
> And what to say about this "ucs4" char/string '\U0001d11e' which
> is weighting 18 bytes more than an "a".
>
> >>> sys.getsizeof('\U0001d11e')
> 44
>
> A total absurdity. How does is come? Very simple, once you
> split Unicode in subsets, not only you have to handle these
> subsets, you have to create "markers" to differentiate them.
> Not only, you produce "markers", you have to handle the
> mess generated by these "markers". Hiding this markers
> in the everhead of the class does not mean that they should
> not be counted as part of the coding scheme. BTW, since
> 
> when a serious coding scheme need an extermal marker?
> 
> 
> 
> 
> 
> 
> 
> >>> sys.getsizeof('aa') - sys.getsizeof('a')
> 
> 1
> 
> 
> 
> Shortly, if my algebra is still correct:
> 
> 
> 
> (overhead + marker + 2*'a') - (overhead + marker + 'a')
> 
> = (overhead + marker + 2*'a') - overhead - marker - 'a'
> 
> = overhead - overhead + marker - marker + 2*'a' - 'a'
> 
> = 0 + 0 + 'a'
> 
> = 1
> 
> 
> 
> The "marker" has magically disappeared.
> 
> 
> 
> jmf



#50487

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-07-12 03:18 +0000
Message-ID: <51df7593$0$9505$c3e8da3$5496439d@news.astraweb.com>
In reply to: #50473
On Thu, 11 Jul 2013 11:42:26 -0700, wxjmfauth wrote:

> And what about this "ucs4" char/string '\U0001d11e', which
> weighs 18 bytes more than an "a"?
> 
>>>> sys.getsizeof('\U0001d11e')
> 44
> 
> A total absurdity. 


You should stick to Python 3.1 and 3.2 then:

py> print(sys.version)
3.1.3 (r313:86834, Nov 28 2010, 11:28:10)
[GCC 4.4.5]
py> sys.getsizeof('\U0001d11e')
36
py> sys.getsizeof('a')
36


Now all your strings will be just as heavy, every single variable name 
and attribute name will use four times as much memory. Happy now?


> How does it come about? Very simple: once you split Unicode
> into subsets, you not only have to handle these subsets, you also have to
> create "markers" to differentiate them. And once you produce "markers",
> you have to handle the mess generated by these "markers". Hiding these
> markers in the overhead of the class does not mean that they should not
> be counted as part of the coding scheme. BTW, since when does a serious
> coding scheme need an external marker?

Since always.

How do you think that (say) a C compiler can tell the difference between 
the long 1199876496 and the float 67923.125? They both have exactly the 
same four bytes:

py> import struct
py> struct.pack('<f', 67923.125)  # '<' forces a standard 4-byte float
b'\x90\xa9\x84G'
py> struct.pack('<l', 1199876496)  # and a standard 4-byte long
b'\x90\xa9\x84G'


*Everything* in a computer is bytes. The only way to tell them apart is 
by external markers.
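The round trip can be shown the other way too. This is a minimal sketch that packs the float once and then reads the same four bytes back under two different format codes. The '<' prefix is used so the sizes and byte order do not depend on the platform; the variable names are only illustrative.

```python
import struct

# Four bytes: a little-endian IEEE 754 single-precision float.
raw = struct.pack("<f", 67923.125)

# The format code is the external marker: same bytes, two readings.
as_float = struct.unpack("<f", raw)[0]
as_int = struct.unpack("<i", raw)[0]

print(raw, as_float, as_int)
```

Nothing in raw itself records which of the two readings was intended.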



-- 
Steven



#50553

From: wxjmfauth@gmail.com
Date: 2013-07-12 14:42 -0700
Message-ID: <5f8322b5-56f1-4dda-9dae-203453eb62b8@googlegroups.com>
In reply to: #50487
On Friday, 12 July 2013 05:18:44 UTC+2, Steven D'Aprano wrote:
> On Thu, 11 Jul 2013 11:42:26 -0700, wxjmfauth wrote:
>
> Now all your strings will be just as heavy, every single variable name
> and attribute name will use four times as much memory. Happy now?

--------

>>> 㑖 = 999
>>> class C:
...     cœur = 'heart'
...

- Why always this magic number "four"?
- Are you able to think in non-ASCII terms for once?
- Have you ever felt penalized because you are using
fonts with OpenType technology?
- Have you ever had a problem with PDF? I can tell you,
UTF-32 is peanuts compared to the CID fonts you
are using.
- Have you ever toyed with a Unicode TeX engine?
- Have you ever looked at the code of a rendering engine like HarfBuzz?


jmf



#50504

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-07-12 12:16 +1000
Message-ID: <mailman.4619.1373613834.3114.python-list@python.org>
In reply to: #50473
On Fri, Jul 12, 2013 at 4:42 AM,  <wxjmfauth@gmail.com> wrote:
> BTW, since
> when does a serious coding scheme need an external marker?
>

All of them.

Content-type: text/plain; charset=UTF-8
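
The header line above is such an external marker: the bytes of a message body do not say which charset they are in. A minimal sketch of what happens when the label is wrong (the variable names are only illustrative):

```python
# U+1E9E, the capital sharp s used earlier in the thread, as UTF-8 bytes.
data = "\u1e9e".encode("utf-8")

# The same bytes, read under two different charset labels.
for charset in ("utf-8", "latin-1"):
    print(charset, ascii(data.decode(charset)))
```

Only the utf-8 label gives back the single original character; latin-1 decodes without error but yields three unrelated characters.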

ChrisA



#50581

From: wxjmfauth@gmail.com
Date: 2013-07-13 00:56 -0700
Message-ID: <2165f7cc-b9bf-41b1-b128-d33b522046dc@googlegroups.com>
In reply to: #50504
On Friday, 12 July 2013 04:16:21 UTC+2, Chris Angelico wrote:
> On Fri, Jul 12, 2013 at 4:42 AM,  <wxjmfauth@gmail.com> wrote:
> > BTW, since
> > when does a serious coding scheme need an external marker?
>
> All of them.
>
> Content-type: text/plain; charset=UTF-8
>
> ChrisA

------


None of them.

You are confusing the knowledge of a coding scheme with the intrinsic
information a "coding scheme" *may* have, in a mandatory way, to work
properly. These are conceptually two different things.

I am convinced you are not conceptually understanding utf-8 very well.
I wrote many times, "utf-8 does not produce bytes, but Unicode Encoding
Units".

A similar coding scheme: iso-6937.

Try to write an editor, a text widget, with a coding
scheme like the Flexible String Representation. You will
quickly notice that it is impossible (understand correctly).
(You do not need a computer, just a sheet of paper and a pencil.)
Hint: what is the character at the caret position?

jmf



#50584

From: Lele Gaifax <lele@metapensiero.it>
Date: 2013-07-13 10:24 +0200
Message-ID: <mailman.4674.1373703883.3114.python-list@python.org>
In reply to: #50581
wxjmfauth@gmail.com writes:

> Try to write an editor, a text widget, with a coding
> scheme like the Flexible String Representation. You will
> quickly notice that it is impossible (understand correctly).
> (You do not need a computer, just a sheet of paper and a pencil.)
> Hint: what is the character at the caret position?

I am convinced you are not conceptually understanding FSR very well.
Alternatively, you may have a strange notion of “impossible”.
Or both.

ciao, lele.
-- 
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
lele@metapensiero.it  |                 -- Fortunato Depero, 1929.



#50586

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-07-13 09:36 +0000
Message-ID: <51e11f85$0$9505$c3e8da3$5496439d@news.astraweb.com>
In reply to: #50581
On Sat, 13 Jul 2013 00:56:52 -0700, wxjmfauth wrote:

> I am convinced you are not conceptually understanding utf-8 very well. I
> wrote many times, "utf-8 does not produce bytes, but Unicode Encoding
> Units".

Just because you write it many times, doesn't make it correct. You are 
simply wrong. UTF-8 produces bytes. That's what gets written to files and 
transmitted over networks, bytes, not "Unicode Encoding Units", whatever 
they are.
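
For what it is worth, the two terms can be counted side by side. A minimal sketch, assuming only the standard codecs; the helper name code_units is made up for illustration. It reports how many code units and how many bytes each encoding form produces for a character. In UTF-8 the code unit is 8 bits wide, so the two counts are the same number.

```python
def code_units(text, encoding, unit_size):
    """Return (number of code units, number of bytes) for text in encoding."""
    encoded = text.encode(encoding)      # str.encode always returns bytes
    return len(encoded) // unit_size, len(encoded)


forms = (("utf-8", 1), ("utf-16-le", 2), ("utf-32-le", 4))

for ch in ("a", "\u1e9e", "\U0001d11e"):
    for encoding, unit_size in forms:
        units, nbytes = code_units(ch, encoding, unit_size)
        print("U+%04X" % ord(ch), encoding, units, "unit(s),", nbytes, "byte(s)")
```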


> A similar coding scheme: iso-6937.
> 
> Try to write an editor, a text widget, with a coding scheme like
> the Flexible String Representation. You will quickly notice that it is
> impossible (understand correctly). (You do not need a computer, just a
> sheet of paper and a pencil.) Hint: what is the character at the caret
> position?

That is a simple index operation into the buffer. If the caret position 
is 10 characters in, you index buffer[10-1] and it will give you the 
character to the left of the caret. buffer[10] will give you the 
character to the right of the caret. It is simple, trivial, and easy. The 
buffer itself knows whether to look ahead 10 bytes, 10*2 bytes or 10*4 
bytes.

Here is an example of such a tiny buffer, implemented in Python 3.3 with 
the hated Flexible String Representation. In each example, imagine the 
caret is five characters from the left:

12345|more characters here...

It works regardless of whether your characters are ASCII:


py> buffer = '12345ABCD...'
py> buffer[5-1]  # character to the left of the caret
'5'
py> buffer[5]  # character to the right of the caret
'A'


Latin 1:

py> buffer = '12345áßçð...'
py> buffer[5-1]  # character to the left of the caret
'5'
py> buffer[5]  # character to the right of the caret
'á'


Other BMP characters:

py> buffer = '12345αдᚪ∞...'
py> buffer[5-1]  # character to the left of the caret
'5'
py> buffer[5]  # character to the right of the caret
'α'


And Supplementary Plane Characters:

py> buffer = ('12345'
...     '\N{ALCHEMICAL SYMBOL FOR AIR}'
...     '\N{ALCHEMICAL SYMBOL FOR FIRE}'
...     '\N{ALCHEMICAL SYMBOL FOR EARTH}'
...     '\N{ALCHEMICAL SYMBOL FOR WATER}'
...     '...')
py> buffer
'12345🜁🜂🜃🜄...'
py> len(buffer)
12
py> buffer[5-1]  # character to the left of the caret
'5'
py> buffer[5]  # character to the right of the caret
'🜁'
py> import unicodedata
py> unicodedata.name(buffer[5])
'ALCHEMICAL SYMBOL FOR AIR'


And it all Just Works in Python 3.3. So much for "impossible to tell" 
what the character at the caret is. It is *trivial*.



Ah, but how about Python 3.2? We set up the same buffer:


py> buffer = ('12345'
...     '\N{ALCHEMICAL SYMBOL FOR AIR}'
...     '\N{ALCHEMICAL SYMBOL FOR FIRE}'
...     '\N{ALCHEMICAL SYMBOL FOR EARTH}'
...     '\N{ALCHEMICAL SYMBOL FOR WATER}'
...     '...')
py> buffer
'12345🜁🜂🜃🜄...'
py> len(buffer)
16

Sixteen? Sixteen? Where did the extra four characters come from? They 
came from *surrogate pairs*.


py> buffer[5-1]  # character to the left of the caret
'5'
py> buffer[5]  # character to the right of the caret
'\ud83d'


Funny, that looks different.


py> import unicodedata
py> unicodedata.name(buffer[5])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name


No name?

Because buffer[5] is only *half* of the surrogate pair. It is broken, and 
there is really no way of fixing that breakage in Python 3.2 with a 
narrow build. You can fix it with a wide build, but only at the cost of 
every string, every name, using double the amount of storage, whether it 
needs it or not.
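
Where the '\ud83d' comes from can be worked out by hand. A minimal sketch of the standard UTF-16 surrogate arithmetic; the function name to_surrogate_pair is made up for illustration:

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into its UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


air = "\N{ALCHEMICAL SYMBOL FOR AIR}"
high, low = to_surrogate_pair(ord(air))
print(hex(high), hex(low))
```

On a narrow build, buffer[5] is the first of those two values on its own, which is why unicodedata has no name for it.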


-- 
Steven



#50587

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-07-13 19:46 +1000
Message-ID: <mailman.4676.1373708809.3114.python-list@python.org>
In reply to: #50581
On Sat, Jul 13, 2013 at 5:56 PM,  <wxjmfauth@gmail.com> wrote:
> Try to write an editor, a text widget, with a coding
> scheme like the Flexible String Representation. You will
> quickly notice that it is impossible (understand correctly).
> (You do not need a computer, just a sheet of paper and a pencil.)
> Hint: what is the character at the caret position?

I would use an internal representation that allows insertion and
deletion - in its simplest form, a list of strings. And those strings
would be whatever I can most conveniently work with.

I've never built a text editor widget, because my libraries always
provide them. But there is a rough parallel in the display storage for
Gypsum, which stores a series of lines, each of which is a series of
sections in different colors. (A line might be a single section, ie
one color for its whole length.) I store them in arrays of (color,
string, color, string, color, string...). The strings I use are in the
format wanted by my display subsystem - which in my case is the native
string type of the language, which... oh, what a pity for jmf, is a
flexible object that uses 8, 16, or 32 bits for each character.
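
A minimal sketch of the "list of strings" idea, assuming Python 3.3 or later. This is not Gypsum's code; the class LineBuffer and its method names are hypothetical, and colors are left out.

```python
class LineBuffer:
    """Hypothetical text buffer: one native Python str per line."""

    def __init__(self, text=""):
        self.lines = text.split("\n")

    def char_at(self, row, col):
        """Character immediately to the right of a caret at (row, col)."""
        return self.lines[row][col]

    def insert(self, row, col, text):
        """Insert text (assumed to contain no newline) at the caret."""
        line = self.lines[row]
        self.lines[row] = line[:col] + text + line[col:]

    def delete(self, row, col, count=1):
        """Delete count characters to the right of the caret."""
        line = self.lines[row]
        self.lines[row] = line[:col] + line[col + count:]


buf = LineBuffer("12345ABCD\nsecond line")
buf.insert(0, 5, "\N{ALCHEMICAL SYMBOL FOR AIR}")
print(ascii(buf.char_at(0, 5)), buf.char_at(0, 6))
buf.delete(0, 5)
print(buf.lines[0])
```

The buffer never needs to know how wide the characters are; indexing and slicing are done in characters by the str type itself.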

ChrisA



#50588

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-07-13 09:49 +0000
Message-ID: <51e12295$0$9505$c3e8da3$5496439d@news.astraweb.com>
In reply to: #50581
On Sat, 13 Jul 2013 00:56:52 -0700, wxjmfauth wrote:

> You are confusing the knowledge of a coding scheme with the intrinsic
> information a "coding scheme" *may* have, in a mandatory way, to work
> properly. These are conceptually two different things.

*May* have, in a *mandatory* way?

JMF, I know you are not a native English speaker, so you might not be 
aware just how silly your statement is. If it *may* have, it is optional, 
since it *may not* have instead. But if it is optional, it is not 
mandatory.

You are making so much fuss over such a simple, obvious implementation 
for strings. The language Pike has done the same thing for probably a 
decade or so.

Ironically, Python has done the same thing for integers for many versions 
too. They just didn't call it "Flexible Integer Representation", but 
that's what it is. For integers smaller than 2**31, they are stored as C 
longs (plus object overhead). For integers larger than 2**31, they are 
promoted to a BigNum implementation that can handle unlimited digits.

Using Python 2.7, where it is more obvious because the BigNum has an L 
appended to the display, and a different type:

py> for n in (1, 2**20, 2**30, 2**31, 2**65):
...     print repr(n), type(n), sys.getsizeof(n)
...
1 <type 'int'> 12
1048576 <type 'int'> 12
1073741824 <type 'int'> 12
2147483648L <type 'long'> 18
36893488147419103232L <type 'long'> 22


You have been using Flexible Integer Representation for *years*, and it 
works great, and you've never noticed any problems.
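
The loop above is Python 2 syntax. A rough Python 3 counterpart, as a sketch only: Python 3 has a single int type, so the change of type no longer shows, but on CPython the reported size still grows with the magnitude of the number. The exact byte counts depend on the version and platform, so none are quoted here.

```python
import sys

values = (1, 2**20, 2**30, 2**31, 2**65)
sizes = [sys.getsizeof(n) for n in values]

for n, size in zip(values, sizes):
    print(repr(n), type(n).__name__, size)
```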



-- 
Steven



#50589

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-07-13 20:09 +1000
Message-ID: <mailman.4677.1373710147.3114.python-list@python.org>
In reply to: #50588
On Sat, Jul 13, 2013 at 7:49 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> Ironically, Python has done the same thing for integers for many versions
> too. They just didn't call it "Flexible Integer Representation", but
> that's what it is. For integers smaller than 2**31, they are stored as C
> longs (plus object overhead). For integers larger than 2**31, they are
> promoted to a BigNum implementation that can handle unlimited digits.

Hmm. That's true of Python 2 (mostly - once an operation yields a
long, it never reverts to int, whereas a string will shrink if you
remove the wider characters from it), but not, I think, of Python 3.
The optimization isn't there any more. At least, I did some tinkering
a while ago (on 3.2, I think), so maybe it's been reinstated since. As
of Python 3 and the unification of types, it's definitely possible to
put that in as a pure optimization, anyhow.

ChrisA



#50596

From: wxjmfauth@gmail.com
Date: 2013-07-13 07:37 -0700
Message-ID: <bee89c63-2608-490d-b75f-62aa7c957223@googlegroups.com>
In reply to: #50588

The FSR is naive and works badly. I cannot force people
to understand the coding of characters [*].

I'm the first to recognize that Python and/or Pike are
free to do what they wish.

Luckily, for the crowd, those who do not even know that the
coding of characters exists, all the serious actors active in
text processing are working properly.

jmf

* By nature characters and numbers are different.



#50611

From: Dave Angel <davea@davea.name>
Date: 2013-07-13 15:02 -0400
Message-ID: <mailman.4684.1373742160.3114.python-list@python.org>
In reply to: #50596
On 07/13/2013 10:37 AM, wxjmfauth@gmail.com wrote:

<SNIP>
>
> The FSR is naive and works badly. I cannot force people
> to understand the coding of characters [*].

That would be very hard, since you certainly do not.
>
> I'm the first to recognize that Python and/or Pike are
> free to do what they wish.

Fortunately for us, Python (in version 3.3 and later) and Pike did it 
right.  Some day the others may decide to do similarly.

>
> Luckily, for the crowd, those who do not even know that the
> coding of characters exists, all the serious actors active in
> text processing are working properly.

Here, I'm really glad you don't know English, because if you had a 
decent grasp of the language, somebody might assume you knew what you
were talking about.

>
> jmf
>
> * By nature characters and numbers are different.
>

By nature Jmf has his own distorted reality.

-- 
DaveA
