Groups > comp.lang.python > #50110 > unrolled thread
| Started by | blatt <ferdy.blatsco@gmail.com> |
|---|---|
| First post | 2013-07-07 17:22 -0700 |
| Last post | 2013-07-13 04:51 +0000 |
| Articles | 20 on this page of 49 — 15 participants |
hex dump w/ or w/out utf-8 chars blatt <ferdy.blatsco@gmail.com> - 2013-07-07 17:22 -0700
Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-08 11:17 +1000
Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-08 05:48 +0000
Re: hex dump w/ or w/out utf-8 chars ferdy.blatsco@gmail.com - 2013-07-08 10:31 -0700
Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-09 03:52 +1000
Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-11 06:18 -0700
Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-11 23:32 +1000
Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-11 11:42 -0700
Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-11 11:44 -0700
Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-12 03:18 +0000
Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-12 14:42 -0700
Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-12 12:16 +1000
Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-13 00:56 -0700
Re: hex dump w/ or w/out utf-8 chars Lele Gaifax <lele@metapensiero.it> - 2013-07-13 10:24 +0200
Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-13 09:36 +0000
Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-13 19:46 +1000
Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-13 09:49 +0000
Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-13 20:09 +1000
Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-13 07:37 -0700
Re: hex dump w/ or w/out utf-8 chars Dave Angel <davea@davea.name> - 2013-07-13 15:02 -0400
Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-14 01:20 -0700
Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-14 10:44 +0000
Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-14 06:44 -0700
Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-24 06:28 -0700
Re: hex dump w/ or w/out utf-8 chars Neil Hodgson <nhodgson@iinet.net.au> - 2013-07-14 09:17 +1000
Re: hex dump w/ or w/out utf-8 chars ferdy.blatsco@gmail.com - 2013-07-08 10:53 -0700
Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-09 04:07 +1000
Re: hex dump w/ or w/out utf-8 chars Dave Angel <davea@davea.name> - 2013-07-08 16:56 -0400
Re: hex dump w/ or w/out utf-8 chars Neil Cerutti <neilc@norwich.edu> - 2013-07-09 12:22 +0000
Re: hex dump w/ or w/out utf-8 chars Dave Angel <davea@davea.name> - 2013-07-09 08:54 -0400
Re: hex dump w/ or w/out utf-8 chars Neil Cerutti <neilc@norwich.edu> - 2013-07-09 13:00 +0000
Re: hex dump w/ or w/out utf-8 chars Skip Montanaro <skip@pobox.com> - 2013-07-09 08:18 -0500
Re: hex dump w/ or w/out utf-8 chars Dave Angel <davea@davea.name> - 2013-07-09 09:23 -0400
Re: hex dump w/ or w/out utf-8 chars MRAB <python@mrabarnett.plus.com> - 2013-07-08 22:38 +0100
Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-09 07:49 +1000
Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-09 06:53 +0000
Re: hex dump w/ or w/out utf-8 chars Joshua Landau <joshua.landau.ws@gmail.com> - 2013-07-08 23:02 +0100
Re: hex dump w/ or w/out utf-8 chars Dave Angel <davea@davea.name> - 2013-07-08 18:45 -0400
Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-09 08:51 +1000
Re: hex dump w/ or w/out utf-8 chars MRAB <python@mrabarnett.plus.com> - 2013-07-09 00:32 +0100
Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-09 06:46 +0000
Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-09 07:00 +0000
Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-09 02:34 -0700
Re: hex dump w/ or w/out utf-8 chars Chris “Kwpolska” Warrick <kwpolska@gmail.com> - 2013-07-09 12:15 +0200
Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-09 16:32 +0000
Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-10 01:52 -0700
Re: hex dump w/ or w/out utf-8 chars Joshua Landau <joshua@landau.ws> - 2013-07-12 23:01 +0100
Re: hex dump w/ or w/out utf-8 chars Tim Roberts <timr@probo.com> - 2013-07-12 20:42 -0700
Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-13 04:51 +0000
| From | blatt <ferdy.blatsco@gmail.com> |
|---|---|
| Date | 2013-07-07 17:22 -0700 |
| Subject | hex dump w/ or w/out utf-8 chars |
| Message-ID | <a35609c1-e56f-4180-8176-4405264da0a2@googlegroups.com> |
Hi all,
but a particular hello to Chris Angelino which with their critics and
suggestions pushed me to make a full revision of my application on
hex dump in presence of utf-8 chars.
If you are not using python 3, the utf-8 codec can add further programming
problems, especially if you are not a guru....
The script seems very long but I commented too much ... sorry.
It is very useful (at least IMHO...)
It works under Linux. but there is still a little problem which I didn't
solve (at least programmatically...).
# -*- coding: utf-8 -*-
# px.py vers. 11 (pxb.py) # python 2.6.6
# hex-dump w/ or w/out utf-8 chars
# Using spaces as separators, this script shows
# (better than tabnanny) uncorrect indentations.
# to save output > python pxb.py hex.txt > px9_out_hex.txt
nLenN=3 # n. of digits for lines
# version almost thoroughly rewritten on the ground of
# the critics and modifications suggested by Chris Angelico
# in the first version the utf-8 conversion to hex was shown horizontaly:
# 005 # qwerty: non è unicode bensì ascii
#     2 7767773 666 ca 7666666 6667ca 676660
#     3 175249a efe 38 5e93f45 25e33c 13399a
# ... but I had to insert additional chars to keep the
# synchronization between the literal and the hex part
# 005 # qwerty: non è. unicode bensì. ascii
#     2 7767773 666 ca 7666666 6667ca 676660
#     3 175249a efe 38 5e93f45 25e33c 13399a
# in the second version I followed Chris suggestion:
# "to show the hex utf-8 vertically"
# 005 # qwerty: non è unicode bensì ascii
#     2 7767773 666 c 7666666 6667c 676660
#     3 175249a efe 3 5e93f45 25e33 13399a
#                   a             a
#                   8             c
# between the two solutions, I selected the first one + syncronization,
# which seems more compact and easier to program (... I'm lazy...)
# various run options:
# std : python px.py file
# bash cat : cat file | python px.py (alias hex)
# bash echo: echo line | python px.py " "
# works on any n. of bytes for utf-8
# For the user: it is helpful to have in a separate file
# all special characters of interest, together with their names.
# error:
# echo '345"789"'|hex > 345"789" 345"789"
# 33323332 instead of 333233320
# 3452789 a " " 34527892a
# ... correction: avoiding "\n at end of test-line
# echo "345'789'"|hex > 345'789'
# 333233320
# 34577897a
# same error in every run option
# If someone can solve this bug...
###################
import fileinput
import sys, commands
lF=[] # input file as list
for line in fileinput.input(): # handles all the details of args-or-stdin
    lF.append(line)
sSpacesXLN = ' ' * (nLenN+1)
for n in xrange(len(lF)):
    sLineHexND=lF[n].encode('hex') # ND = no delimiter (space)
    sLineHex  =lF[n].encode('hex').replace('20','  ')
    sLineHexH =sLineHex[::2]
    sLineHexL =sLineHex[1::2]
    sSynchro=''
    for k in xrange(0,len(sLineHexND),2):
        if sLineHexND[k]<'8':
            sSynchro+= sLineHexND[k]+sLineHexND[k+1]
            k+=1
        elif sLineHexND[k]=='c':
            sSynchro+='c'+sLineHexND[k+1]+sLineHexND[k+2]+sLineHexND[k+3]+'2e'
            k+=3
        elif sLineHexND[k]=='e':
            sSynchro+='e'+sLineHexND[k+1]+sLineHexND[k+2]+sLineHexND[k+3]+\
                      sLineHexND[k+4]+sLineHexND[k+5]+'2e2e'
            k+=5
    # text output (synchroinized)
    print str(n+1).zfill(nLenN)+' '+sSynchro.decode('hex'),
    print sSpacesXLN + sLineHexH
    print sSpacesXLN + sLineHexL+ '\n'
If there are problems of understanding, probably due to fonts, the best
thing is import it in an editor with "mono" fonts...
As I already told to Chris... critics are welcome!
Bye, Blatt.
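For readers on Python 3, where `str.encode('hex')` no longer exists, the two nibble rows can be sketched directly from the bytes. This is only a minimal sketch and not the script above: the name `nibble_rows` is hypothetical, and blanking only the byte 0x20 is an assumption taken from the sample output in the comments.

```python
def nibble_rows(data):
    """Return (high, low) nibble rows for a bytes object.

    A space byte (0x20) becomes a blank column in both rows,
    mirroring the sample output in the script's comments.
    """
    high = []
    low = []
    for byte in data:  # iterating over bytes yields ints in Python 3
        if byte == 0x20:
            high.append(" ")
            low.append(" ")
        else:
            high.append("%x" % (byte >> 4))    # upper four bits
            low.append("%x" % (byte & 0x0F))   # lower four bits
    return "".join(high), "".join(low)


high, low = nibble_rows('345"789"\n'.encode("utf-8"))
print(high)
print(low)
```

Because each byte is compared as a whole, the `22` of a double quote followed by `0a` cannot be mistaken for a space.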
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-07-08 11:17 +1000 |
| Message-ID | <mailman.4362.1373246245.3114.python-list@python.org> |
| In reply to | #50110 |
On Mon, Jul 8, 2013 at 10:22 AM, blatt <ferdy.blatsco@gmail.com> wrote:
> Hi all,
> but a particular hello to Chris Angelino which with their critics and
> suggestions pushed me to make a full revision of my application on
> hex dump in presence of utf-8 chars.
Hiya! Glad to have been of assistance :)
> As I already told to Chris... critics are welcome!
No problem.
> # -*- coding: utf-8 -*-
> # px.py vers. 11 (pxb.py) # python 2.6.6
> # hex-dump w/ or w/out utf-8 chars
> # Using spaces as separators, this script shows
> # (better than tabnanny) uncorrect indentations.
>
> # to save output > python pxb.py hex.txt > px9_out_hex.txt
>
> nLenN=3 # n. of digits for lines
>
> # chomp heaps and heaps of comments
Little nitpick, since you did invite criticism :) When I went to copy
and paste your code, I skipped all the comments and started at the
line of hashes... and then didn't have the nLenN definition. Posting
code to a forum like this is a huge invitation to try the code (it's
the very easiest way to know what it does), so I would recommend
having all your comments at the top, and all the code in a block
underneath. It'd be that bit easier for us to help you. Not a big
deal, though, I did figure out what was going on :)
> sLineHex  =lF[n].encode('hex').replace('20','  ')
Here's the problem. Your hex string ends with "220a", and the
replace() method doesn't concern itself with the divisions between
bytes. It finds the second 2 of 22 and the leading 0 of 0a and
replaces them.
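The misaligned match can be reproduced in a few lines. The sketch below is written for Python 3 (an assumption, since the original script targets 2.6) and uses `binascii.hexlify` in place of `.encode('hex')`; `blank_spaces` is a hypothetical helper that compares whole byte pairs and replaces each `20` with two spaces.

```python
import binascii


def blank_spaces(data):
    """Blank out space bytes, looking only at byte-aligned hex pairs."""
    hex_text = binascii.hexlify(data).decode("ascii")
    pairs = [hex_text[i:i + 2] for i in range(0, len(hex_text), 2)]
    return "".join("  " if pair == "20" else pair for pair in pairs)


line = b' "\n'  # bytes 20 22 0a: a real space, a double quote, a newline

# Naive replace: also matches the "20" that straddles 22 and 0a.
naive = binascii.hexlify(line).decode("ascii").replace("20", "  ")

# Byte-aligned replace: only the real space is blanked.
aligned = blank_spaces(line)

print(repr(naive))
print(repr(aligned))
```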
I think the best solution may be to avoid the .encode('hex') part,
since it's not available in Python 3 anyway. Alternatively (if Py3
migration isn't a concern), you could do something like this:
sLineHexND=lF[n].encode('hex') # ND = no delimiter (space)
sLineHex  =sLineHexND # No reason to redo the encoding
twentypos=0
while True:
    twentypos=sLineHex.find("20",twentypos)
    if twentypos==-1: break # We've reached the end of the string
    if not twentypos%2: # It's at an even-numbered position, replace it
        sLineHex=sLineHex[:twentypos]+'  '+sLineHex[twentypos+2:]
    twentypos+=1
# then continue on as before
> sLineHexH =sLineHex[::2]
> sLineHexL =sLineHex[1::2]
> [ code continues ]
Hope that helps!
ChrisA
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-07-08 05:48 +0000 |
| Message-ID | <51da52c6$0$6512$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #50110 |
On Sun, 07 Jul 2013 17:22:26 -0700, blatt wrote:
> Hi all,
> but a particular hello to Chris Angelino which with their critics and
> suggestions pushed me to make a full revision of my application on hex
> dump in presence of utf-8 chars.
I don't understand what you are trying to say. All characters are UTF-8
characters. "a" is a UTF-8 character. So is "ă".
> If you are not using python 3, the utf-8 codec can add further
> programming problems,
On the contrary, I find that so long as you understand what you are doing
it solves problems, not adds them. However, if you are confused about the
difference between characters (text strings) and bytes, or if you are
dealing with arbitrary binary data and trying to treat it as if it were
UTF-8 encoded text, then you can have errors. Those errors are a good
thing.
> especially if you are not a guru.... The script
> seems very long but I commented too much ... sorry. It is very useful
> (at least IMHO...)
> It works under Linux. but there is still a little problem which I didn't
> solve (at least programmatically...).
>
>
> # -*- coding: utf-8 -*-
> # px.py vers. 11 (pxb.py)
> # python 2.6.6 # hex-dump w/ or w/out utf-8 chars
> # Using spaces as separators, this script shows
> # (better than tabnanny) uncorrect indentations.
The word you are looking for is "incorrect".
> # to save output > python pxb.py hex.txt > px9_out_hex.txt
>
> nLenN=3 # n. of digits for lines
>
> # version almost thoroughly rewritten on the ground of
> # the critics and modifications suggested by Chris Angelico
>
> # in the first version the utf-8 conversion to hex was shown
> horizontaly:
>
> # 005 # qwerty: non è unicode bensì ascii
> #     2 7767773 666 ca 7666666 6667ca 676660
> #     3 175249a efe 38 5e93f45 25e33c 13399a
Oh! We're supposed to read the output *downwards*! That's not very
intuitive. It took me a while to work that out. You should at least say
so.
> # ... but I had to insert additional chars to keep the
> # synchronization between the literal and the hex part
>
> # 005 # qwerty: non è. unicode bensì. ascii
> #     2 7767773 666 ca 7666666 6667ca 676660
> #     3 175249a efe 38 5e93f45 25e33c 13399a
Well that sucks, because now sometimes you have to read downwards
(character 'q' -> hex 71, reading downwards) and sometimes you read both
downwards and across (character 'è' -> hex c3a8). Sometimes a dot means a
dot and sometimes it means filler. How is the user supposed to know when
to read down and when across?
> # in the second version I followed Chris suggestion:
> # "to show the hex utf-8 vertically"
You're already showing UTF-8 characters vertically, if they happen to be
a one-byte character. Better to be consistent and always show characters
vertical, regardless of whether they are one, two or four bytes.
> # 005 # qwerty: non è unicode bensì ascii
> #     2 7767773 666 c 7666666 6667c 676660
> #     3 175249a efe 3 5e93f45 25e33 13399a
> #                   a             a
> #                   8             c
Much better! Now at least you can trivially read down the column to see
the bytes used for each character. As an alternative, you can space each
character to show the bytes horizontally, displaying spaces and other
invisible characters either as dots, backslash escapes, or Unicode
control pictures, whichever you prefer. The example below uses dots for
spaces and backslash escape for newline:
q  w  e  r  t  y  :  .  n  o  n  .  è     .  u  n  i
71 77 65 72 74 79 3a 20 6e 6f 6e 20 c3 a8 20 75 6e 69
c  o  d  e  .  b  e  n  s  ì     .  a  s  c  i  i  \n
63 6f 64 65 20 62 65 6e 73 c3 ac 20 61 73 63 69 69 0a
There will always be some ambiguity between (e.g.) dot representing a
dot, and it representing an invisible control character or space, but the
reader can always tell them apart by reading the hex value, which you
*always* read horizontally whether it is one byte, two or four. There's
never any confusion whether you should read down or across.
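The horizontal layout above can be produced with a short function. This is a sketch under assumptions rather than a finished tool: it targets Python 3, the name `horizontal_dump` is hypothetical, and it follows the example's conventions of a dot for spaces and other invisible characters and a backslash escape for newline.

```python
def horizontal_dump(text, encoding="utf-8"):
    """Return (chars_row, hex_row) with each character above its bytes."""
    top = []
    bottom = []
    for ch in text:
        hex_cell = " ".join("%02x" % b for b in ch.encode(encoding))
        if ch == "\n":
            label = "\\n"                 # backslash escape for newline
        elif ch == " " or not ch.isprintable():
            label = "."                   # dot for space and invisibles
        else:
            label = ch
        # Pad the label so a multi-byte character spans all of its bytes.
        top.append(label.ljust(len(hex_cell)))
        bottom.append(hex_cell)
    return " ".join(top).rstrip(), " ".join(bottom)


chars_row, hex_row = horizontal_dump("non è\n")
print(chars_row)
print(hex_row)
```

Padding each label to the width of its hex cell keeps the two rows the same length, so the hex for a character always starts directly below it.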
Unfortunately, most fonts don't support the Unicode control pictures. But
if you choose to use them, here they are, together with their Unicode
name. You can use the form
'\N{...}' # Python 3
u'\N{...}' # Python 2
to get the characters, replacing ... with the name shown below:
␀ SYMBOL FOR NULL
␁ SYMBOL FOR START OF HEADING
␂ SYMBOL FOR START OF TEXT
␃ SYMBOL FOR END OF TEXT
␄ SYMBOL FOR END OF TRANSMISSION
␅ SYMBOL FOR ENQUIRY
␆ SYMBOL FOR ACKNOWLEDGE
␇ SYMBOL FOR BELL
␈ SYMBOL FOR BACKSPACE
␉ SYMBOL FOR HORIZONTAL TABULATION
␊ SYMBOL FOR LINE FEED
␋ SYMBOL FOR VERTICAL TABULATION
␌ SYMBOL FOR FORM FEED
␍ SYMBOL FOR CARRIAGE RETURN
␎ SYMBOL FOR SHIFT OUT
␏ SYMBOL FOR SHIFT IN
␐ SYMBOL FOR DATA LINK ESCAPE
␑ SYMBOL FOR DEVICE CONTROL ONE
␒ SYMBOL FOR DEVICE CONTROL TWO
␓ SYMBOL FOR DEVICE CONTROL THREE
␔ SYMBOL FOR DEVICE CONTROL FOUR
␕ SYMBOL FOR NEGATIVE ACKNOWLEDGE
␖ SYMBOL FOR SYNCHRONOUS IDLE
␗ SYMBOL FOR END OF TRANSMISSION BLOCK
␘ SYMBOL FOR CANCEL
␙ SYMBOL FOR END OF MEDIUM
␚ SYMBOL FOR SUBSTITUTE
␛ SYMBOL FOR ESCAPE
␜ SYMBOL FOR FILE SEPARATOR
␝ SYMBOL FOR GROUP SEPARATOR
␞ SYMBOL FOR RECORD SEPARATOR
␟ SYMBOL FOR UNIT SEPARATOR
␠ SYMBOL FOR SPACE
␡ SYMBOL FOR DELETE
␢ BLANK SYMBOL
␣ OPEN BOX
␤ SYMBOL FOR NEWLINE
␥ SYMBOL FOR DELETE FORM TWO
␦ SYMBOL FOR SUBSTITUTE FORM TWO
(I wish more fonts would support these characters, they are very useful.)
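Since the pictures for the C0 control characters sit at U+2400 plus the control code, the substitution can be computed instead of looked up by name. A minimal sketch for Python 3; `control_picture` is a hypothetical helper name, and leaving the ordinary space untouched is a choice made here for illustration.

```python
import unicodedata


def control_picture(ch):
    """Map a control character to its Unicode control picture."""
    code = ord(ch)
    if code < 0x20:
        return chr(0x2400 + code)   # U+2400..U+241F mirror U+0000..U+001F
    if code == 0x7F:
        return "\u2421"             # SYMBOL FOR DELETE
    return ch                       # everything else is left as-is


visible = "".join(control_picture(c) for c in "a\tb\n")
print(visible)
print(unicodedata.name(control_picture("\t")))
```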
[...]
> # works on any n. of bytes for utf-8
>
> # For the user: it is helpful to have in a separate file
> # all special characters of interest, together with their names.
In Python, you can use the unicodedata module to look up characters by
name, or given the character, find out what its name is.
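Both directions are one call each. A small sketch, assuming Python 3 string literals (in Python 2 the arguments would need to be `u'...'` strings):

```python
import unicodedata

# Character -> name
name = unicodedata.name("è")

# Name -> character
char = unicodedata.lookup("LATIN SMALL LETTER I WITH GRAVE")

print(name)
print(char)
```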
[...]
> import fileinput
> import sys, commands
>
> lF=[] # input file as list
> for line in fileinput.input(): # handles all the details of args-or-stdin
>     lF.append(line)
That is more easily written as:
lF = list(fileinput.input())
and better written with a meaningful file name. Whenever you have a
variable, and find the need to give a comment explaining what the
variable name means, you should consider a more descriptive name.
When that name is a cryptic two letter name, that goes double.
> sSpacesXLN = ' ' * (nLenN+1)
>
>
> for n in xrange(len(lF)):
>     sLineHexND=lF[n].encode('hex') # ND = no delimiter (space)
You're programming like a Pascal or C programmer. There is nearly never
any need to write code like that in Python. Rather than iterate over the
indexes, then extract the part you want, it is better to iterate directly
over the parts you want:
for line in lF:
    sLineHexND = line.encode('hex')
>     sLineHex  =lF[n].encode('hex').replace('20','  ')
>     sLineHexH =sLineHex[::2]
>     sLineHexL =sLineHex[1::2]
Trying to keep code lined up in this way is a bad habit to get into. It
just sets you up for many hours of unproductive adding and deleting
spaces trying to keep things aligned.
Also, what on earth are all these "s" prefixes?
>     sSynchro=''
>     for k in xrange(0,len(sLineHexND),2):
Probably the best way to walk through a string, grabbing the characters
in pairs, comes from the itertools module: see the recipe for "grouper".
http://docs.python.org/2/library/itertools.html
Here is a simplified version:
assert len(line) % 2 == 0
for pair in zip(*([iter(line)]*2)):
    ...
although understanding how it works requires a little advanced knowledge.
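The trick is that both arguments to `zip` are the same iterator, so each tuple consumes two consecutive items. Spelled out as a sketch (the name `pairs` is hypothetical, and unlike the itertools recipe it silently drops a trailing odd item):

```python
def pairs(seq):
    """Yield consecutive, non-overlapping pairs from seq."""
    it = iter(seq)
    # zip pulls from `it` twice per tuple: first item, then the next one.
    return zip(it, it)


print(list(pairs("c3a8")))
```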
>         if sLineHexND[k]<'8':
>             sSynchro+= sLineHexND[k]+sLineHexND[k+1]
>             k+=1
>         elif sLineHexND[k]=='c':
>             sSynchro+='c'+sLineHexND[k+1]+sLineHexND[k+2]+sLineHexND[k+3]+'2e'
>             k+=3
>         elif sLineHexND[k]=='e':
>             sSynchro+='e'+sLineHexND[k+1]+sLineHexND[k+2]+sLineHexND[k+3]+\
>                       sLineHexND[k+4]+sLineHexND[k+5]+'2e2e'
>             k+=5
Apart from being hideously ugly to read, I do not believe this code works
the way you think it works. Adding to the loop variable doesn't advance
the loop. Try this and see for yourself:
for i in range(10):
    print(i)
    i += 5
The loop variable just gets reset once it reaches the top of the loop
again.
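When the step really does depend on the data, a `while` loop with an explicit index does what the `for` loop cannot. The following is a sketch for Python 3 `bytes`, not a drop-in replacement for the quoted code; `utf8_sequence_lengths` is a hypothetical name, and it trusts the lead byte without validating the continuation bytes.

```python
def utf8_sequence_lengths(data):
    """Return the byte length of each UTF-8 sequence in data."""
    lengths = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:              # 0xxxxxxx: ASCII, one byte
            n = 1
        elif lead >> 5 == 0b110:     # 110xxxxx: two-byte sequence
            n = 2
        elif lead >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
            n = 3
        elif lead >> 3 == 0b11110:   # 11110xxx: four-byte sequence
            n = 4
        else:
            raise ValueError("unexpected lead byte at offset %d" % i)
        lengths.append(n)
        i += n                       # this assignment does advance the loop
    return lengths


print(utf8_sequence_lengths("aè€\U0001d11e".encode("utf-8")))
```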
--
Steven
| From | ferdy.blatsco@gmail.com |
|---|---|
| Date | 2013-07-08 10:31 -0700 |
| Message-ID | <7ef8c0e7-7f7c-4a22-89a9-50f62c4a8064@googlegroups.com> |
| In reply to | #50110 |
Hi Chris, glad to have received your contribution, but I was expecting much more critics...
Starting from the "little nitpick" about the comment dispositon in my script... you are correct... It is a bad habit on my part to place variables subjected to change at the beginning of the script... and then forget it...
About the mistake due to replace, you gave me a perfect explanation.
Unfortunately (as probably I told you before) I will never pass to
Python 3... Guido should not always listen only to gurus like him...
I don't like Python as before...starting from OOP and ending with codecs
like utf-8. Regarding OOP, much appreciated expecially by experts, he
could use python 2 for hiding the complexities of OOP (improving, as an
effect, object's code hiding) moving classes and objects to
imported methods, leaving in this way the programming style to the
well known old style: sequential programming and functions.
About utf-8... the same solution: keep utf-8 but for the non experts, add
methods to convert to solutions which use the range 128-255 of only one
byte (I do not give a damn about chinese and "similia"!...)
I know that is a lost battle (in italian "una battaglia persa")!
Bye, Blatt
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-07-09 03:52 +1000 |
| Message-ID | <mailman.4391.1373305945.3114.python-list@python.org> |
| In reply to | #50162 |
On Tue, Jul 9, 2013 at 3:31 AM, <ferdy.blatsco@gmail.com> wrote:
> Unfortunately (as probably I told you before) I will never pass to
> Python 3... Guido should not always listen only to gurus like him...
> I don't like Python as before...starting from OOP and ending with codecs
> like utf-8. Regarding OOP, much appreciated expecially by experts, he
> could use python 2 for hiding the complexities of OOP (improving, as an
> effect, object's code hiding) moving classes and objects to
> imported methods, leaving in this way the programming style to the
> well known old style: sequential programming and functions.
> About utf-8... the same solution: keep utf-8 but for the non experts, add
> methods to convert to solutions which use the range 128-255 of only one
> byte (I do not give a damn about chinese and "similia"!...)
> I know that is a lost battle (in italian "una battaglia persa")!
Well, there won't be a Python 2.8, so you really should consider
moving at some point. Python 3.3 is already way better than 2.7 in
many ways, 3.4 will improve on 3.3, and the future is pretty clear.
But nobody's forcing you, and 2.7.x will continue to get
bugfix/security releases for a while. (Personally, I'd be happy if
everyone moved off the 2.3/2.4 releases. It's not too hard supporting
2.6+ or 2.7+.)
The thing is, you're thinking about UTF-8, but you should be thinking
about Unicode. I recommend you read these articles:
http://www.joelonsoftware.com/articles/Unicode.html
http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/
So long as you are thinking about different groups of characters as
different, and wanting a solution that maps characters down into the
<256 range, you will never be able to cleanly internationalize. With
Python 3.3+, you can ignore the differences between ASCII, BMP, and
SMP characters; they're all just "characters". Everything works
perfectly with Unicode.
ChrisA
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-07-11 06:18 -0700 |
| Message-ID | <a3a4aa9b-3a5c-42cd-9a04-4c02f962b71e@googlegroups.com> |
| In reply to | #50164 |
On Monday, 8 July 2013 19:52:17 UTC+2, Chris Angelico wrote:
> On Tue, Jul 9, 2013 at 3:31 AM, <ferdy.blatsco@gmail.com> wrote:
> > Unfortunately (as probably I told you before) I will never pass to
> > Python 3... Guido should not always listen only to gurus like him...
> > I don't like Python as before...starting from OOP and ending with codecs
> > like utf-8. Regarding OOP, much appreciated expecially by experts, he
> > could use python 2 for hiding the complexities of OOP (improving, as an
> > effect, object's code hiding) moving classes and objects to
> > imported methods, leaving in this way the programming style to the
> > well known old style: sequential programming and functions.
> > About utf-8... the same solution: keep utf-8 but for the non experts, add
> > methods to convert to solutions which use the range 128-255 of only one
> > byte (I do not give a damn about chinese and "similia"!...)
> > I know that is a lost battle (in italian "una battaglia persa")!
> Well, there won't be a Python 2.8, so you really should consider
> moving at some point. Python 3.3 is already way better than 2.7 in
> many ways, 3.4 will improve on 3.3, and the future is pretty clear.
> But nobody's forcing you, and 2.7.x will continue to get
> bugfix/security releases for a while. (Personally, I'd be happy if
> everyone moved off the 2.3/2.4 releases. It's not too hard supporting
> 2.6+ or 2.7+.)
> The thing is, you're thinking about UTF-8, but you should be thinking
> about Unicode. I recommend you read these articles:
> http://www.joelonsoftware.com/articles/Unicode.html
> http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/
> So long as you are thinking about different groups of characters as
> different, and wanting a solution that maps characters down into the
> <256 range, you will never be able to cleanly internationalize. With
> Python 3.3+, you can ignore the differences between ASCII, BMP, and
> SMP characters; they're all just "characters". Everything works
> perfectly with Unicode.
-----------
Just to stick with this funny character ẞ, a ucs-2 char
in the Flexible String Representation nomenclature.
It seems to me that, when one needs more than ten bytes
to encode it,
>>> sys.getsizeof('a')
26
>>> sys.getsizeof('ẞ')
40
this is far away from the perfection.
BTW, for a modern language, is not ucs2 considered
as obsolete since many, many years?
jmf
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-07-11 23:32 +1000 |
| Message-ID | <mailman.4585.1373549528.3114.python-list@python.org> |
| In reply to | #50443 |
On Thu, Jul 11, 2013 at 11:18 PM, <wxjmfauth@gmail.com> wrote:
> Just to stick with this funny character ẞ, a ucs-2 char
> in the Flexible String Representation nomenclature.
>
> It seems to me that, when one needs more than ten bytes
> to encode it,
>
>>>> sys.getsizeof('a')
> 26
>>>> sys.getsizeof('ẞ')
> 40
>
> this is far away from the perfection.
Better comparison is to see how much space is used by one copy of it,
and how much by two copies:
>>> sys.getsizeof('aa')-sys.getsizeof('a')
1
>>> sys.getsizeof('ẞẞ')-sys.getsizeof('ẞ')
2
String objects have overhead. Big deal.
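The same measurement can be wrapped in a function to separate the fixed object overhead from the per-character cost. A sketch assuming CPython 3.3 or later, where the flexible string representation stores 1, 2, or 4 bytes per character; `per_char_cost` is a hypothetical name, and absolute `sys.getsizeof` values differ between versions and builds.

```python
import sys


def per_char_cost(ch):
    """Bytes added by one more copy of ch; the fixed overhead cancels out."""
    return sys.getsizeof(ch * 2) - sys.getsizeof(ch)


for ch in ("a", "\u1e9e", "\U0001d11e"):
    print("U+%04X" % ord(ch), per_char_cost(ch))
```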
> BTW, for a modern language, is not ucs2 considered
> as obsolete since many, many years?
Clearly. And similarly, the 16-bit integer has been completely
obsoleted, as there is no reason anyone should ever bother to use it.
Same with the float type - everyone uses double or better these days,
right?
http://www.postgresql.org/docs/current/static/datatype-numeric.html
http://www.cplusplus.com/doc/tutorial/variables/
Nope, nobody uses small integers any more, they're clearly completely obsolete.
ChrisA
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-07-11 11:42 -0700 |
| Message-ID | <26d5c832-eaa1-439e-af61-e2855af2cd18@googlegroups.com> |
| In reply to | #50444 |
On Thursday, 11 July 2013 15:32:00 UTC+2, Chris Angelico wrote:
> On Thu, Jul 11, 2013 at 11:18 PM, <wxjmfauth@gmail.com> wrote:
> > Just to stick with this funny character ẞ, a ucs-2 char
> > in the Flexible String Representation nomenclature.
> > It seems to me that, when one needs more than ten bytes
> > to encode it,
> >>>> sys.getsizeof('a')
> > 26
> >>>> sys.getsizeof('ẞ')
> > 40
> > this is far away from the perfection.
> Better comparison is to see how much space is used by one copy of it,
> and how much by two copies:
> >>> sys.getsizeof('aa')-sys.getsizeof('a')
> 1
> >>> sys.getsizeof('ẞẞ')-sys.getsizeof('ẞ')
> 2
> String objects have overhead. Big deal.
> > BTW, for a modern language, is not ucs2 considered
> > as obsolete since many, many years?
> Clearly. And similarly, the 16-bit integer has been completely
> obsoleted, as there is no reason anyone should ever bother to use it.
> Same with the float type - everyone uses double or better these days,
> right?
> http://www.postgresql.org/docs/current/static/datatype-numeric.html
> http://www.cplusplus.com/doc/tutorial/variables/
> Nope, nobody uses small integers any more, they're clearly completely obsolete.
Sure there is some overhead because a str is a class.
It still remain that a "ẞ" weights 14 bytes more than
an "a".
In "aẞ", the ẞ weights 6 bytes.
>>> sys.getsizeof('a')
26
>>> sys.getsizeof('aẞ')
42
and in "aẞẞ", the ẞ weights 2 bytes
sys.getsizeof('aẞẞ')
And what to say about this "ucs4" char/string '\U0001d11e' which
is weighting 18 bytes more than an "a".
>>> sys.getsizeof('\U0001d11e')
44
A total absurdity. How does is come? Very simple, once you
split Unicode in subsets, not only you have to handle these
subsets, you have to create "markers" to differentiate them.
Not only, you produce "markers", you have to handle the
mess generated by these "markers". Hiding this markers
in the everhead of the class does not mean that they should
not be counted as part of the coding scheme. BTW, since
when a serious coding scheme need an extermal marker?
>>> sys.getsizeof('aa') - sys.getsizeof('a')
1
Shortly, if my algebra is still correct:
(overhead + marker + 2*'a') - (overhead + marker + 'a')
= (overhead + marker + 2*'a') - overhead - marker - 'a'
= overhead - overhead + marker - marker + 2*'a' - 'a'
= 0 + 0 + 'a'
= 1
The "marker" has magically disappeared.
jmf
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-07-11 11:44 -0700 |
| Message-ID | <d9ebccf8-050c-4dcc-ad51-ff50868a1287@googlegroups.com> |
| In reply to | #50473 |
On Thursday, 11 July 2013 20:42:26 UTC+2, wxjm...@gmail.com wrote:
> On Thursday, 11 July 2013 15:32:00 UTC+2, Chris Angelico wrote:
> > On Thu, Jul 11, 2013 at 11:18 PM, <wxjmfauth@gmail.com> wrote:
> > > Just to stick with this funny character ẞ, a ucs-2 char
> > > in the Flexible String Representation nomenclature.
> > > It seems to me that, when one needs more than ten bytes
> > > to encode it,
> > >>>> sys.getsizeof('a')
> > > 26
> > >>>> sys.getsizeof('ẞ')
> > > 40
> > > this is far away from the perfection.
> > Better comparison is to see how much space is used by one copy of it,
> > and how much by two copies:
> > >>> sys.getsizeof('aa')-sys.getsizeof('a')
> > 1
> > >>> sys.getsizeof('ẞẞ')-sys.getsizeof('ẞ')
> > 2
> > String objects have overhead. Big deal.
> > > BTW, for a modern language, is not ucs2 considered
> > > as obsolete since many, many years?
> > Clearly. And similarly, the 16-bit integer has been completely
> > obsoleted, as there is no reason anyone should ever bother to use it.
> > Same with the float type - everyone uses double or better these days,
> > right?
> > http://www.postgresql.org/docs/current/static/datatype-numeric.html
> > http://www.cplusplus.com/doc/tutorial/variables/
> > Nope, nobody uses small integers any more, they're clearly completely obsolete.
> Sure there is some overhead because a str is a class.
> It still remain that a "ẞ" weights 14 bytes more than
> an "a".
> In "aẞ", the ẞ weights 6 bytes.
> >>> sys.getsizeof('a')
> 26
> >>> sys.getsizeof('aẞ')
> 42
> and in "aẞẞ", the ẞ weights 2 bytes
> sys.getsizeof('aẞẞ')
> And what to say about this "ucs4" char/string '\U0001d11e' which
> is weighting 18 bytes more than an "a".
> >>> sys.getsizeof('\U0001d11e')
> 44
> A total absurdity. How does is come? Very simple, once you
> split Unicode in subsets, not only you have to handle these
> subsets, you have to create "markers" to differentiate them.
> Not only, you produce "markers", you have to handle the
> mess generated by these "markers". Hiding this markers
> in the everhead of the class does not mean that they should
> not be counted as part of the coding scheme. BTW, since
> when a serious coding scheme need an extermal marker?
> >>> sys.getsizeof('aa') - sys.getsizeof('a')
> 1
> Shortly, if my algebra is still correct:
> (overhead + marker + 2*'a') - (overhead + marker + 'a')
> = (overhead + marker + 2*'a') - overhead - marker - 'a'
> = overhead - overhead + marker - marker + 2*'a' - 'a'
> = 0 + 0 + 'a'
> = 1
> The "marker" has magically disappeared.
> jmf
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-07-12 03:18 +0000 |
| Message-ID | <51df7593$0$9505$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #50473 |
On Thu, 11 Jul 2013 11:42:26 -0700, wxjmfauth wrote:

> And what to say about this "ucs4" char/string '\U0001d11e' which is
> weighing 18 bytes more than an "a".
>
> >>> sys.getsizeof('\U0001d11e')
> 44
>
> A total absurdity.

You should stick to Python 3.1 and 3.2 then:

py> import sys
py> print(sys.version)
3.1.3 (r313:86834, Nov 28 2010, 11:28:10)
[GCC 4.4.5]
py> sys.getsizeof('\U0001d11e')
36
py> sys.getsizeof('a')
36

Now all your strings will be just as heavy, every single variable name
and attribute name will use four times as much memory. Happy now?
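
For anyone who wants to check the per-character cost rather than the total object size, here is a minimal sketch. It assumes CPython 3.3 or later (where the flexible representation applies); the helper name `per_char_cost` is made up for illustration and is not part of any library. It measures how many bytes each additional copy of a character adds, so the fixed object overhead cancels out.

```python
import sys

def per_char_cost(ch, n=1000):
    """Average number of bytes added per extra copy of the one-character string ch.

    Hypothetical helper: subtracting the size of the one-character string
    removes the fixed per-object overhead from the measurement.
    """
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n

# One ASCII character, one BMP character (capital sharp s), one astral character.
for label, ch in [("ASCII", "a"), ("BMP", "\u1e9e"), ("astral", "\U0001d11e")]:
    print(label, per_char_cost(ch))
```

The absolute values returned by `sys.getsizeof` differ between Python versions and platforms, so only the differences are compared here.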

> How does it come? Very simple, once you split Unicode
> in subsets, not only you have to handle these subsets, you have to
> create "markers" to differentiate them. Not only, you produce "markers",
> you have to handle the mess generated by these "markers". Hiding these
> markers in the overhead of the class does not mean that they should not
> be counted as part of the coding scheme. BTW, since when a serious
> coding scheme need an external marker?

Since always.

How do you think that (say) a C compiler can tell the difference between
the long 1199876496 and the float 67923.125? They both have exactly the
same four bytes:

py> import struct
py> struct.pack('f', 67923.125)
b'\x90\xa9\x84G'
py> struct.pack('l', 1199876496)
b'\x90\xa9\x84G'

*Everything* in a computer is bytes. The only way to tell them apart is
by external markers.
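
The same point can be shown in the other direction, as a small sketch: take one four-byte string and decode it twice. The format codes `'<f'` and `'<i'` (little-endian, fixed four-byte float and int) are a substitution made here for portability, since the native `'l'` format is eight bytes on many 64-bit platforms; they are not what the transcript above used.

```python
import struct

# Four bytes, produced here by packing a float as a little-endian IEEE 754 single.
raw = struct.pack('<f', 67923.125)

# The bytes carry no type information; the format code is the external marker.
as_float = struct.unpack('<f', raw)[0]
as_int = struct.unpack('<i', raw)[0]

print(raw, as_float, as_int)
```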
--
Steven
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-07-12 14:42 -0700 |
| Message-ID | <5f8322b5-56f1-4dda-9dae-203453eb62b8@googlegroups.com> |
| In reply to | #50487 |
On Friday, 12 July 2013 05:18:44 UTC+2, Steven D'Aprano wrote:
> On Thu, 11 Jul 2013 11:42:26 -0700, wxjmfauth wrote:
>
> Now all your strings will be just as heavy, every single variable name
> and attribute name will use four times as much memory. Happy now?

--------

>>> 㑖 = 999
>>> class C:
...     cœur = 'heart'
...

- Why always this magic number "four"?
- Are you able to think once non-ascii?
- Have you once had the feeling to be penalized, because you are using
  fonts with OpenType technology?
- Have you once had problems with pdf? I can tell you, utf32 is peanuts
  compared to the used CID-font you are using.
- Did you toy once with a unicode TeX engine?
- Did you take a look at a rendering engine code like HarfBuzz?

jmf
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-07-12 12:16 +1000 |
| Message-ID | <mailman.4619.1373613834.3114.python-list@python.org> |
| In reply to | #50473 |
On Fri, Jul 12, 2013 at 4:42 AM, <wxjmfauth@gmail.com> wrote:
> BTW, since
> when a serious coding scheme need an external marker?

All of them.

Content-type: text/plain; charset=UTF-8

ChrisA
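
A minimal sketch of what that header does in practice, assuming a receiver that reads the charset from the header before decoding. The payload and the use of `email.message.Message` are illustrative choices made here, not something taken from the post.

```python
from email.message import Message

# The header is the external marker; the payload is just bytes.
msg = Message()
msg['Content-Type'] = 'text/plain; charset=UTF-8'
payload = 'ß'.encode('utf-8')   # two bytes on the wire

charset = msg.get_content_charset()   # charset parameter, lower-cased
right = payload.decode(charset)       # decoded as the header declares
wrong = payload.decode('latin-1')     # same bytes, different assumed charset

print(charset, repr(right), repr(wrong))
```

Without the header, nothing in the two bytes says which of the two readings was intended.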
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-07-13 00:56 -0700 |
| Message-ID | <2165f7cc-b9bf-41b1-b128-d33b522046dc@googlegroups.com> |
| In reply to | #50504 |
On Friday, 12 July 2013 04:16:21 UTC+2, Chris Angelico wrote:
> On Fri, Jul 12, 2013 at 4:42 AM, <wxjmfauth@gmail.com> wrote:
> > BTW, since
> > when a serious coding scheme need an external marker?
>
> All of them.
>
> Content-type: text/plain; charset=UTF-8
>
> ChrisA

------

No one. You are confusing the knowledge of a coding scheme and the
intrinsic information a "coding scheme" *may* have, in a mandatory
way, to work properly. These are conceptually two different things.

I am convinced you are not conceptually understanding utf-8
very well. I wrote many times, "utf-8 does not produce
bytes, but Unicode Encoding Units".

A similar coding scheme: iso-6937 .

Try to write an editor, a text widget, with a coding
scheme like the Flexible String Representation. You will
quickly notice, it is impossible (understand correctly).
(You do not need a computer, just a sheet of paper and a pencil)
Hint: what is the character at the caret position?

jmf
| From | Lele Gaifax <lele@metapensiero.it> |
|---|---|
| Date | 2013-07-13 10:24 +0200 |
| Message-ID | <mailman.4674.1373703883.3114.python-list@python.org> |
| In reply to | #50581 |
wxjmfauth@gmail.com writes:

> Try to write an editor, a text widget, with a coding
> scheme like the Flexible String Representation. You will
> quickly notice, it is impossible (understand correctly).
> (You do not need a computer, just a sheet of paper and a pencil)
> Hint: what is the character at the caret position?

I am convinced you are not conceptually understanding FSR very well.

Alternatively, you may have a strange notion of “impossible”.

Or both.

ciao, lele.
--
nickname: Lele Gaifax | When I live off what I thought yesterday,
real: Emanuele Gaifas | I will begin to fear those who copy me.
lele@metapensiero.it  |                 -- Fortunato Depero, 1929.
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-07-13 09:36 +0000 |
| Message-ID | <51e11f85$0$9505$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #50581 |
On Sat, 13 Jul 2013 00:56:52 -0700, wxjmfauth wrote:

> I am convinced you are not conceptually understanding utf-8 very well. I
> wrote many times, "utf-8 does not produce bytes, but Unicode Encoding
> Units".

Just because you write it many times, doesn't make it correct. You are
simply wrong. UTF-8 produces bytes. That's what gets written to files and
transmitted over networks, bytes, not "Unicode Encoding Units", whatever
they are.

> A similar coding scheme: iso-6937 .
>
> Try to write an editor, a text widget, with a coding scheme like
> the Flexible String Representation. You will quickly notice, it is
> impossible (understand correctly). (You do not need a computer, just a
> sheet of paper and a pencil) Hint: what is the character at the caret
> position?

That is a simple index operation into the buffer. If the caret position
is 10 characters in, you index buffer[10-1] and it will give you the
character to the left of the caret. buffer[10] will give you the
character to the right of the caret. It is simple, trivial, and easy. The
buffer itself knows whether to look ahead 10 bytes, 10*2 bytes or 10*4
bytes.

Here is an example of such a tiny buffer, implemented in Python 3.3 with
the hated Flexible String Representation. In each example, imagine the
caret is five characters from the left:

    12345|more characters here...

It works regardless of whether your characters are ASCII:

py> buffer = '12345ABCD...'
py> buffer[5-1]  # character to the left of the caret
'5'
py> buffer[5]  # character to the right of the caret
'A'

Latin 1:

py> buffer = '12345áßçð...'
py> buffer[5-1]  # character to the left of the caret
'5'
py> buffer[5]  # character to the right of the caret
'á'

Other BMP characters:

py> buffer = '12345αдᚪ∞...'
py> buffer[5-1]  # character to the left of the caret
'5'
py> buffer[5]  # character to the right of the caret
'α'

And Supplementary Plane Characters:

py> import unicodedata
py> buffer = ('12345'
...           '\N{ALCHEMICAL SYMBOL FOR AIR}'
...           '\N{ALCHEMICAL SYMBOL FOR FIRE}'
...           '\N{ALCHEMICAL SYMBOL FOR EARTH}'
...           '\N{ALCHEMICAL SYMBOL FOR WATER}'
...           '...')
py> buffer
'12345🜁🜂🜃🜄...'
py> len(buffer)
12
py> buffer[5-1]  # character to the left of the caret
'5'
py> buffer[5]  # character to the right of the caret
'🜁'
py> unicodedata.name(buffer[5])
'ALCHEMICAL SYMBOL FOR AIR'

And it all Just Works in Python 3.3. So much for "impossible to tell"
what the character at the caret is. It is *trivial*.
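
The indexing described above can be wrapped in a small function. This is only a sketch, assuming Python 3.3 or later; the name `around_caret` and the choice to return an empty string at either end of the buffer are illustrative assumptions, not anything from an existing editor.

```python
import unicodedata

def around_caret(buffer, caret):
    """Return (left, right): the characters on either side of the caret.

    `caret` counts characters from the start of the buffer. An empty
    string stands in for "nothing there" at the ends (an arbitrary choice).
    """
    left = buffer[caret - 1] if caret > 0 else ''
    right = buffer[caret] if caret < len(buffer) else ''
    return left, right

buffer = ('12345'
          '\N{ALCHEMICAL SYMBOL FOR AIR}'
          '\N{ALCHEMICAL SYMBOL FOR FIRE}'
          '...')
left, right = around_caret(buffer, 5)
print(left, unicodedata.name(right))
```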

Ah, but how about Python 3.2? We set up the same buffer:

py> buffer = ('12345'
...           '\N{ALCHEMICAL SYMBOL FOR AIR}'
...           '\N{ALCHEMICAL SYMBOL FOR FIRE}'
...           '\N{ALCHEMICAL SYMBOL FOR EARTH}'
...           '\N{ALCHEMICAL SYMBOL FOR WATER}'
...           '...')
py> buffer
'12345🜁🜂🜃🜄...'
py> len(buffer)
16

Sixteen? Sixteen? Where did the extra four characters come from? They
came from *surrogate pairs*.

py> buffer[5-1]  # character to the left of the caret
'5'
py> buffer[5]  # character to the right of the caret
'\ud83d'

Funny, that looks different.

py> unicodedata.name(buffer[5])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name

No name?

Because buffer[5] is only *half* of the surrogate pair. It is broken, and
there is really no way of fixing that breakage in Python 3.2 with a
narrow build. You can fix it with a wide build, but only at the cost of
every string, every name, using double the amount of storage, whether it
needs it or not.
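
Where the sixteen and the '\ud83d' come from can be reproduced on any modern Python by doing the UTF-16 arithmetic by hand. This is a sketch only; `to_surrogates` and `utf16_length` are hypothetical helper names introduced here, and the code emulates what a narrow build stored rather than running on one.

```python
def to_surrogates(ch):
    """Return the UTF-16 code units for one character as a tuple of ints."""
    cp = ord(ch)
    if cp < 0x10000:
        return (cp,)                      # BMP: a single code unit
    cp -= 0x10000                         # 20 bits remain
    return (0xD800 + (cp >> 10),          # high surrogate: top 10 bits
            0xDC00 + (cp & 0x3FF))        # low surrogate: bottom 10 bits

def utf16_length(text):
    """Number of 16-bit code units, i.e. what len() reported on a narrow build."""
    return len(text.encode('utf-16-le')) // 2

buffer = '12345\U0001F701\U0001F702\U0001F703\U0001F704...'
print(len(buffer), utf16_length(buffer))
print([hex(unit) for unit in to_surrogates('\U0001F701')])
```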
--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-07-13 19:46 +1000 |
| Message-ID | <mailman.4676.1373708809.3114.python-list@python.org> |
| In reply to | #50581 |
On Sat, Jul 13, 2013 at 5:56 PM, <wxjmfauth@gmail.com> wrote:
> Try to write an editor, a text widget, with a coding
> scheme like the Flexible String Representation. You will
> quickly notice, it is impossible (understand correctly).
> (You do not need a computer, just a sheet of paper and a pencil)
> Hint: what is the character at the caret position?

I would use an internal representation that allows insertion and
deletion - in its simplest form, a list of strings. And those strings
would be whatever I can most conveniently work with.

I've never built a text editor widget, because my libraries always
provide them. But there is a rough parallel in the display storage for
Gypsum, which stores a series of lines, each of which is a series of
sections in different colors. (A line might be a single section, ie
one color for its whole length.) I store them in arrays of (color,
string, color, string, color, string...). The strings I use are in the
format wanted by my display subsystem - which in my case is the native
string type of the language, which... oh, what a pity for jmf, is a
flexible object that uses 8, 16, or 32 bits for each character.

ChrisA
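
The "list of strings" idea can be sketched in a few lines of Python. Everything here is hypothetical: the class name `LineBuffer`, its methods, and the (row, col) addressing are illustrative assumptions, and this is not how Gypsum or any particular widget stores its text.

```python
class LineBuffer:
    """Toy editor buffer: one native str per line (illustrative only)."""

    def __init__(self, text=''):
        self.lines = text.split('\n')

    def insert(self, row, col, text):
        """Insert text at (row, col); embedded newlines split the line."""
        line = self.lines[row]
        pieces = (line[:col] + text + line[col:]).split('\n')
        self.lines[row:row + 1] = pieces

    def delete(self, row, col, count=1):
        """Delete `count` characters starting at (row, col) within one line."""
        line = self.lines[row]
        self.lines[row] = line[:col] + line[col + count:]

    def char_at(self, row, col):
        """Character to the right of a caret sitting at (row, col)."""
        return self.lines[row][col]

buf = LineBuffer('12345\U0001F701\nsecond')
buf.insert(0, 5, 'αβ')
print(buf.lines[0], buf.char_at(0, 7))
```

Each line is an ordinary string, so the caret question reduces to plain indexing regardless of how wide the characters are internally.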
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-07-13 09:49 +0000 |
| Message-ID | <51e12295$0$9505$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #50581 |
On Sat, 13 Jul 2013 00:56:52 -0700, wxjmfauth wrote:

> You are confusing the knowledge of a coding scheme and the intrinsic
> information a "coding scheme" *may* have, in a mandatory way, to work
> properly. These are conceptually two different things.

*May* have, in a *mandatory* way?

JMF, I know you are not a native English speaker, so you might not be
aware just how silly your statement is. If it *may* have, it is optional,
since it *may not* have instead. But if it is optional, it is not
mandatory.

You are making so much fuss over such a simple, obvious implementation
for strings. The language Pike has done the same thing for probably a
decade or so.

Ironically, Python has done the same thing for integers for many versions
too. They just didn't call it "Flexible Integer Representation", but
that's what it is. For integers smaller than 2**31, they are stored as C
longs (plus object overhead). For integers larger than 2**31, they are
promoted to a BigNum implementation that can handle unlimited digits.

Using Python 2.7, where it is more obvious because the BigNum has an L
appended to the display, and a different type:

py> for n in (1, 2**20, 2**30, 2**31, 2**65):
...     print repr(n), type(n), sys.getsizeof(n)
...
1 <type 'int'> 12
1048576 <type 'int'> 12
1073741824 <type 'int'> 12
2147483648L <type 'long'> 18
36893488147419103232L <type 'long'> 22

You have been using Flexible Integer Representation for *years*, and it
works great, and you've never noticed any problems.

--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-07-13 20:09 +1000 |
| Message-ID | <mailman.4677.1373710147.3114.python-list@python.org> |
| In reply to | #50588 |
On Sat, Jul 13, 2013 at 7:49 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> Ironically, Python has done the same thing for integers for many versions
> too. They just didn't call it "Flexible Integer Representation", but
> that's what it is. For integers smaller than 2**31, they are stored as C
> longs (plus object overhead). For integers larger than 2**31, they are
> promoted to a BigNum implementation that can handle unlimited digits.

Hmm. That's true of Python 2 (mostly - once an operation yields a
long, it never reverts to int, whereas a string will shrink if you
remove the wider characters from it), but not, I think, of Python 3.
The optimization isn't there any more. At least, I did some tinkering
a while ago (on 3.2, I think), so maybe it's been reinstated since. As
of Python 3 and the unification of types, it's definitely possible to
put that in as a pure optimization, anyhow.

ChrisA
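
For comparison, the same loop can be run on Python 3, where there is a single `int` type. This sketch assumes CPython; the byte counts it prints depend on the version and platform, so none are quoted here. What it does show is that the type stays `int` while the storage grows with the magnitude of the value.

```python
import sys

values = (1, 2**20, 2**30, 2**31, 2**65)

# Python 3: one int type throughout, with a size that grows as needed.
for n in values:
    print(repr(n), type(n).__name__, sys.getsizeof(n))

sizes = [sys.getsizeof(n) for n in values]
```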
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-07-13 07:37 -0700 |
| Message-ID | <bee89c63-2608-490d-b75f-62aa7c957223@googlegroups.com> |
| In reply to | #50588 |
On Saturday, 13 July 2013 11:49:10 UTC+2, Steven D'Aprano wrote:
> On Sat, 13 Jul 2013 00:56:52 -0700, wxjmfauth wrote:
>
> > You are confusing the knowledge of a coding scheme and the intrinsic
> > information a "coding scheme" *may* have, in a mandatory way, to work
> > properly. These are conceptually two different things.
>
> *May* have, in a *mandatory* way?
>
> JMF, I know you are not a native English speaker, so you might not be
> aware just how silly your statement is. If it *may* have, it is optional,
> since it *may not* have instead. But if it is optional, it is not
> mandatory.
>
> You are making so much fuss over such a simple, obvious implementation
> for strings. The language Pike has done the same thing for probably a
> decade or so.
>
> Ironically, Python has done the same thing for integers for many versions
> too. They just didn't call it "Flexible Integer Representation", but
> that's what it is. For integers smaller than 2**31, they are stored as C
> longs (plus object overhead). For integers larger than 2**31, they are
> promoted to a BigNum implementation that can handle unlimited digits.
>
> Using Python 2.7, where it is more obvious because the BigNum has an L
> appended to the display, and a different type:
>
> py> for n in (1, 2**20, 2**30, 2**31, 2**65):
> ...     print repr(n), type(n), sys.getsizeof(n)
> ...
> 1 <type 'int'> 12
> 1048576 <type 'int'> 12
> 1073741824 <type 'int'> 12
> 2147483648L <type 'long'> 18
> 36893488147419103232L <type 'long'> 22
>
> You have been using Flexible Integer Representation for *years*, and it
> works great, and you've never noticed any problems.
>
> --
> Steven

------

The FSR is naive and badly working. I can not force people
to understand the coding of the characters [*].

I'm the first to recognize that Python and/or Pike are
free to do what they wish.

Luckily, for the crowd, those who do not even know that the
coding of characters exists, all the serious actors active in
text processing are working properly.

jmf

* By nature characters and numbers are different.
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2013-07-13 15:02 -0400 |
| Message-ID | <mailman.4684.1373742160.3114.python-list@python.org> |
| In reply to | #50596 |
On 07/13/2013 10:37 AM, wxjmfauth@gmail.com wrote:
<SNIP>
>
> The FSR is naive and badly working. I can not force people
> to understand the coding of the characters [*].

That would be very hard, since you certainly do not.

>
> I'm the first to recognize that Python and/or Pike are
> free to do what they wish.

Fortunately for us, Python (in version 3.3 and later) and Pike did it
right. Some day the others may decide to do similarly.

>
> Luckily, for the crowd, those who do not even know that the
> coding of characters exists, all the serious actors active in
> text processing are working properly.

Here, I'm really glad you don't know English, because if you had a
decent grasp of the language, somebody might assume you knew what you
were talking about.

>
> jmf
>
> * By nature characters and numbers are different.
>

By nature Jmf has his own distorted reality.

--
DaveA