Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #50473

Re: hex dump w/ or w/out utf-8 chars

Newsgroups comp.lang.python
Date 2013-07-11 11:42 -0700
References <a35609c1-e56f-4180-8176-4405264da0a2@googlegroups.com> <7ef8c0e7-7f7c-4a22-89a9-50f62c4a8064@googlegroups.com> <mailman.4391.1373305945.3114.python-list@python.org> <a3a4aa9b-3a5c-42cd-9a04-4c02f962b71e@googlegroups.com> <mailman.4585.1373549528.3114.python-list@python.org>
Message-ID <26d5c832-eaa1-439e-af61-e2855af2cd18@googlegroups.com> (permalink)
Subject Re: hex dump w/ or w/out utf-8 chars
From wxjmfauth@gmail.com

Show all headers | View raw


Le jeudi 11 juillet 2013 15:32:00 UTC+2, Chris Angelico a écrit :
> On Thu, Jul 11, 2013 at 11:18 PM,  <wxjmfauth@gmail.com> wrote:
> 
> > Just to stick with this funny character ẞ, a ucs-2 char
> 
> > in the Flexible String Representation nomenclature.
> 
> >
> 
> > It seems to me that, when one needs more than ten bytes
> 
> > to encode it,
> 
> >
> 
> >>>> sys.getsizeof('a')
> 
> > 26
> 
> >>>> sys.getsizeof('ẞ')
> 
> > 40
> 
> >
> 
> > this is far away from the perfection.
> 
> 
> 
> Better comparison is to see how much space is used by one copy of it,
> 
> and how much by two copies:
> 
> 
> 
> >>> sys.getsizeof('aa')-sys.getsizeof('a')
> 
> 1
> 
> >>> sys.getsizeof('ẞẞ')-sys.getsizeof('ẞ')
> 
> 2
> 
> 
> 
> String objects have overhead. Big deal.
> 
> 
> 
> > BTW, for a modern language, is not ucs2 considered
> 
> > as obsolete since many, many years?
> 
> 
> 
> Clearly. And similarly, the 16-bit integer has been completely
> 
> obsoleted, as there is no reason anyone should ever bother to use it.
> 
> Same with the float type - everyone uses double or better these days,
> 
> right?
> 
> 
> 
> http://www.postgresql.org/docs/current/static/datatype-numeric.html
> 
> http://www.cplusplus.com/doc/tutorial/variables/
> 
> 
> 
> Nope, nobody uses small integers any more, they're clearly completely obsolete.
> 
> 
> 

Sure there is some overhead because a str is a class.
It still remain that a "ẞ" weights 14 bytes more than
an "a".

In "aẞ", the ẞ weights 6 bytes.

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('aẞ')
42

and in "aẞẞ", the ẞ weights 2 bytes

sys.getsizeof('aẞẞ')

And what to say about this "ucs4" char/string '\U0001d11e' which
is weighting 18 bytes more than an "a".

>>> sys.getsizeof('\U0001d11e')
44

A total absurdity. How does is come? Very simple, once you
split Unicode in subsets, not only you have to handle these
subsets, you have to create "markers" to differentiate them.
Not only, you produce "markers", you have to handle the
mess generated by these "markers". Hiding this markers
in the everhead of the class does not mean that they should
not be counted as part of the coding scheme. BTW, since
when a serious coding scheme need an extermal marker?



>>> sys.getsizeof('aa') - sys.getsizeof('a')
1

Shortly, if my algebra is still correct:

(overhead + marker + 2*'a') - (overhead + marker + 'a')
= (overhead + marker + 2*'a') - overhead - marker - 'a')
= overhead - overhead + marker - marker + 2*'a' - 'a'
= 0 + 0 + 'a'
= 1

The "marker" has magically disappeared.

jmf

Back to comp.lang.python | Previous | NextPrevious in thread | Next in thread | Find similar | Unroll thread


Thread

hex dump w/ or w/out utf-8 chars blatt <ferdy.blatsco@gmail.com> - 2013-07-07 17:22 -0700
  Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-08 11:17 +1000
  Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-08 05:48 +0000
  Re: hex dump w/ or w/out utf-8 chars ferdy.blatsco@gmail.com - 2013-07-08 10:31 -0700
    Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-09 03:52 +1000
      Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-11 06:18 -0700
        Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-11 23:32 +1000
          Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-11 11:42 -0700
            Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-11 11:44 -0700
            Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-12 03:18 +0000
              Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-12 14:42 -0700
            Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-12 12:16 +1000
              Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-13 00:56 -0700
                Re: hex dump w/ or w/out utf-8 chars Lele Gaifax <lele@metapensiero.it> - 2013-07-13 10:24 +0200
                Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-13 09:36 +0000
                Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-13 19:46 +1000
                Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-13 09:49 +0000
                Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-13 20:09 +1000
                Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-13 07:37 -0700
                Re: hex dump w/ or w/out utf-8 chars Dave Angel <davea@davea.name> - 2013-07-13 15:02 -0400
                Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-14 01:20 -0700
                Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-14 10:44 +0000
                Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-14 06:44 -0700
                Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-24 06:28 -0700
                Re: hex dump w/ or w/out utf-8 chars Neil Hodgson <nhodgson@iinet.net.au> - 2013-07-14 09:17 +1000
  Re: hex dump w/ or w/out utf-8 chars ferdy.blatsco@gmail.com - 2013-07-08 10:53 -0700
    Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-09 04:07 +1000
    Re: hex dump w/ or w/out utf-8 chars Dave Angel <davea@davea.name> - 2013-07-08 16:56 -0400
      Re: hex dump w/ or w/out utf-8 chars Neil Cerutti <neilc@norwich.edu> - 2013-07-09 12:22 +0000
        Re: hex dump w/ or w/out utf-8 chars Dave Angel <davea@davea.name> - 2013-07-09 08:54 -0400
          Re: hex dump w/ or w/out utf-8 chars Neil Cerutti <neilc@norwich.edu> - 2013-07-09 13:00 +0000
            Re: hex dump w/ or w/out utf-8 chars Skip Montanaro <skip@pobox.com> - 2013-07-09 08:18 -0500
            Re: hex dump w/ or w/out utf-8 chars Dave Angel <davea@davea.name> - 2013-07-09 09:23 -0400
    Re: hex dump w/ or w/out utf-8 chars MRAB <python@mrabarnett.plus.com> - 2013-07-08 22:38 +0100
    Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-09 07:49 +1000
      Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-09 06:53 +0000
    Re: hex dump w/ or w/out utf-8 chars Joshua Landau <joshua.landau.ws@gmail.com> - 2013-07-08 23:02 +0100
    Re: hex dump w/ or w/out utf-8 chars Dave Angel <davea@davea.name> - 2013-07-08 18:45 -0400
    Re: hex dump w/ or w/out utf-8 chars Chris Angelico <rosuav@gmail.com> - 2013-07-09 08:51 +1000
    Re: hex dump w/ or w/out utf-8 chars MRAB <python@mrabarnett.plus.com> - 2013-07-09 00:32 +0100
      Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-09 06:46 +0000
    Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-09 07:00 +0000
      Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-09 02:34 -0700
        Re: hex dump w/ or w/out utf-8 chars Chris “Kwpolska” Warrick <kwpolska@gmail.com> - 2013-07-09 12:15 +0200
          Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-09 16:32 +0000
            Re: hex dump w/ or w/out utf-8 chars wxjmfauth@gmail.com - 2013-07-10 01:52 -0700
        Re: hex dump w/ or w/out utf-8 chars Joshua Landau <joshua@landau.ws> - 2013-07-12 23:01 +0100
          Re: hex dump w/ or w/out utf-8 chars Tim Roberts <timr@probo.com> - 2013-07-12 20:42 -0700
          Re: hex dump w/ or w/out utf-8 chars Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-13 04:51 +0000

csiph-web