Re: hex dump w/ or w/out utf-8 chars

Newsgroups comp.lang.python
Date 2013-07-11 11:44 -0700
References (1 earlier) <7ef8c0e7-7f7c-4a22-89a9-50f62c4a8064@googlegroups.com> <mailman.4391.1373305945.3114.python-list@python.org> <a3a4aa9b-3a5c-42cd-9a04-4c02f962b71e@googlegroups.com> <mailman.4585.1373549528.3114.python-list@python.org> <26d5c832-eaa1-439e-af61-e2855af2cd18@googlegroups.com>
Message-ID <d9ebccf8-050c-4dcc-ad51-ff50868a1287@googlegroups.com> (permalink)
Subject Re: hex dump w/ or w/out utf-8 chars
From wxjmfauth@gmail.com

On Thursday, 11 July 2013 20:42:26 UTC+2, wxjm...@gmail.com wrote:
> On Thursday, 11 July 2013 15:32:00 UTC+2, Chris Angelico wrote:
> > On Thu, Jul 11, 2013 at 11:18 PM,  <wxjmfauth@gmail.com> wrote:
> > > Just to stick with this funny character ẞ, a ucs-2 char
> > > in the Flexible String Representation nomenclature.
> > >
> > > It seems to me that, when one needs more than ten bytes
> > > to encode it,
> > >
> > > >>> sys.getsizeof('a')
> > > 26
> > > >>> sys.getsizeof('ẞ')
> > > 40
> > >
> > > this is far away from the perfection.
> >
> > Better comparison is to see how much space is used by one copy of it,
> > and how much by two copies:
> >
> > >>> sys.getsizeof('aa')-sys.getsizeof('a')
> > 1
> > >>> sys.getsizeof('ẞẞ')-sys.getsizeof('ẞ')
> > 2
> >
> > String objects have overhead. Big deal.
> >
> > > BTW, for a modern language, is not ucs2 considered
> > > as obsolete since many, many years?
> >
> > Clearly. And similarly, the 16-bit integer has been completely
> > obsoleted, as there is no reason anyone should ever bother to use it.
> > Same with the float type - everyone uses double or better these days,
> > right?
> >
> > http://www.postgresql.org/docs/current/static/datatype-numeric.html
> > http://www.cplusplus.com/doc/tutorial/variables/
> >
> > Nope, nobody uses small integers any more, they're clearly completely obsolete.
>
> Sure, there is some overhead because a str is a class.
> It still remains that a "ẞ" weighs 14 bytes more than
> an "a".
>
> In "aẞ", the ẞ weighs 6 bytes.
>
> >>> sys.getsizeof('a')
> 26
> >>> sys.getsizeof('aẞ')
> 42
>
> and in "aẞẞ", the ẞ weighs 2 bytes
>
> >>> sys.getsizeof('aẞẞ')
>
> And what to say about this "ucs4" char/string '\U0001d11e', which
> weighs 18 bytes more than an "a"?
>
> >>> sys.getsizeof('\U0001d11e')
> 44
>
> A total absurdity. How does it come about? Very simple: once you
> split Unicode into subsets, not only do you have to handle these
> subsets, you have to create "markers" to differentiate them.
> Not only do you produce "markers", you have to handle the
> mess generated by these "markers". Hiding these markers
> in the overhead of the class does not mean that they should
> not be counted as part of the coding scheme. BTW, since
> when does a serious coding scheme need an external marker?
>
> >>> sys.getsizeof('aa') - sys.getsizeof('a')
> 1
>
> In short, if my algebra is still correct:
>
> (overhead + marker + 2*'a') - (overhead + marker + 'a')
> = (overhead + marker + 2*'a') - overhead - marker - 'a'
> = overhead - overhead + marker - marker + 2*'a' - 'a'
> = 0 + 0 + 'a'
> = 1
>
> The "marker" has magically disappeared.
>
> jmf
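
The cancellation in the quoted algebra can also be looked at from the other side, by measuring what is left once the character payload is subtracted. The following is a rough sketch, assuming CPython 3.3 or later and assuming per-character widths of 1, 2, and 4 bytes for the three sample characters. `fixed_overhead` is a hypothetical helper, and the absolute values it reports vary with interpreter version and platform.

```python
import sys

def fixed_overhead(ch, width, n):
    """Size of a string of n copies of ch, minus n * width payload bytes.

    Hypothetical helper. `width` is the per-character storage width in
    bytes that CPython 3.3+ is assumed to pick for ch (1, 2 or 4).
    """
    return sys.getsizeof(ch * n) - n * width

samples = [("a", 1), ("\u1e9e", 2), ("\U0001d11e", 4)]
for ch, width in samples:
    # The per-object cost is the same for short and long strings,
    # which is why it cancels when two sizes are subtracted.
    short = fixed_overhead(ch, width, 1)
    long_ = fixed_overhead(ch, width, 1000)
    print(hex(ord(ch)), short, long_)
```

For each character the two printed values should match, meaning the per-object cost is paid once per string and does not grow with length. It does differ between the ASCII case and the wider cases, which is the part of the cost the subtraction hides.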
