Flexible string representation, unicode, typography, ...

Started by: wxjmfauth@gmail.com
First post: 2012-08-23 05:47 -0700
Last post: 2012-08-25 07:23 -0400
Articles: 20 on this page of 95 (21 participants)


Contents

  Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-23 05:47 -0700
    Re: Flexible string representation, unicode, typography, ... Neil Hodgson <nhodgson@iinet.net.au> - 2012-08-23 23:57 +1000
      Re: Flexible string representation, unicode, typography, ... MRAB <python@mrabarnett.plus.com> - 2012-08-23 16:11 +0100
      Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-23 09:19 -0600
      Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-23 11:33 -0700
        Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-23 13:22 -0600
          Re: Flexible string representation, unicode, typography, ... rusi <rustompmody@gmail.com> - 2012-08-24 09:06 -0700
            Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-24 17:47 +0100
            Re: Flexible string representation, unicode, typography, ... Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-08-24 14:34 -0400
        Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-23 20:34 +0100
    Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-23 15:18 +0100
    Re: Flexible string representation, unicode, typography, ... Ramchandra Apte <maniandram01@gmail.com> - 2012-08-24 07:38 -0700
      Re: Flexible string representation, unicode, typography, ... Antoine Pitrou <solipsis@pitrou.net> - 2012-08-25 00:24 +0000
        Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 00:27 -0700
          Re: Flexible string representation, unicode, typography, ... Ben Finney <ben+python@benfinney.id.au> - 2012-08-25 17:54 +1000
        Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 00:27 -0700
          Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-25 09:58 +0100
          Re: Flexible string representation, unicode, typography, ... Frank Millman <frank@chagford.com> - 2012-08-25 11:46 +0200
            Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 08:47 -0700
            Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 08:47 -0700
              Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-25 16:26 -0600
                Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 23:59 -0700
                  Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 09:50 -0600
                Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 23:59 -0700
                  Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 11:49 +0000
                    Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 09:40 -0600
                      Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 20:13 +0000
                        Re: Flexible string representation, unicode, typography, ... Dan Sommers <dan@tombstonezero.net> - 2012-08-26 13:45 -0700
                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 12:16 -0700
                            Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-27 14:14 -0600
                              Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 13:37 -0700
                              Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:38 -0700
                              Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:38 -0700
                            Re: Flexible string representation, unicode, typography, ... Neil Hodgson <nhodgson@iinet.net.au> - 2012-08-28 09:54 +1000
                              Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-29 13:59 +1000
                              Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-28 22:15 -0600
                                Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-29 08:05 +0000
                                Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:40 -0700
                                  Re: Flexible string representation, unicode, typography, ... Dave Angel <d@davea.name> - 2012-08-29 08:01 -0400
                                    Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 08:43 -0700
                                      Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-30 06:55 +0000
                                        Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-30 18:59 +1000
                                        Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-08-30 07:02 -0400
                                          Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-30 16:00 +0000
                                            Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-08-30 16:44 -0400
                                              Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-31 12:32 +0000
                                                Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-31 09:13 -0600
                                            Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-08-31 08:43 -0400
                                              Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-31 14:54 +0000
                                        Re: Flexible string representation, unicode, typography, ... Antoine Pitrou <solipsis@pitrou.net> - 2012-08-30 15:01 +0000
                                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 00:36 -0700
                                            Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 09:58 +0100
                                            Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-02 03:06 -0600
                                              Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 11:58 -0700
                                                Re: Flexible string representation, unicode, typography, ... Michael Torrie <torriem@gmail.com> - 2012-09-02 13:45 -0600
                                                Re: Flexible string representation, unicode, typography, ... Dave Angel <d@davea.name> - 2012-09-02 16:07 -0400
                                                Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-02 16:38 -0400
                                                Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-03 01:42 +0000
                                                  Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:26 +0300
                                                    Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-04 00:53 +0000
                                              Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 11:58 -0700
                                            Re: Flexible string representation, unicode, typography, ... Peter Otten <__peter__@web.de> - 2012-09-02 11:52 +0200
                                            Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 11:36 +0100
                                            Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-02 15:00 +0300
                                              Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 22:39 -0700
                                                Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-03 07:11 +0100
                                                Re: Flexible string representation, unicode, typography, ... Peter Otten <__peter__@web.de> - 2012-09-03 08:15 +0200
                                                Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-03 04:38 -0400
                                                Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:56 +0300
                                              Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 22:39 -0700
                                            Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 13:23 +0100
                                              Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-09-02 08:35 -0400
                                              Re: Flexible string representation, unicode, typography, ... Ramchandra Apte <maniandram01@gmail.com> - 2012-09-02 06:48 -0700
                                                Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 15:46 +0100
                                              Re: Flexible string representation, unicode, typography, ... Ramchandra Apte <maniandram01@gmail.com> - 2012-09-02 06:48 -0700
                                            Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-03 12:33 -0600
                                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 00:36 -0700
                                        Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-30 10:27 -0600
                                        Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-02 23:38 +0300
                                          Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-03 01:54 +0000
                                            Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-02 22:33 -0400
                                            Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-09-03 11:24 -0400
                                            Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:41 +0300
                                        Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 00:45 +0300
                                    Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-30 01:54 +1000
                                  Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-29 22:34 +1000
                                Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:40 -0700
                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 12:16 -0700
                        Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 15:42 -0600
                          Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 23:31 +0000
                            Re: Flexible string representation, unicode, typography, ... Paul Rubin <no.email@nospam.invalid> - 2012-08-26 17:47 -0700
          Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-25 21:04 +1000
          Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-25 12:05 +0100
          Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-25 21:19 +1000
          Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-08-25 07:23 -0400


#27730 — Flexible string representation, unicode, typography, ...

From: wxjmfauth@gmail.com
Date: 2012-08-23 05:47 -0700
Subject: Flexible string representation, unicode, typography, ...
Message-ID: <a81cd504-d889-4aa1-9daa-6df3448b4da8@googlegroups.com>
This is neither a complaint nor a question, just a comment.

In the previous discussion related to the flexible
string representation, Roy Smith added this comment:

http://groups.google.com/group/comp.lang.python/browse_thread/thread/2645504f459bab50/eda342573381ff42

Not only do I agree with his sentence:
"Clearly, the world has moved to a 32-bit character set."

but he also used a very interesting word in his comment: "punctuation".

There is a point which is, in my mind, not very well understood,
"digested", underestimated or neglected by many developers:
the relation between the coding of the characters and the typography.

Unicode (the consortium) does not only deal with the coding of
the characters; it has also worked on the *classification* of characters.

A deliberately simplistic picture: "letters" at the bottom
of the table, at low code points/integers; "typographic characters"
like punctuation, common symbols, ... high in the table, at high code
points/integers.
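
The characters named in this post can be looked up in the copy of the Unicode Character Database that ships with Python. A minimal sketch, assuming only the standard unicodedata module; the helper name describe is made up for illustration:

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (hex(ord(ch)), unicodedata.name(ch), unicodedata.category(ch))

# 'a', FULL STOP, EM DASH, BULLET, EURO SIGN
for ch in "a.\u2014\u2022\u20ac":
    print(describe(ch))
```

The general category (for example 'Ll' for a lowercase letter, 'Po' for other punctuation) is a property attached to each code point, separate from its numeric value.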

The conclusion is inescapable: if one wishes to work in a "unicode
mode", one is forced to use the whole palette of the unicode
code points; this is the *nature* of Unicode.

Technically, believing that it is possible to optimize only a subrange
of the unicode code point range is simply an illusion. It is a lot of
work, probably quite complicated, which in the end solves nothing.

Python, in my mind, fell into this trap.

"Simple is better than complex."
  -> hard to maintain
"Flat is better than nested."
  -> code points range
"Special cases aren't special enough to break the rules."
  -> special unicode code points?
"Although practicality beats purity."
  -> or the opposite?
"In the face of ambiguity, refuse the temptation to guess."
  -> guessing a user will only work with the "optimized" char subrange.
...

Small illustration. Take an A4 page containing 50 lines of 80 ASCII
characters, add a single 'EM DASH' or a 'BULLET' (code points > 0x2000),
and you will see all the optimization efforts destroyed.

>>> import sys
>>> sys.getsizeof('a' * 80 * 50)
4025
>>> sys.getsizeof('a' * 80 * 50 + '•')
8040

Just my 2 € (code point 0x20ac) cents.

jmf
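
The per-character cost behind those two numbers can be isolated by differencing. A minimal sketch, assuming CPython 3.3 or later (the flexible string representation); other implementations may report different sizes, and the helper name bytes_per_char is made up for illustration:

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character by differencing two string sizes.

    Subtracting the sizes of two strings built from the same character
    cancels the fixed object header, leaving only the per-character cost.
    """
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n

for label, ch in [("ASCII", "a"), ("Latin-1", "\xe9"),
                  ("BMP", "\u2022"), ("astral", "\U0001F600")]:
    print(label, bytes_per_char(ch))
```

Under that assumption the loop should report 1, 1, 2 and 4 bytes respectively, which is why a single BULLET roughly doubles the size of an otherwise ASCII page.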



#27733

From: Neil Hodgson <nhodgson@iinet.net.au>
Date: 2012-08-23 23:57 +1000
Message-ID: <D7udnfbyKvHEqqvNnZ2dnUVZ_sidnZ2d@westnet.com.au>
In reply to: #27730
wxjmfauth@gmail.com:

> Small illustration. Take an a4 page containing 50 lines of 80 ascii
> characters, add a single 'EM DASH' or an 'BULLET' (code points > 0x2000),
> and you will see all the optimization efforts destroyed.
>
> >>> sys.getsizeof('a' * 80 * 50)
> 4025
> >>> sys.getsizeof('a' * 80 * 50 + '•')
> 8040

    This example is still benefiting from shrinking the number of bytes 
in half over using 32 bits per character as was the case with Python 3.2:

 >>> sys.getsizeof('a' * 80 * 50)
16032
 >>> sys.getsizeof('a' * 80 * 50 + '•')
16036
 >>>

    Neil
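
The halving described above can be approximated without access to an old interpreter by using fixed-width codecs as stand-ins for the internal layouts. This is only a sketch of the payload arithmetic, not of the actual object layout; the helper name fixed_width_bytes is made up for illustration, and utf-16-le is fixed-width only for characters in the Basic Multilingual Plane:

```python
page = "a" * 80 * 50
page_with_bullet = page + "\u2022"

def fixed_width_bytes(s, codec):
    """Payload size if every character were stored with the given codec."""
    return len(s.encode(codec))

print(fixed_width_bytes(page, "latin-1"))                # 1 byte per character
print(fixed_width_bytes(page_with_bullet, "utf-16-le"))  # 2 bytes per character
print(fixed_width_bytes(page_with_bullet, "utf-32-le"))  # 4 bytes per character
```

This prints 4000, 8002 and 16004; the figures quoted in the thread are slightly larger because sys.getsizeof also counts the string object's header.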



#27740

From: MRAB <python@mrabarnett.plus.com>
Date: 2012-08-23 16:11 +0100
Message-ID: <mailman.3717.1345734660.4697.python-list@python.org>
In reply to: #27733
On 23/08/2012 14:57, Neil Hodgson wrote:
> wxjmfauth@gmail.com:
>
>> Small illustration. Take an a4 page containing 50 lines of 80 ascii
>> characters, add a single 'EM DASH' or an 'BULLET' (code points > 0x2000),
>> and you will see all the optimization efforts destroyed.
>>
>> >>> sys.getsizeof('a' * 80 * 50)
>> 4025
>> >>> sys.getsizeof('a' * 80 * 50 + '•')
>> 8040
>
>      This example is still benefiting from shrinking the number of bytes
> in half over using 32 bits per character as was the case with Python 3.2:
>
>   >>> sys.getsizeof('a' * 80 * 50)
> 16032
>   >>> sys.getsizeof('a' * 80 * 50 + '•')
> 16036
>   >>>
>
Perhaps the solution should've been to just switch between 2/4 bytes
instead of 1/2/4 bytes. :-)



#27741

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2012-08-23 09:19 -0600
Message-ID: <mailman.3718.1345735195.4697.python-list@python.org>
In reply to: #27733
On Thu, Aug 23, 2012 at 9:11 AM, MRAB <python@mrabarnett.plus.com> wrote:
> Perhaps the solution should've been to just switch between 2/4 bytes instead
> of 1/2/4 bytes. :-)

Why?  You don't lose any complexity by doing that.  I can see
arguments for 1/2/4 or for just 4, but I can't see any advantage of
2/4 over either of those.
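
The three schemes being compared can be put side by side with a toy model. This is a sketch of payload sizes only, under the assumption that each scheme stores every character of a string at the narrowest offered width that fits its largest code point; the function name storage_bytes and the sample strings are made up for illustration:

```python
def storage_bytes(s, widths):
    """Payload bytes under a hypothetical scheme offering the given widths.

    The narrowest offered width that can hold the largest code point
    in the string is used for every character.
    """
    limits = {1: 0xFF, 2: 0xFFFF, 4: 0x10FFFF}
    top = max(map(ord, s), default=0)
    width = min(w for w in widths if top <= limits[w])
    return len(s) * width

samples = {"ascii": "a" * 4000, "bullet": "a" * 4000 + "\u2022"}
schemes = [(1, 2, 4), (2, 4), (4,)]
for name, text in samples.items():
    print(name, [storage_bytes(text, scheme) for scheme in schemes])
```

In this model the 2/4 scheme never beats 1/2/4 on size: it matches it for the string containing a BULLET and costs twice as much for the pure ASCII string.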



#27757

From: wxjmfauth@gmail.com
Date: 2012-08-23 11:33 -0700
Message-ID: <7eaafbcd-597d-4f8c-98a8-ecb537e6e065@googlegroups.com>
In reply to: #27733
On Thursday, 23 August 2012 15:57:50 UTC+2, Neil Hodgson wrote:
> wxjmfauth@gmail.com:
>
> > Small illustration. Take an a4 page containing 50 lines of 80 ascii
> > characters, add a single 'EM DASH' or an 'BULLET' (code points > 0x2000),
> > and you will see all the optimization efforts destroyed.
> >
> > >>> sys.getsizeof('a' * 80 * 50)
> > 4025
> > >>> sys.getsizeof('a' * 80 * 50 + '•')
> > 8040
>
>     This example is still benefiting from shrinking the number of bytes
> in half over using 32 bits per character as was the case with Python 3.2:
>
>  >>> sys.getsizeof('a' * 80 * 50)
> 16032
>  >>> sys.getsizeof('a' * 80 * 50 + '•')
> 16036

Correct, but how many times does it happen?
Practically never.

In this unicode stuff, I'm fascinated by the obsession
to solve a problem which is, due to the nature of
Unicode, unsolvable.

For every optimization algorithm, for every code
point range you can optimize, it is always possible
to find a case breaking that optimization.

This follows almost mathematical logic. To prove a
law is valid, you have to prove that all the cases
are valid. To prove a law is invalid, just find one
case showing it.

Sure, it is possible to optimize the unicode usage
by not using French characters, punctuation, mathematical
symbols, currency symbols, CJK characters...
(select undesired characters here: http://www.unicode.org/charts/).

In that case, why use unicode?
(A problem not specific to Python)

jmf



#27762

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2012-08-23 13:22 -0600
Message-ID: <mailman.3730.1345749768.4697.python-list@python.org>
In reply to: #27757
On Thu, Aug 23, 2012 at 12:33 PM,  <wxjmfauth@gmail.com> wrote:
>> > >>> sys.getsizeof('a' * 80 * 50)
>> > 4025
>> > >>> sys.getsizeof('a' * 80 * 50 + '•')
>> > 8040
>>
>>     This example is still benefiting from shrinking the number of bytes
>> in half over using 32 bits per character as was the case with Python 3.2:
>>
>>  >>> sys.getsizeof('a' * 80 * 50)
>> 16032
>>  >>> sys.getsizeof('a' * 80 * 50 + '•')
>> 16036
>>
> Correct, but how many times does it happen?
> Practically never.

What are you talking about?  Surely it happens the same number of
times that your example happens, since it's the same example.  By
dismissing this example as being too infrequent to be of any
importance, you dismiss the validity of your own example as well.

> In this unicode stuff, I'm fascinated by the obsession
> to solve a problem which is, due to the nature of
> Unicode, unsolvable.
>
> For every optimization algorithm, for every code
> point range you can optimize, it is always possible
> to find a case breaking that optimization.

So what?  Similarly, for any generalized data compression algorithm,
it is possible to engineer inputs for which the "compressed" output is
as large as or larger than the original input (this is easy to prove).
 Does this mean that compression algorithms are useless?  I hardly
think so, as evidenced by the widespread popularity of tools like gzip
and WinZip.
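
The compression analogy is easy to try with the standard zlib module. A minimal sketch, assuming pseudo-random bytes as the "engineered" worst case and a repeated line of text as the favourable case; both sample inputs are made up for illustration:

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
noise = bytes(rng.getrandbits(8) for _ in range(4096))
text = b"Flexible string representation, unicode, typography, ... " * 72

# Print the input size next to the compressed size for each sample.
for label, data in [("noise", noise), ("text", text)]:
    print(label, len(data), len(zlib.compress(data)))
```

The noise comes out at least as large as it went in, while the repetitive text shrinks considerably, and nobody concludes from the first case that the second is worthless.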

You seem to be saying that because we cannot pack all Unicode strings
into 1-byte or 2-byte per character representations, we should just
give up and force everybody to use maximum-width representations for
all strings.  That is absurd.

> Sure, it is possible to optimize the unicode usage
> by not using French characters, punctuation, mathematical
> symbols, currency symbols, CJK characters...
> (select undesired characters here: http://www.unicode.org/charts/).
>
> In that case, why using unicode?
> (A problematic not specific to Python)

Obviously, it is because I want to have the *ability* to represent all
those characters in my strings, even if I am not necessarily going to
take advantage of that ability in every single string that I produce.
Not all of the strings I use are going to fit into the 1-byte or
2-byte per character representation.  Fine, whatever -- that's part of
the cost of internationalization.  However, *most* of the strings that
I work with (this entire email message, for instance) -- and, I think,
most of the strings that any developer works with (identifiers in the
standard library, for instance) -- will fit into at least the 2-byte
per character representation.  Why shackle every string everywhere to
4 bytes per character when for a majority of them we can do much
better than that?
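
The remark about identifiers can be checked on a small scale. A rough sketch that looks only at the built-in names and keywords of the running interpreter, which is a far smaller sample than the whole standard library; the variable names are made up for illustration:

```python
import builtins
import keyword

# Built-in names plus language keywords of the running interpreter.
names = sorted(set(dir(builtins)) | set(keyword.kwlist))

# Names containing any character outside the 7-bit ASCII range.
non_ascii = [name for name in names if any(ord(c) > 127 for c in name)]

print(len(names), "names,", len(non_ascii), "outside ASCII")
```

Every one of these names fits the 1-byte per character representation, so for this sample a fixed 4-byte layout would spend three quarters of its payload on zero bytes.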



#27809

From: rusi <rustompmody@gmail.com>
Date: 2012-08-24 09:06 -0700
Message-ID: <a657deea-b429-4662-898e-c500ef592556@f4g2000pbq.googlegroups.com>
In reply to: #27762
On Aug 24, 12:22 am, Ian Kelly <ian.g.ke...@gmail.com> wrote:
> On Thu, Aug 23, 2012 at 12:33 PM,  <wxjmfa...@gmail.com> wrote:
> >> > >>> sys.getsizeof('a' * 80 * 50)
> >> > 4025
> >> > >>> sys.getsizeof('a' * 80 * 50 + '•')
> >> > 8040
> >>
> >>     This example is still benefiting from shrinking the number of bytes
> >> in half over using 32 bits per character as was the case with Python 3.2:
> >>
> >>  >>> sys.getsizeof('a' * 80 * 50)
> >> 16032
> >>  >>> sys.getsizeof('a' * 80 * 50 + '•')
> >> 16036
>
> > Correct, but how many times does it happen?
> > Practically never.
>
> What are you talking about?  Surely it happens the same number of
> times that your example happens, since it's the same example.  By
> dismissing this example as being too infrequent to be of any
> importance, you dismiss the validity of your own example as well.
>
> > In this unicode stuff, I'm fascinated by the obsession
> > to solve a problem which is, due to the nature of
> > Unicode, unsolvable.
>
> > For every optimization algorithm, for every code
> > point range you can optimize, it is always possible
> > to find a case breaking that optimization.
>
> So what?  Similarly, for any generalized data compression algorithm,
> it is possible to engineer inputs for which the "compressed" output is
> as large as or larger than the original input (this is easy to prove).
>  Does this mean that compression algorithms are useless?  I hardly
> think so, as evidenced by the widespread popularity of tools like gzip
> and WinZip.
>
> You seem to be saying that because we cannot pack all Unicode strings
> into 1-byte or 2-byte per character representations, we should just
> give up and force everybody to use maximum-width representations for
> all strings.  That is absurd.
>
> > Sure, it is possible to optimize the unicode usage
> > by not using French characters, punctuation, mathematical
> > symbols, currency symbols, CJK characters...
> > (select undesired characters here: http://www.unicode.org/charts/).
>
> > In that case, why using unicode?
> > (A problematic not specific to Python)
>
> Obviously, it is because I want to have the *ability* to represent all
> those characters in my strings, even if I am not necessarily going to
> take advantage of that ability in every single string that I produce.
> Not all of the strings I use are going to fit into the 1-byte or
> 2-byte per character representation.  Fine, whatever -- that's part of
> the cost of internationalization.  However, *most* of the strings that
> I work with (this entire email message, for instance) -- and, I think,
> most of the strings that any developer works with (identifiers in the
> standard library, for instance) -- will fit into at least the 2-byte
> per character representation.  Why shackle every string everywhere to
> 4 bytes per character when for a majority of them we can do much
> better than that?

Actually, what exactly are you (jmf) asking for?
It's not clear to anybody, as best as we can see...



#27814

From: Mark Lawrence <breamoreboy@yahoo.co.uk>
Date: 2012-08-24 17:47 +0100
Message-ID: <mailman.3761.1345826801.4697.python-list@python.org>
In reply to: #27809
On 24/08/2012 17:06, rusi wrote:

>
> Actually what exactly are you (jmf) asking for?
> Its not clear to anybody as best as we can see...
>

A knee in the temple and a dagger up the <censored> ? :)  From another 
Monty Python sketch for those who don't know.

-- 
Cheers.

Mark Lawrence.



#27818

From: Dennis Lee Bieber <wlfraed@ix.netcom.com>
Date: 2012-08-24 14:34 -0400
Message-ID: <mailman.3765.1345833280.4697.python-list@python.org>
In reply to: #27809
On Fri, 24 Aug 2012 17:47:42 +0100, Mark Lawrence
<breamoreboy@yahoo.co.uk> declaimed the following in
gmane.comp.python.general:

> 
> A knee in the temple and a dagger up the <censored> ? :)  From another 
> Monty Python sketch for those who don't know.

	A poignard in the codpiece...
-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
        wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/



#27763

From: Mark Lawrence <breamoreboy@yahoo.co.uk>
Date: 2012-08-23 20:34 +0100
Message-ID: <mailman.3731.1345750334.4697.python-list@python.org>
In reply to: #27757
On 23/08/2012 19:33, wxjmfauth@gmail.com wrote:
> On Thursday, 23 August 2012 15:57:50 UTC+2, Neil Hodgson wrote:
>> wxjmfauth@gmail.com:
>>
>>> Small illustration. Take an a4 page containing 50 lines of 80 ascii
>>> characters, add a single 'EM DASH' or an 'BULLET' (code points > 0x2000),
>>> and you will see all the optimization efforts destroyed.
>>>
>>> >>> sys.getsizeof('a' * 80 * 50)
>>> 4025
>>> >>> sys.getsizeof('a' * 80 * 50 + '•')
>>> 8040
>>
>>      This example is still benefiting from shrinking the number of bytes
>> in half over using 32 bits per character as was the case with Python 3.2:
>>
>>   >>> sys.getsizeof('a' * 80 * 50)
>> 16032
>>   >>> sys.getsizeof('a' * 80 * 50 + '•')
>> 16036
>>
> Correct, but how many times does it happen?
> Practically never.
>
> In this unicode stuff, I'm fascinated by the obsession
> to solve a problem which is, due to the nature of
> Unicode, unsolvable.
>
> For every optimization algorithm, for every code
> point range you can optimize, it is always possible
> to find a case breaking that optimization.
>
> This follows quasi the mathematical logic. To proof a
> law is valid, you have to proof all the cases
> are valid. To proof a law is invalid, just find one
> case showing it.
>
> Sure, it is possible to optimize the unicode usage
> by not using French characters, punctuation, mathematical
> symbols, currency symbols, CJK characters...
> (select undesired characters here: http://www.unicode.org/charts/).
>
> In that case, why using unicode?
> (A problematic not specific to Python)
>
> jmf
>

What do you propose should be used instead, as you appear to be the 
resident expert in the field?

-- 
Cheers.

Mark Lawrence.



#27736

From: Mark Lawrence <breamoreboy@yahoo.co.uk>
Date: 2012-08-23 15:18 +0100
Message-ID: <mailman.3715.1345731438.4697.python-list@python.org>
In reply to: #27730
On 23/08/2012 13:47, wxjmfauth@gmail.com wrote:
> This is neither a complaint nor a question, just a comment.
>
> In the previous discussion related to the flexible
> string representation, Roy Smith added this comment:
>
> http://groups.google.com/group/comp.lang.python/browse_thread/thread/2645504f459bab50/eda342573381ff42
>
> Not only I agree with his sentence:
> "Clearly, the world has moved to a 32-bit character set."
>
> he used in his comment a very intersting word: "punctuation".
>
> There is a point which is, in my mind, not very well understood,
> "digested", underestimated or neglected by many developers:
> the relation between the coding of the characters and the typography.
>
> Unicode (the consortium), does not only deal with the coding of
> the characters, it also worked on the characters *classification*.
>
> A deliberatly simplistic representation: "letters" in the bottom
> of the table, lower code points/integers; "typographic characters"
> like punctuation, common symbols, ... high in the table, high code
> points/integers.
>
> The conclusion is inescapable, if one wish to work in a "unicode
> mode", one is forced to use the whole palette of the unicode
> code points, this is the *nature* of Unicode.
>
> Technically, believing that it possible to optimize only a subrange
> of the unicode code points range is simply an illusion. A lot of
> work, probably quite complicate, which finally solves nothing.
>
> Python, in my mind, fell into this trap.
>
> "Simple is better than complex."
>    -> hard to maintain
> "Flat is better than nested."
>    -> code points range
> "Special cases aren't special enough to break the rules."
>    -> special unicode code points?
> "Although practicality beats purity."
>    -> or the opposite?
> "In the face of ambiguity, refuse the temptation to guess."
>    -> guessing a user will only work with the "optimized" char subrange.
> ...
>
> Small illustration. Take an A4 page containing 50 lines of 80 ASCII
> characters, add a single 'EM DASH' or a 'BULLET' (code points > 0x2000),
> and you will see all the optimization efforts destroyed.
>
> >>> sys.getsizeof('a' * 80 * 50)
> 4025
> >>> sys.getsizeof('a' * 80 * 50 + '•')
> 8040
>
> Just my 2 € (code point 0x20ac) cents.
>
> jmf
>

I'm looking forward to all the patches you are going to provide to 
correct all these (presumably) CPython deficiencies.  When do they start 
arriving on the bug tracker?

-- 
Cheers.

Mark Lawrence.



#27802

From: Ramchandra Apte <maniandram01@gmail.com>
Date: 2012-08-24 07:38 -0700
Message-ID: <1874857c-68ef-4c1b-b15a-46ef47df9445@googlegroups.com>
In reply to: #27730

On Thursday, 23 August 2012 18:17:29 UTC+5:30, (unknown)  wrote:
> This is neither a complaint nor a question, just a comment.
>
> In the previous discussion related to the flexible
> string representation, Roy Smith added this comment:
>
> http://groups.google.com/group/comp.lang.python/browse_thread/thread/2645504f459bab50/eda342573381ff42
>
> Not only do I agree with his sentence:
> "Clearly, the world has moved to a 32-bit character set."
>
> he also used in his comment a very interesting word: "punctuation".
>
> There is a point which is, in my mind, not very well understood,
> "digested", underestimated or neglected by many developers:
> the relation between the coding of the characters and the typography.
>
> Unicode (the consortium) does not only deal with the coding of
> the characters, it also worked on the *classification* of characters.
>
> A deliberately simplistic representation: "letters" in the bottom
> of the table, lower code points/integers; "typographic characters"
> like punctuation, common symbols, ... high in the table, high code
> points/integers.
>
> The conclusion is inescapable: if one wishes to work in a "unicode
> mode", one is forced to use the whole palette of the unicode
> code points; this is the *nature* of Unicode.
>
> Technically, believing that it is possible to optimize only a subrange
> of the unicode code point range is simply an illusion. A lot of
> work, probably quite complicated, which finally solves nothing.
>
> Python, in my mind, fell into this trap.
>
> "Simple is better than complex."
>    -> hard to maintain
> "Flat is better than nested."
>    -> code points range
> "Special cases aren't special enough to break the rules."
>    -> special unicode code points?
> "Although practicality beats purity."
>    -> or the opposite?
> "In the face of ambiguity, refuse the temptation to guess."
>    -> guessing a user will only work with the "optimized" char subrange.
> ...
>
> Small illustration. Take an A4 page containing 50 lines of 80 ASCII
> characters, add a single 'EM DASH' or a 'BULLET' (code points > 0x2000),
> and you will see all the optimization efforts destroyed.
>
> >>> sys.getsizeof('a' * 80 * 50)
> 4025
> >>> sys.getsizeof('a' * 80 * 50 + '•')
> 8040
>
> Just my 2 € (code point 0x20ac) cents.
>
> jmf

The Zen of Python is simply a guideline.



#27843

From: Antoine Pitrou <solipsis@pitrou.net>
Date: 2012-08-25 00:24 +0000
Message-ID: <mailman.3784.1345854291.4697.python-list@python.org>
In reply to: #27802

Ramchandra Apte <maniandram01 <at> gmail.com> writes:
> 
> The zen of python is simply a guideline

What's more, the Zen guides the language's design, not its implementation.
People who think CPython is a complicated implementation can take a look at
PyPy :-)

Regards

Antoine.


-- 
Software development and contracting: http://pro.pitrou.net



#27853

From: wxjmfauth@gmail.com
Date: 2012-08-25 00:27 -0700
Message-ID: <mailman.3788.1345879639.4697.python-list@python.org>
In reply to: #27843

On Saturday, 25 August 2012 02:24:35 UTC+2, Antoine Pitrou wrote:
> Ramchandra Apte <maniandram01 <at> gmail.com> writes:
> >
> > The zen of python is simply a guideline
>
> What's more, the Zen guides the language's design, not its implementation.
> People who think CPython is a complicated implementation can take a look at PyPy
> :-)

Unicode design: a flat table of code points, where all code
points are "equal".
As soon as one attempts to escape from this rule, one has to
"pay" for it.
The creator of this machinery (flexible string representation)
cannot even benefit from it in his native language (I think
I'm correctly informed).

Hint: Google -> "Das grosse Eszett"

jmf



#27855

From: Ben Finney <ben+python@benfinney.id.au>
Date: 2012-08-25 17:54 +1000
Message-ID: <87sjbbe78w.fsf@benfinney.id.au>
In reply to: #27853

wxjmfauth@gmail.com writes:

> Unicode design: a flat table of code points, where all code
> points are "equal".

Yes, Unicode's design entails a flat table of hundreds of thousands of
code points, expansible in future.

This is in direct conflict with the design of all significant computers
we need to write software for: data stored and transported as 8-bit
bytes, which can only ever hold 256 different values, no expansion.

> As soon as one attempts to escape from this rule, one has to
> "pay" for it.

Yes, in either direction; the conflict means that trade-offs need to be
made.

See this presentation by Ned Batchelder, “Pragmatic Unicode”
<URL:http://nedbatchelder.com/text/unipain.html>, which lays out the
fundamental conflict of representing human text in computer data; and
several practical approaches to deal with it.
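
As a small illustration of that conflict (a sketch only, not taken from the
presentation): a single code point above 255 cannot fit in one byte, so it has
to be encoded as several bytes, and how many depends on the encoding chosen.
The euro sign is used here simply because it was mentioned earlier in the
thread.

```python
text = "\u20ac"  # EURO SIGN, code point 0x20AC

# The code point itself is just an integer, far larger than one byte can hold.
print(hex(ord(text)))

# Each encoding maps that integer to a different sequence of 8-bit bytes.
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    data = text.encode(encoding)
    print(encoding, len(data), data.hex())
```

The same character takes 3, 2 and 4 bytes respectively, which is the kind of
trade-off being discussed.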

-- 
 \      “I busted a mirror and got seven years bad luck, but my lawyer |
  `\                        thinks he can get me five.” —Steven Wright |
_o__)                                                                  |
Ben Finney



#27858

From: Mark Lawrence <breamoreboy@yahoo.co.uk>
Date: 2012-08-25 09:58 +0100
Message-ID: <mailman.3791.1345885204.4697.python-list@python.org>
In reply to: #27854

On 25/08/2012 08:27, wxjmfauth@gmail.com wrote:
> On Saturday, 25 August 2012 02:24:35 UTC+2, Antoine Pitrou wrote:
>> Ramchandra Apte <maniandram01 <at> gmail.com> writes:
>>>
>>> The zen of python is simply a guideline
>>
>> What's more, the Zen guides the language's design, not its implementation.
>> People who think CPython is a complicated implementation can take a look at PyPy
>> :-)
>
> Unicode design: a flat table of code points, where all code
> points are "equal".
> As soon as one attempts to escape from this rule, one has to
> "pay" for it.
> The creator of this machinery (flexible string representation)
> cannot even benefit from it in his native language (I think
> I'm correctly informed).
>
> Hint: Google -> "Das grosse Eszett"
>
> jmf
>

It's Saturday morning, I'm stone cold sober, had a good sleep and I'm 
still baffled as to the point, if any.  Could someone please enlighten me?

-- 
Cheers.

Mark Lawrence.



#27860

From: Frank Millman <frank@chagford.com>
Date: 2012-08-25 11:46 +0200
Message-ID: <mailman.3793.1345888006.4697.python-list@python.org>
In reply to: #27854

On 25/08/2012 10:58, Mark Lawrence wrote:
> On 25/08/2012 08:27, wxjmfauth@gmail.com wrote:
>>
>> Unicode design: a flat table of code points, where all code
>> points are "equal".
>> As soon as one attempts to escape from this rule, one has to
>> "pay" for it.
>> The creator of this machinery (flexible string representation)
>> cannot even benefit from it in his native language (I think
>> I'm correctly informed).
>>
>> Hint: Google -> "Das grosse Eszett"
>>
>> jmf
>>
>
> It's Saturday morning, I'm stone cold sober, had a good sleep and I'm
> still baffled as to the point, if any.  Could someone please enlighten me?
>

Here's what I think he is saying. I am posting this to test the water. I 
am also confused, and if I have got it wrong hopefully someone will 
correct me.

In Python 3.3, unicode strings are now stored as follows:
   - if all characters can be represented by 1 byte, the entire string is
     composed of 1-byte characters
   - else if all characters can be represented by 1 or 2 bytes, the entire
     string is composed of 2-byte characters
   - else the entire string is composed of 4-byte characters
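
The selection rule above can be sketched in Python. This is only an
illustration: `storage_width` and `measured_width` are hypothetical helper
names, not CPython's actual code, and the measurement assumes CPython 3.3 or
later, where `sys.getsizeof` reflects the internal string storage.

```python
import sys

def storage_width(s):
    """Bytes per character implied by the rule above (hypothetical helper)."""
    highest = max(map(ord, s), default=0)
    if highest < 0x100:        # every character fits in 1 byte
        return 1
    if highest < 0x10000:      # every character fits in 2 bytes
        return 2
    return 4                   # at least one character needs 4 bytes

def measured_width(ch, n=1000):
    """Estimate bytes per character from how sys.getsizeof grows with length.

    Assumes CPython 3.3+; the fixed object header cancels out in the difference.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

# 'a', LATIN SMALL LETTER E WITH ACUTE, BULLET, and a character outside the BMP
for ch in ("a", "\u00e9", "\u2022", "\U0001F600"):
    print(repr(ch), storage_width(ch), measured_width(ch))
```

On a CPython build with the flexible representation, the predicted and
measured columns should agree for each character.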

There is an overhead in making this choice, to detect the lowest number 
of bytes required.

jmfauth believes that this only benefits 'English-speaking' users, as
the rest of the world will tend to have strings where at least one
character requires 2 or 4 bytes. So they incur the overhead, without
getting any benefit.

Therefore, I think he is saying that he would have preferred that Python
standardise on 4-byte characters, on the grounds that the saving in
memory does not justify the performance overhead.
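
To put rough numbers on the memory side of that trade-off, the sketch below
compares the character payload of the 80 x 50 page example under the flexible
scheme with a fixed 4-byte scheme. It counts character data only and ignores
the per-object header (which is why the figures are a little below the 4025
and 8040 reported by `sys.getsizeof` earlier in the thread). The function
names are made up for illustration, and it says nothing about the performance
overhead.

```python
def flexible_payload(s):
    """Character payload in bytes under the 1/2/4-byte rule (header ignored)."""
    highest = max(map(ord, s), default=0)
    width = 1 if highest < 0x100 else 2 if highest < 0x10000 else 4
    return width * len(s)

def fixed_payload(s):
    """Character payload in bytes if every character always took 4 bytes."""
    return 4 * len(s)

page = "a" * 80 * 50
for text in (page, page + "\u2022"):  # U+2022 is BULLET
    print(len(text), flexible_payload(text), fixed_payload(text))
# prints:
# 4000 4000 16000
# 4001 8002 16004
```

In this example the single bullet doubles the flexible payload, but it is
still half of what a fixed 4-byte representation would use.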

Frank Millman



#27876

From: wxjmfauth@gmail.com
Date: 2012-08-25 08:47 -0700
Message-ID: <mailman.3805.1345909675.4697.python-list@python.org>
In reply to: #27860

On Saturday, 25 August 2012 11:46:34 UTC+2, Frank Millman wrote:
> On 25/08/2012 10:58, Mark Lawrence wrote:
> > On 25/08/2012 08:27, wxjmfauth@gmail.com wrote:
> >>
> >> Unicode design: a flat table of code points, where all code
> >> points are "equal".
> >> As soon as one attempts to escape from this rule, one has to
> >> "pay" for it.
> >> The creator of this machinery (flexible string representation)
> >> cannot even benefit from it in his native language (I think
> >> I'm correctly informed).
> >>
> >> Hint: Google -> "Das grosse Eszett"
> >>
> >> jmf
> >>
> >
> > It's Saturday morning, I'm stone cold sober, had a good sleep and I'm
> > still baffled as to the point, if any.  Could someone please enlighten me?
> >
>
> Here's what I think he is saying. I am posting this to test the water. I
> am also confused, and if I have got it wrong hopefully someone will
> correct me.
>
> In Python 3.3, unicode strings are now stored as follows:
>    - if all characters can be represented by 1 byte, the entire string is
>      composed of 1-byte characters
>    - else if all characters can be represented by 1 or 2 bytes, the entire
>      string is composed of 2-byte characters
>    - else the entire string is composed of 4-byte characters
>
> There is an overhead in making this choice, to detect the lowest number
> of bytes required.
>
> jmfauth believes that this only benefits 'English-speaking' users, as
> the rest of the world will tend to have strings where at least one
> character requires 2 or 4 bytes. So they incur the overhead, without
> getting any benefit.
>
> Therefore, I think he is saying that he would have preferred that Python
> standardise on 4-byte characters, on the grounds that the saving in
> memory does not justify the performance overhead.
>
> Frank Millman

Very well explained. Thanks.

More precisely, it is not only a matter of 'English-speaking'
users: everyone who uses non-Latin-1 characters is affected.
(See the title of this topic, ... typography).

Being Latin-1 compliant and Unicode compliant at the same time is
a plain absurdity in the mathematical sense.

---

For those who do not know, the Go language has introduced
the rune type. As far as I know, nobody is complaining; I
have not even seen a discussion related to this subject.

100% Unicode compliant from day 0. Congratulations.

jmf


