To: python-list@python.org
From: Mark Lawrence <breamoreboy@yahoo.co.uk>
Subject: Re: Flexible string representation, unicode, typography, ...
Date: Thu, 23 Aug 2012 15:18:05 +0100
Newsgroups: comp.lang.python
Message-ID: <mailman.3715.1345731438.4697.python-list@python.org>

On 23/08/2012 13:47, wxjmfauth@gmail.com wrote:
> This is neither a complaint nor a question, just a comment.
>
> In the previous discussion related to the flexible
> string representation, Roy Smith added this comment:
>
> http://groups.google.com/group/comp.lang.python/browse_thread/thread/2645504f459bab50/eda342573381ff42
>
> Not only do I agree with his sentence:
> "Clearly, the world has moved to a 32-bit character set."
>
> but he also used a very interesting word in his comment: "punctuation".
>
> There is a point which is, in my mind, not very well understood,
> "digested", underestimated or neglected by many developers:
> the relation between the coding of the characters and the typography.
>
> Unicode (the consortium) does not only deal with the coding of
> the characters; it has also worked on the *classification* of characters.
>
> A deliberately simplistic representation: "letters" in the bottom
> of the table, lower code points/integers; "typographic characters"
> like punctuation, common symbols, ... high in the table, high code
> points/integers.
>
> The conclusion is inescapable: if one wishes to work in a "unicode
> mode", one is forced to use the whole palette of the unicode
> code points. This is the *nature* of Unicode.
>
> Technically, believing that it is possible to optimize only a subrange
> of the unicode code point range is simply an illusion. It is a lot of
> work, probably quite complicated, which finally solves nothing.
>
> Python, in my mind, fell in this trap.
>
> "Simple is better than complex."
>    -> hard to maintain
> "Flat is better than nested."
>    -> code points range
> "Special cases aren't special enough to break the rules."
>    -> special unicode code points?
> "Although practicality beats purity."
>    -> or the opposite?
> "In the face of ambiguity, refuse the temptation to guess."
>    -> guessing a user will only work with the "optimized" char subrange.
> ...
>
> Small illustration. Take an A4 page containing 50 lines of 80 ASCII
> characters, add a single 'EM DASH' or a 'BULLET' (code points > 0x2000),
> and you will see all the optimization efforts destroyed.
>
> >>> import sys
> >>> sys.getsizeof('a' * 80 * 50)
> 4025
> >>> sys.getsizeof('a' * 80 * 50 + '•')
> 8040
>
> Just my 2 € (code point 0x20ac) cents.
>
> jmf
>

I'm looking forward to all the patches you are going to provide to
correct all these (presumably) CPython deficiencies.  When do they start
arriving on the bug tracker?
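
For anyone who wants to reproduce the size jump quoted above, here is a
minimal sketch, assuming CPython 3.3 or later, where PEP 393 stores each
string with 1, 2 or 4 bytes per code point depending on the largest code
point it contains.  The helper name char_width is hypothetical (it is not
part of any library), and the exact sys.getsizeof() figures vary with the
Python version and platform, so none are shown here.

```python
import sys


def char_width(s):
    """Bytes per code point that PEP 393 would choose for s.

    Hypothetical helper for illustration only: it mirrors the rule
    described in PEP 393 and does not inspect the real object layout.
    """
    top = max(map(ord, s)) if s else 0
    if top < 0x100:      # ASCII and Latin-1 range
        return 1
    if top < 0x10000:    # rest of the Basic Multilingual Plane
        return 2
    return 4             # supplementary planes


ascii_page = 'a' * 80 * 50                # 50 lines of 80 ASCII characters
bullet_page = ascii_page + '\u2022'       # same page plus one BULLET
astral_page = ascii_page + '\U0001F600'   # same page plus one non-BMP character

for label, text in [('ascii', ascii_page),
                    ('bullet', bullet_page),
                    ('astral', astral_page)]:
    # Print length, predicted width and the size reported by the interpreter.
    print(label, len(text), char_width(text), sys.getsizeof(text))
```

The width is chosen per string object, so in this sketch only the strings
that actually contain the bullet or the non-BMP character are stored with
wider code units; ascii_page itself is unaffected.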

-- 
Cheers.

Mark Lawrence.