To: python-list@python.org
From: Mark Lawrence <breamoreboy@yahoo.co.uk>
Subject: Re: Flexible string representation, unicode, typography, ...
Date: Thu, 23 Aug 2012 20:34:29 +0100
References: <a81cd504-d889-4aa1-9daa-6df3448b4da8@googlegroups.com> <D7udnfbyKvHEqqvNnZ2dnUVZ_sidnZ2d@westnet.com.au> <7eaafbcd-597d-4f8c-98a8-ecb537e6e065@googlegroups.com>
In-Reply-To: <7eaafbcd-597d-4f8c-98a8-ecb537e6e065@googlegroups.com>
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.3731.1345750334.4697.python-list@python.org>
Lines: 73
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:27763

On 23/08/2012 19:33, wxjmfauth@gmail.com wrote:
> On Thursday, 23 August 2012 15:57:50 UTC+2, Neil Hodgson wrote:
>> wxjmfauth@gmail.com:
>>
>>> Small illustration. Take an A4 page containing 50 lines of 80 ASCII
>>> characters, add a single 'EM DASH' or a 'BULLET' (code points > 0x2000),
>>> and you will see all the optimization efforts destroyed.
>>>
>>> >>> sys.getsizeof('a' * 80 * 50)
>>> 4025
>>> >>> sys.getsizeof('a' * 80 * 50 + '•')
>>> 8040
>>
>> This example still benefits from halving the number of bytes
>> compared with the 32 bits per character used in Python 3.2:
>>
>> >>> sys.getsizeof('a' * 80 * 50)
>> 16032
>> >>> sys.getsizeof('a' * 80 * 50 + '•')
>> 16036
>>
> Correct, but how many times does it happen?
> Practically never.
>
> In this Unicode business, I'm fascinated by the obsession
> with solving a problem which is, due to the nature of
> Unicode, unsolvable.
>
> For every optimization algorithm, for every code
> point range you can optimize, it is always possible
> to find a case breaking that optimization.
>
> This more or less follows mathematical logic. To prove a
> law is valid, you have to prove that all the cases
> are valid. To prove a law is invalid, just find one
> case that breaks it.
>
> Sure, it is possible to optimize Unicode usage
> by not using French characters, punctuation, mathematical
> symbols, currency symbols, CJK characters...
> (select undesired characters here: http://www.unicode.org/charts/).
>
> In that case, why use Unicode?
> (A problem not specific to Python)
>
> jmf
>

What do you propose should be used instead, as you appear to be the 
resident expert in the field?
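
For anyone who wants to check the figures quoted above, here is a minimal
sketch of the width rule that the flexible string representation (PEP 393)
is generally described as using: 1, 2 or 4 bytes per code point, chosen from
the highest code point in the string. The helper names pep393_width and
payload_bytes are made up for this illustration, and the sizes reported by
sys.getsizeof include object overhead that varies by platform and Python
version, so only the payload arithmetic is meant to be taken literally.

```python
import sys


def pep393_width(s):
    """Bytes per code point, assuming the PEP 393 rule (hypothetical helper)."""
    if not s:
        return 1
    highest = max(map(ord, s))
    if highest < 0x100:      # ASCII and Latin-1 fit in one byte
        return 1
    if highest < 0x10000:    # rest of the Basic Multilingual Plane
        return 2
    return 4                 # anything in the astral planes


def payload_bytes(s):
    """Character data only, excluding the object header (hypothetical helper)."""
    return len(s) * pep393_width(s)


page = 'a' * 80 * 50
samples = [
    ('ASCII page', page),
    ('page + BULLET', page + '\u2022'),
    ('page + astral char', page + '\U0001F600'),
]

for label, text in samples:
    # getsizeof is printed for comparison only; its value is build-dependent.
    print(label, pep393_width(text), payload_bytes(text), sys.getsizeof(text))
```

Read this way, the 8040 bytes quoted above are 8002 bytes of character data
plus object overhead, whereas a fixed 4-bytes-per-character layout needs
16004 bytes of character data for the same string.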

-- 
Cheers.

Mark Lawrence.