Date: Thu, 25 Jul 2013 21:06:21 -0600
From: Michael Torrie <torriem@gmail.com>
To: python-list@python.org
Subject: Re: RE Module Performance
Newsgroups: comp.lang.python
Message-ID: <mailman.5126.1374807999.3114.python-list@python.org>

On 07/25/2013 01:07 PM, wxjmfauth@gmail.com wrote:
> Let start with a simple string \textemdash or \texttendash
> 
>>>> sys.getsizeof('Ц')
> 40
>>>> sys.getsizeof('a')
> 26

That's meaningless.  You're comparing the fixed overhead of the string
object itself (a one-time cost per string), not the cost of storing the
actual characters.  The only meaningful comparison is the marginal cost
of one more character:

>>> sys.getsizeof('ЦЦ') - sys.getsizeof('Ц')

>>> sys.getsizeof('aa') - sys.getsizeof('a')
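
A rough sketch of that comparison, assuming CPython 3.3 or later (where
sys.getsizeof reports the string object's own allocation). The helper
name per_char_cost is made up for illustration; other implementations
may report different numbers or not support getsizeof at all:

```python
import sys

def per_char_cost(ch, n=10):
    """Marginal bytes for one more character (hypothetical helper).

    Subtracting the sizes of two strings that differ by one character
    cancels out the fixed per-object overhead.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

cost_ascii = per_char_cost('a')            # code point < 128
cost_bmp = per_char_cost('Ц')              # U+0426, needs 2 bytes
cost_astral = per_char_cost('\U0001F600')  # outside the BMP, needs 4 bytes

print(cost_ascii, cost_bmp, cost_astral)   # CPython 3.3+: 1 2 4
```

The fixed overhead differs between the two kinds of string, which is
all the 26 vs. 40 figures show; the per-character cost is what scales
with the length of the text.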

Actually, I'm not even sure what your point is after all this time
railing against the FSR.  You have said in the past that Python
penalizes users of character sets that require wider encodings, but
what would you have us do?  Use 4-byte characters and penalize everyone
equally?  Use 2-byte characters that incorrectly expose surrogate pairs
for some characters?  Use UTF-8 in memory and do O(n) indexing?  Are
your programs (actual programs, not contrived benchmarks) actually
slower because of the FSR?  Is the FSR incorrect?  If so, according to
what part of the Unicode standard?  I'm not trying to troll, or feed
the troll.  I'm actually curious.
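
To make those alternatives concrete, here is a small sketch that only
uses the standard codecs to count storage units. It illustrates the
trade-offs and is not a model of how any interpreter stores strings
internally; the sample string is arbitrary:

```python
# One ASCII letter, one astral (non-BMP) character, one Cyrillic letter.
s = 'a\U0001F600Ц'

code_points = len(s)                            # 3: what indexing should see

# Fixed 4 bytes per character: simple indexing, everyone pays 4x.
utf32_bytes = len(s.encode('utf-32-le'))        # 12

# 2-byte units: the astral character becomes a surrogate pair, so a
# naive "length in units" is 4, not 3.
utf16_units = len(s.encode('utf-16-le')) // 2   # 4

# UTF-8: compact, but widths vary (1 + 4 + 2 bytes here), so finding
# the n-th character means scanning from the start.
utf8_bytes = len(s.encode('utf-8'))             # 7

print(code_points, utf32_bytes, utf16_units, utf8_bytes)  # 3 12 4 7
```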

I think perhaps you feel that many of us who don't use Unicode often
don't understand Unicode, because some of us don't understand you.  If
so, I'm not sure that's actually true.