Date: Mon, 01 Apr 2013 18:53:44 +0100
From: MRAB <python@mrabarnett.plus.com>
To: python-list@python.org
Subject: Re: Performance of int/long in Python 3
References: <mailman.3703.1364248275.2939.python-list@python.org> <5153a12d$0$29998$c3e8da3$5496439d@news.astraweb.com> <mailman.3845.1364441182.2939.python-list@python.org> <d2cc443a-e049-42ed-abc6-66b5ea600fe7@j1g2000pbq.googlegroups.com> <5153d313$0$29984$c3e8da3$5496439d@news.astraweb.com> <0b6be19c-ff11-4e24-a7dc-fec0af411393@kw7g2000pbb.googlegroups.com> <5153f5ce$0$29984$c3e8da3$5496439d@news.astraweb.com> <fca3aa1f-86a7-4c96-88c2-893ec95ba306@googlegroups.com> <mailman.3914.1364503941.2939.python-list@python.org> <11ef1d36-0783-4cb2-b29f-9ae573ed7e47@googlegroups.com> <mailman.3975.1364598987.2939.python-list@python.org> <f34a1c9a-57a3-45fd-b7e0-c33d06e02335@j9g2000vbz.googlegroups.com> <mailman.4013.1364734495.2939.python-list@python.org> <6a146aba-a032-4aac-b2d3-7acedcebd804@q3g2000pbv.googlegroups.com> <515941d8$0$29967$c3e8da3$5496439d@news.astraweb.com> <roy-04FB84.08155301042013@news.panix.com> <5159beb6$0$29967$c3e8da3$5496439d@news.astraweb.com>
In-Reply-To: <5159beb6$0$29967$c3e8da3$5496439d@news.astraweb.com>
Reply-To: python-list@python.org
Newsgroups: comp.lang.python
Message-ID: <mailman.5.1364838820.17481.python-list@python.org>

On 01/04/2013 18:07, Steven D'Aprano wrote:
> On Mon, 01 Apr 2013 08:15:53 -0400, Roy Smith wrote:
>
>> In article <515941d8$0$29967$c3e8da3$5496439d@news.astraweb.com>,
>>  Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
>>
>>> [...]
>>> >> OK, that leads to the next question.  Is there any way I can (in
>>> >> Python 2.7) detect when a string is not entirely in the BMP?  If I
>>> >> could find all the non-BMP characters, I could replace them with
>>> >> U+FFFD (REPLACEMENT CHARACTER) and life would be good (enough).
>>>
>>> Of course you can do this, but you should not. If your input data
>>> includes character C, you should deal with character C and not just
>>> throw it away unnecessarily. That would be rude, and in Python 3.3 it
>>> should be unnecessary.
>>
>> The import job isn't done yet, but so far we've processed 116 million
>> records and had to clean up four of them.  I can live with that.
>> Sometimes practicality trumps correctness.
>
> Well, true. It has to be said that few programming languages (and
> databases) make it easy to do the right thing. On the other hand, you're
> a programmer. Your job is to write correct code, not easy code.
>
>
>> It turns out, the problem is that the version of MySQL we're using
>
> Well there you go. Why don't you use a real database?
>
> http://www.postgresql.org/docs/9.2/static/multibyte.html
>
> :-)
>
> PostgreSQL has supported non-broken UTF-8 since at least version 8.1.
>
>
>> doesn't support non-BMP characters.  Newer versions do (but you have to
>> declare the column to use the utf8mb4 character set).  I could upgrade
>> to a newer MySQL version, but it's just not worth it.
>
> My brain just broke. So-called "UTF-8" in MySQL only includes up to a
> maximum of three-byte characters. There has *never* been a time where
> UTF-8 excluded four-byte characters. What were the developers thinking,
> arbitrarily cutting out support for 50% of UTF-8?
>
[snip]
50%? The BMP is only one of 17 planes, so wouldn't that be 94% (16/17)?
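
For what it's worth, here is a rough sketch of the cleanup Roy describes, plus the arithmetic behind the 94% figure. It assumes Python 3.3 or later, where a str is a sequence of code points; on a narrow Python 2.7 build the astral characters show up as surrogate pairs, so the pattern would have to match those pairs instead. The names replace_non_bmp and has_non_bmp are made up for illustration, not taken from any library.

```python
import re

# Matches any code point outside the Basic Multilingual Plane,
# i.e. U+10000 through U+10FFFF (planes 1 to 16).
_NON_BMP = re.compile("[\U00010000-\U0010FFFF]")


def has_non_bmp(text):
    """Return True if text contains at least one non-BMP character."""
    return _NON_BMP.search(text) is not None


def replace_non_bmp(text, replacement="\uFFFD"):
    """Return text with each non-BMP character replaced.

    The default replacement is U+FFFD (REPLACEMENT CHARACTER), which
    is itself in the BMP and encodes to three bytes in UTF-8.
    """
    return _NON_BMP.sub(replacement, text)


# Unicode has 17 planes; all but the BMP (plane 0) need four bytes
# per character in UTF-8.
non_bmp_share = 16 / 17

if __name__ == "__main__":
    sample = "a\U0001F600b"
    print(has_non_bmp(sample))
    print(replace_non_bmp(sample) == "a\uFFFDb")
    print("{:.0%}".format(non_bmp_share))
```

Whether throwing the characters away is acceptable is the judgment call discussed above; the sketch only shows that detecting them is cheap.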