Sender: joshua.landau.ws@gmail.com
In-Reply-To: <520754d7$0$30000$c3e8da3$5496439d@news.astraweb.com>
References: <mailman.468.1376201912.1251.python-list@python.org> <520754d7$0$30000$c3e8da3$5496439d@news.astraweb.com>
From: Joshua Landau <joshua@landau.ws>
Date: Sun, 11 Aug 2013 10:44:40 +0100
Subject: Re: Could you verify this, Oh Great Unicode Experts of the Python-List?
To: "Steven D'Aprano" <steve+comp.lang.python@pearwood.info>
Cc: python-list <python-list@python.org>
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.474.1376214330.1251.python-list@python.org>

On 11 August 2013 10:09, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> The reason some accented letters have single code point forms is to
> support legacy charsets; the reason some only exist as combining
> characters is due to the combinational explosion. Some languages allow
> you to add up to five or six different accents on any of dozens of
> different letters. If each combination needed its own unique code point,
> there wouldn't be enough code points. For bonus points, if there are five
> accents that can be placed in any combination of zero or more on any of
> four characters, how many code points would be needed?

52?
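
A quick way to sanity-check that guess is to count the combinations directly. This is only a sketch, and it assumes that the order of the accents doesn't matter and that each accent appears at most once per letter; neither assumption is stated in the question.

```python
from itertools import combinations

ACCENTS = 5   # number of distinct accents (from the question)
LETTERS = 4   # number of base letters (from the question)

# Count every subset of the accents, from the empty set up to all five.
subsets = sum(
    1
    for r in range(ACCENTS + 1)
    for _ in combinations(range(ACCENTS), r)
)

# Each subset can sit on each base letter.
code_points = subsets * LETTERS
print(subsets, code_points)
```

Under those assumptions the count is 2**5 * 4 = 128, so 52 looks too low.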

> Note that the form you used, b"caf\x65\xCC\x81", is the same as the first
> except that you have shown "e" in hex for some reason:
>
> py> b'\x65' == b'e'
> True

Yeah... I did that because the linked post did it. I'm not sure why either ;).

> On Sun, 11 Aug 2013 07:17:42 +0100, Joshua Landau wrote:
>>
>> So the solution is:
>>
>>     >>> import unicodedata
>>     >>> len(unicodedata.normalize("NFC", tweet))
>>     4
>
> In this particular case, this will reduce the tweet to the normalised
> form that Twitter uses.
>
> [...]
>> After further testing (I don't actually use Twitter) it seems the whole
>> thing was just smoke and mirrors. The linked article is a lie, at least
>> on the user's end.
>
> Which linked article? The one on dev.twitter.com seems to be okay to me.

That's the one.

> Of course, they might be lying when they say "Twitter counts the length
> of a Tweet using the Normalization Form C (NFC) version of the text", I
> have no idea. But they seem to have a good grasp of the issues involved,
> and assuming they do what they say, at least Western European users
> should be happy.

They *don't* seem to be doing what they say.

>> On Linux you can prove this by running:
>>
>>     >>> p = subprocess.Popen(['xsel', '-bi'], stdin=subprocess.PIPE)
>>     >>> p.communicate(input=b"caf\x65\xCC\x81")
>>     (None, None)
>>
>> "café" will be in your Copy-Paste buffer, and you can paste it into
>> the tweet-box. It takes 5 characters. So much for testing ;).
>
> How do you know that it takes 5 characters? Is that some Javascript
> widget? I'd blame buggy Javascript before Twitter.

I go to twitter.com, log in and press that odd blue compose button in
the top-right. After pasting, it says I have 135 (down from 140)
characters left.
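
To make the two possible counts concrete, here is a rough sketch. The helper name `chars_left` is made up for illustration, and I am assuming the counter simply counts code points; I don't know how Twitter's compose box actually does it.

```python
import unicodedata

LIMIT = 140  # tweet length limit at the time of writing

def chars_left(text, normalise=False):
    """Return how many characters remain, counting code points.

    Hypothetical helper; with normalise=True the text is converted
    to NFC before counting, as the Twitter docs describe.
    """
    if normalise:
        text = unicodedata.normalize("NFC", text)
    return LIMIT - len(text)

# 'cafe' followed by U+0301 COMBINING ACUTE ACCENT, as in the xsel example.
tweet = b"caf\x65\xCC\x81".decode("utf-8")

raw_left = chars_left(tweet)                  # counts 5 code points
nfc_left = chars_left(tweet, normalise=True)  # counts 4 code points
print(raw_left, nfc_left)
```

The unnormalised count gives 135, which matches what the compose box shows; an NFC-based count would give 136.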

My only question here is, since you can't post more than 140
non-normalised characters, who cares if the server counts it as fewer?

> If this shows up in your application as café rather than café, it is a
> bug in the text rendering engine. Some applications do not deal with
> combining characters correctly.

Why the rendering engine?
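
For reference, the two forms differ only in their code-point sequence. A small sketch (the variable names are mine) that lists the code points and checks that both normalise to the same string:

```python
import unicodedata

decomposed = "cafe\u0301"   # 'e' + COMBINING ACUTE ACCENT, as pasted via xsel
precomposed = "caf\u00e9"   # single code point for the accented letter

# Show exactly which code points each string contains.
for label, s in (("decomposed", decomposed), ("precomposed", precomposed)):
    print(label, [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in s])

# Canonically equivalent text, but not equal as raw strings.
same_text = unicodedata.normalize("NFC", decomposed) == precomposed
different_code_points = decomposed != precomposed
print(same_text, different_code_points)
```

Since the stored data is valid either way, whether the accent ends up drawn on top of the "e" is up to whatever displays the string.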

> (It's a hard problem to solve, and really needs support from the font. In
> some languages, the same accent will appear in different places depending
> on the character they are attached to, or the other accents there as
> well. Or so I've been led to believe.)
>
>
>> ¹ https://dev.twitter.com/docs/counting-characters#Definition_of_a_Character
>
> Looks reasonable to me. No obvious errors to my eyes.

*Not sure whether you're talking about the link or my post*