Re: Could you verify this, Oh Great Unicode Experts of the Python-List?

References <CAN1F8qXgBwTGSkbP3N1uJZPJw1CY=4O4ptQurV-2=Gmm4UiYbw@mail.gmail.com> <CAPTjJmp1PFP8d8F1KVfaatDA56NszJK3SqO_sGORhhOEeGaJ+w@mail.gmail.com>
From Joshua Landau <joshua@landau.ws>
Date 2013-08-11 10:54 +0100
Subject Re: Could you verify this, Oh Great Unicode Experts of the Python-List?
Newsgroups comp.lang.python
Message-ID <mailman.475.1376214926.1251.python-list@python.org> (permalink)


On 11 August 2013 07:24, Chris Angelico <rosuav@gmail.com> wrote:
> On Sun, Aug 11, 2013 at 7:17 AM, Joshua Landau <joshua@landau.ws> wrote:
>> Given tweet = b"caf\x65\xCC\x81".decode():
>>
>>     >>> tweet
>>     'café'
>>
>> But:
>>
>>     >>> len(tweet)
>>     5
>
> You're now looking at the difference between glyphs and combining
> characters. Twitter counts combining characters, so when you build one
> "thing" out of lots of separately-typed parts, it does count as more
> characters.

@https://dev.twitter.com/docs/counting-characters#Definition_of_a_Character
> The "café" issue mentioned above raises the question of how you count
> the characters in the Tweet string "café". To the human eye the length is
> clearly four characters. Depending on how the data is represented this
> could be either five or six UTF-8 bytes. Twitter does not want to penalize
> a user for the fact we use UTF-8 or for the fact that the API client in
> question used the longer representation. Therefore, Twitter does count
> "café" as four characters no matter which representation is sent.

Which would imply that Twitter doesn't count combining characters
separately, even though the web interface seems to.
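
For what it's worth, here is a rough sketch of the two counts. It assumes
the "four characters" figure comes from counting code points after NFC
normalization, which is my reading of the page quoted above rather than
anything I have checked against Twitter's actual code. The helper name
nfc_length is made up for illustration.

```python
import unicodedata

def nfc_length(text):
    """Count code points after NFC normalization (hypothetical helper;
    assumes this matches how the quoted page counts characters)."""
    return len(unicodedata.normalize("NFC", text))

# 'e' followed by U+0301 COMBINING ACUTE ACCENT (the longer representation)
tweet = b"caf\x65\xCC\x81".decode("utf-8")

# The same word with the precomposed U+00E9 (the shorter representation)
composed = unicodedata.normalize("NFC", tweet)

print(len(tweet))                     # 5 code points as sent
print(nfc_length(tweet))              # 4 after normalization
print(len(tweet.encode("utf-8")))     # 6 UTF-8 bytes
print(len(composed.encode("utf-8")))  # 5 UTF-8 bytes
```

So plain len() sees the combining accent as its own code point, while the
normalized count agrees with the "four characters" in the quote, and the
byte counts match its "five or six UTF-8 bytes".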

> Read this article for some arguments on the subject, including a
> number of references to Twitter itself:
>
> http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/

I read that *last* time you pointed it out :P. It's a good link, though.

--
Anyhow, it's good to know I haven't been obviously stupid with my
understanding of Unicode. I learnt it all from this list anyway;
wouldn't want to disappoint!
