| References | <CAN1F8qXgBwTGSkbP3N1uJZPJw1CY=4O4ptQurV-2=Gmm4UiYbw@mail.gmail.com> <CAPTjJmp1PFP8d8F1KVfaatDA56NszJK3SqO_sGORhhOEeGaJ+w@mail.gmail.com> |
|---|---|
| From | Joshua Landau <joshua@landau.ws> |
| Date | 2013-08-11 10:54 +0100 |
| Subject | Re: Could you verify this, Oh Great Unicode Experts of the Python-List? |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.475.1376214926.1251.python-list@python.org> (permalink) |
On 11 August 2013 07:24, Chris Angelico <rosuav@gmail.com> wrote:
> On Sun, Aug 11, 2013 at 7:17 AM, Joshua Landau <joshua@landau.ws> wrote:
>> Given tweet = b"caf\x65\xCC\x81".decode():
>>
>> >>> tweet
>> 'café'
>>
>> But:
>>
>> >>> len(tweet)
>> 5
>
> You're now looking at the difference between glyphs and combining
> characters. Twitter counts combining characters, so when you build one
> "thing" out of lots of separately-typed parts, it does count as more
> characters.

@https://dev.twitter.com/docs/counting-characters#Definition_of_a_Character
> The "café" issue mentioned above raises the question of how you count
> the characters in the Tweet string "café". To the human eye the length is
> clearly four characters. Depending on how the data is represented this
> could be either five or six UTF-8 bytes. Twitter does not want to penalize
> a user for the fact we use UTF-8 or for the fact that the API client in
> question used the longer representation. Therefore, Twitter does count
> "café" as four characters no matter which representation is sent.

Which would imply that twitter doesn't count combining characters, even
though the web interface seems to.

> Read this article for some arguments on the subject, including a
> number of references to Twitter itself:
>
> http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/

I read that *last* time you pointed it out :P. It's a good link, though.

--

Anyhow, it's good to know I haven't been obviously stupid with my
understanding of Unicode. I learnt it all from this list anyway; wouldn't
want to disappoint!
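For anyone following along, here is a minimal sketch of the "five or six UTF-8 bytes" point and of one way to get a count that does not depend on which representation was sent. It assumes the count is taken over code points after NFC normalization; `nfc_length` is a hypothetical helper named for illustration, not anything taken from Twitter's code or API.

```python
import unicodedata


def nfc_length(text):
    """Count code points after NFC normalization (hypothetical helper)."""
    return len(unicodedata.normalize("NFC", text))


# Decomposed form from the post: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = b"caf\x65\xCC\x81".decode("utf-8")
# Composed form: a single U+00E9 code point for the accented letter.
composed = unicodedata.normalize("NFC", decomposed)

# Code points and UTF-8 bytes for each representation.
print(len(decomposed), len(decomposed.encode("utf-8")))  # 5 6
print(len(composed), len(composed.encode("utf-8")))      # 4 5

# After normalization both representations give the same count.
print(nfc_length(decomposed), nfc_length(composed))      # 4 4
```

NFC only helps where a precomposed code point exists. A base letter with a combining mark that has no precomposed form still counts as two code points here, so this is not a count of user-perceived characters.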