Path: csiph.com!usenet.pasdenom.info!aioe.org!news.stack.nl!newsfeed.xs4all.nl!newsfeed1.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <CAN1F8qXgBwTGSkbP3N1uJZPJw1CY=4O4ptQurV-2=Gmm4UiYbw@mail.gmail.com>
References: <CAN1F8qXgBwTGSkbP3N1uJZPJw1CY=4O4ptQurV-2=Gmm4UiYbw@mail.gmail.com>
Date: Sun, 11 Aug 2013 07:24:23 +0100
Subject: Re: Could you verify this, Oh Great Unicode Experts of the Python-List?
From: Chris Angelico <rosuav@gmail.com>
To: python-list@python.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.471.1376211637.1251.python-list@python.org>
Lines: 23
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:52371

On Sun, Aug 11, 2013 at 7:17 AM, Joshua Landau <joshua@landau.ws> wrote:
> Given tweet = b"caf\x65\xCC\x81".decode():
>
>     >>> tweet
>     'café'
>
> But:
>
>     >>> len(tweet)
>     5

You're now looking at the difference between what displays as a single
character and the code points it's built from: here, a base letter plus
a combining character. Twitter counts combining characters, so when you
build one "thing" out of lots of separately-typed parts, it does count
as more characters.
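
Here's a minimal sketch of what's going on with that string, assuming
Python 3 and nothing beyond the standard unicodedata module. The names
nfc and combining are just mine for illustration:

```python
import unicodedata

# 'cafe' followed by U+0301 COMBINING ACUTE ACCENT (bytes CC 81 in UTF-8)
tweet = b"caf\x65\xCC\x81".decode("utf-8")

# len() counts code points, so the accent is counted separately.
print(len(tweet))

# NFC normalisation composes 'e' + U+0301 into the single code point U+00E9.
nfc = unicodedata.normalize("NFC", tweet)
print(len(nfc))

# unicodedata.combining() returns a nonzero class for combining characters.
combining = [c for c in tweet if unicodedata.combining(c)]
print([unicodedata.name(c) for c in combining])
```

Both strings display the same, but they aren't equal and don't have the
same length. Not every combining sequence has a precomposed form, so
normalising won't always bring the count down to what you see on screen.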

Read this article for some arguments on the subject, including a
number of references to Twitter itself:

http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/

ChrisA