| From | Joshua Landau <joshua@landau.ws> |
|---|---|
| Date | Sun, 11 Aug 2013 10:44:40 +0100 |
| Subject | Re: Could you verify this, Oh Great Unicode Experts of the Python-List? |
| To | "Steven D'Aprano" <steve+comp.lang.python@pearwood.info> |
| Cc | python-list <python-list@python.org> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.474.1376214330.1251.python-list@python.org> |
On 11 August 2013 10:09, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> The reason some accented letters have single code point forms is to
> support legacy charsets; the reason some only exist as combining
> characters is due to the combinational explosion. Some languages allow
> you to add up to five or six different accents on any of dozens of
> different letters. If each combination needed its own unique code point,
> there wouldn't be enough code points. For bonus points, if there are five
> accents that can be placed in any combination of zero or more on any of
> four characters, how many code points would be needed?
52?
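A guess like that can be checked by counting directly. This is only a sketch under two assumptions the question does not spell out: each accent is used at most once per letter, and the order of the accents does not matter. Under those assumptions every letter can carry any subset of the five accents, including the empty one.

```python
from itertools import combinations

letters = 4   # base characters in the puzzle
accents = 5   # distinct accents available

# Assumption: at most one of each accent per letter, order irrelevant,
# so each letter carries some subset of the accents (possibly none).
subsets_per_letter = sum(
    1
    for r in range(accents + 1)
    for _ in combinations(range(accents), r)
)
code_points = letters * subsets_per_letter

print(subsets_per_letter, code_points)  # 32 128
```

If the order of accents mattered, or accents could repeat, the total would be larger still.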
> Note that the form you used, b"caf\x65\xCC\x81", is the same as the first
> except that you have shown "e" in hex for some reason:
>
> py> b'\x65' == b'e'
> True
Yeah... I did that because the linked post did it. I'm not sure why either ;).
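For anyone following along, here is a small sketch of what those bytes actually hold once decoded. It uses only the standard `unicodedata` module; the variable names are just for illustration.

```python
import unicodedata

raw = b"caf\x65\xCC\x81"      # the bytes from the post; \x65 is plain "e"
text = raw.decode("utf-8")    # \xCC\x81 is the UTF-8 encoding of U+0301

# List each code point with its official Unicode name.
names = [unicodedata.name(ch) for ch in text]
for ch, name in zip(text, names):
    print(f"U+{ord(ch):04X} {name}")
```

The last two entries are LATIN SMALL LETTER E and COMBINING ACUTE ACCENT, so the string is five code points long even though it displays as four letters.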
> On Sun, 11 Aug 2013 07:17:42 +0100, Joshua Landau wrote:
>>
>> So the solution is:
>>
>> >>> import unicodedata
>> >>> len(unicodedata.normalize("NFC", tweet))
>> 4
>
> In this particular case, this will reduce the tweet to the normalised
> form that Twitter uses.
>
> [...]
>> After further testing (I don't actually use Twitter) it seems the whole
>> thing was just smoke and mirrors. The linked article is a lie, at least
>> on the user's end.
>
> Which linked article? The one on dev.twitter.com seems to be okay to me.
That's the one.
> Of course, they might be lying when they say "Twitter counts the length
> of a Tweet using the Normalization Form C (NFC) version of the text", I
> have no idea. But they seem to have a good grasp of the issues involved,
> and assuming they do what they say, at least Western European users
> should be happy.
They *don't* seem to be doing what they say.
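For reference, this is roughly what counting on the NFC form should give for the test string. It is a sketch using `unicodedata.normalize` from the standard library, with the decomposed string written out explicitly rather than taken from a real tweet.

```python
import unicodedata

decomposed = "cafe\u0301"     # "e" followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))   # 5 4
print(composed == "caf\u00e9")          # True: NFC yields the single code point U+00E9

# NFD goes the other way and restores the five-code-point form.
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```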
>> On Linux you can prove this by running:
>>
>> >>> import subprocess
>> >>> p = subprocess.Popen(['xsel', '-bi'], stdin=subprocess.PIPE)
>> >>> p.communicate(input=b"caf\x65\xCC\x81")
>> (None, None)
>>
>> "café" will be in your Copy-Paste buffer, and you can paste it into
>> the tweet-box. It takes 5 characters. So much for testing ;).
>
> How do you know that it takes 5 characters? Is that some Javascript
> widget? I'd blame buggy Javascript before Twitter.
I go to twitter.com, log in and press that odd blue compose button in
the top-right. After pasting, it says I have 135 (down from 140)
characters left.
My only question here is, since you can't post after 140
non-normalised characters, who cares if the server counts it as less?
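The two counting rules differ by exactly one for this string. A rough sketch of the arithmetic follows; `remaining` and `LIMIT` are names made up for illustration and are not taken from Twitter's code or API.

```python
import unicodedata

LIMIT = 140  # the tweet length limit mentioned above

def remaining(text, normalise):
    """Characters left, counting code points raw or after NFC (hypothetical helper)."""
    if normalise:
        text = unicodedata.normalize("NFC", text)
    return LIMIT - len(text)

tweet = b"caf\x65\xCC\x81".decode("utf-8")

print(remaining(tweet, normalise=False))  # 135, what the compose box showed
print(remaining(tweet, normalise=True))   # 136, what counting on the NFC form would give
```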
> If this shows up in your application as café rather than café, it is a
> bug in the text rendering engine. Some applications do not deal with
> combining characters correctly.
Why the rendering engine?
> (It's a hard problem to solve, and really needs support from the font. In
> some languages, the same accent will appear in different places depending
> on the character it is attached to, or on the other accents present as
> well. Or so I've been led to believe.)
>
>
>> ¹ https://dev.twitter.com/docs/counting-characters#Definition_of_a_Character
>
> Looks reasonable to me. No obvious errors to my eyes.
*Not sure whether you're talking about the link or my post*