From: Joshua Landau
Date: Sun, 11 Aug 2013 10:44:40 +0100
Subject: Re: Could you verify this, Oh Great Unicode Experts of the Python-List?
To: Steven D'Aprano
Cc: python-list
Newsgroups: comp.lang.python

On 11 August 2013 10:09, Steven D'Aprano wrote:
> The reason some accented letters have single code point forms is to
> support legacy charsets; the reason some only exist as combining
> characters is due to the combinational explosion. Some languages allow
> you to add up to five or six different accents on any of dozens of
> different letters. If each combination needed its own unique code point,
> there wouldn't be enough code points. For bonus points, if there are five
> accents that can be placed in any combination of zero or more on any of
> four characters, how many code points would be needed?

52?

> Note that the form you used, b"caf\x65\xCC\x81", is the same as the first
> except that you have shown "e" in hex for some reason:
>
> py> b'\x65' == b'e'
> True

Yeah... I did that because the linked post did it. I'm not sure why
either ;).

> On Sun, 11 Aug 2013 07:17:42 +0100, Joshua Landau wrote:
>>
>> So the solution is:
>>
>> >>> import unicodedata
>> >>> len(unicodedata.normalize("NFC", tweet))
>> 4
>
> In this particular case, this will reduce the tweet to the normalised
> form that Twitter uses.
>
> [...]
>> After further testing (I don't actually use Twitter) it seems the whole
>> thing was just smoke and mirrors. The linked article is a lie, at least
>> on the user's end.
>
> Which linked article? The one on dev.twitter.com seems to be okay to me.

That's the one.
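A minimal sketch of the normalisation check quoted above, assuming Python 3
and assuming `tweet` is the decomposed byte string from the linked post
decoded as UTF-8 (the thread never shows that assignment):

```python
import unicodedata

# Assumption: `tweet` is the decomposed form from the linked post,
# i.e. "e" followed by U+0301 COMBINING ACUTE ACCENT.
tweet = b"caf\x65\xCC\x81".decode("utf-8")

# NFC composes "e" + U+0301 into the single code point U+00E9.
nfc = unicodedata.normalize("NFC", tweet)

print(len(tweet))  # 5 code points before normalisation
print(len(nfc))    # 4 code points after normalisation
```

This only shows what Python reports; it says nothing about what Twitter's
own code does.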
> Of course, they might be lying when they say "Twitter counts the length
> of a Tweet using the Normalization Form C (NFC) version of the text", I
> have no idea. But they seem to have a good grasp of the issues involved,
> and assuming they do what they say, at least Western European users
> should be happy.

They *don't* seem to be doing what they say.

>> On Linux you can prove this by running:
>>
>> >>> p = subprocess.Popen(['xsel', '-bi'], stdin=subprocess.PIPE)
>> >>> p.communicate(input=b"caf\x65\xCC\x81")
>> (None, None)
>>
>> "café" will be in your Copy-Paste buffer, and you can paste it into
>> the tweet-box. It takes 5 characters. So much for testing ;).
>
> How do you know that it takes 5 characters? Is that some Javascript
> widget? I'd blame buggy Javascript before Twitter.

I go to twitter.com, log in and press that odd blue compose button in
the top-right. After pasting, it says I have 135 (down from 140)
characters left.

My only question here is: since you can't post after 140 non-normalised
characters, who cares if the server counts it as less?

> If this shows up in your application as café rather than café, it is a
> bug in the text rendering engine. Some applications do not deal with
> combining characters correctly.

Why the rendering engine?

> (It's a hard problem to solve, and really needs support from the font. In
> some languages, the same accent will appear in different places depending
> on the character they are attached to, or the other accents there as
> well. Or so I've been led to believe.)
>
>
>> ¹ https://dev.twitter.com/docs/counting-characters#Definition_of_a_Character
>
> Looks reasonable to me. No obvious errors to my eyes.

*Not sure whether talking about the link or my post*
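For what it's worth, the 135-versus-136 discrepancy described above can be
reproduced with a rough sketch like this. It assumes Python 3; the names
`LIMIT`, `remaining_raw` and `remaining_nfc` are hypothetical and are not
Twitter's code, just two possible counting rules side by side:

```python
import unicodedata

LIMIT = 140  # tweet length limit mentioned in this thread


def remaining_raw(text):
    """Characters left if every code point is counted as typed."""
    return LIMIT - len(text)


def remaining_nfc(text):
    """Characters left if the text is NFC-normalised before counting."""
    return LIMIT - len(unicodedata.normalize("NFC", text))


# Decomposed "cafe" + U+0301, as pasted from the clipboard in the test above.
pasted = "cafe\u0301"

print(remaining_raw(pasted))  # 135, matching what the compose box showed
print(remaining_nfc(pasted))  # 136, what NFC counting would give
```

If the compose box counts the first way and the server the second, the
stricter client-side count is the one a user actually runs into.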