From: Ned Batchelder
To: python-list@python.org
Subject: Re: 'Straße' ('Strasse') and Python 2
Date: Wed, 15 Jan 2014 07:13:36 -0500
References: <30dfa6f1-61b2-49b8-bc65-5fd18d498c38@googlegroups.com> <52D67873.2010502@chamonix.reportlab.co.uk>
Newsgroups: comp.lang.python

On 1/15/14 7:00 AM, Robin Becker wrote:
> On 12/01/2014 07:50, wxjmfauth@gmail.com wrote:
>> >>> sys.version
>> 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]
>> >>> s = 'Straße'
>> >>> assert len(s) == 6
>> >>> assert s[5] == 'e'
>> >>>
>>
>> jmf
>>
>
> On my utf8 based system
>
>
>> robin@everest ~:
>> $ cat ooo.py
>> if __name__=='__main__':
>>     import sys
>>     s='A̅B'
>>     print('version_info=%s\nlen(%s)=%d' % (sys.version_info,s,len(s)))
>> robin@everest ~:
>> $ python ooo.py
>> version_info=sys.version_info(major=3, minor=3, micro=3,
>> releaselevel='final', serial=0)
>> len(A̅B)=3
>> robin@everest ~:
>> $
>
>
> so two 'characters' are 3 (or 2 or more) codepoints. If I want to
> isolate so called graphemes I need an algorithm even for python's
> unicode ie when it really matters, python3 str is just another encoding.

You are right that more than one codepoint can make up a grapheme, and
that you'll need code to deal with the correspondence between them. But
let's not muddy these already confusing waters by referring to that
mapping as an encoding.

In Unicode terms, an encoding is a mapping between codepoints and bytes.
Python 3's str is a sequence of codepoints.

-- 
Ned Batchelder, http://nedbatchelder.com
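
To make the distinction concrete, here is a minimal Python 3 sketch that keeps the three layers apart: codepoints (what str holds), bytes (what an encoding produces), and graphemes (what a reader sees). The helper rough_clusters is a hypothetical name introduced only for this illustration. It attaches combining marks to the preceding base codepoint, which is a rough approximation and not the full grapheme cluster algorithm from the Unicode standard.

```python
import unicodedata

# The string from the quoted example: 'A', U+0305 COMBINING OVERLINE, 'B'.
s = "A\u0305B"

# Layer 1: a Python 3 str is a sequence of codepoints.
codepoints = ["U+%04X" % ord(c) for c in s]
n_codepoints = len(s)

# Layer 2: an encoding maps those codepoints to bytes. The byte count
# depends on the encoding chosen, while the codepoint count does not.
utf8_bytes = s.encode("utf-8")
utf16_bytes = s.encode("utf-16-le")


def rough_clusters(text):
    """Group each codepoint with the combining marks that follow it.

    Hypothetical helper for illustration only: an approximation of
    grapheme segmentation, not the full Unicode algorithm.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the mark to the previous base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


# Layer 3: graphemes are computed from codepoints by an algorithm.
clusters = rough_clusters(s)

print(codepoints, n_codepoints)
print(len(utf8_bytes), len(utf16_bytes))
print(len(clusters))
```

The same separation accounts for the 'Straße' example: the str always has six codepoints, but encoding it yields seven bytes in UTF-8 and six in Latin-1, so the length of a byte string depends on which encoding produced it.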