From: Ian Kelly
Date: Sat, 25 Aug 2012 16:26:56 -0600
Subject: Re: Flexible string representation, unicode, typography, ...
Newsgroups: comp.lang.python

On Sat, Aug 25, 2012 at 9:47 AM, wrote:
> For those you do not know, the go language has introduced
> the rune type. As far as I know, nobody is complaining, I
> have not even seen a discussion related to this subject.

Python has that also. We call it "int".

More seriously, strings in Go are not sequences of runes. They're
actually arrays of UTF-8 bytes. That means that they're quite
efficient for ASCII strings, at the expense of other characters, like
Chinese (wait, this sounds familiar for some reason). It also means
that you have to bend over backwards if you want to work with actual
runes instead of bytes. Want to know how many characters are in your
string? Don't call len() on it -- that will only tell you how many
bytes are in it. Don't try to index or slice it either -- that will
(accidentally) work for ASCII strings, but for other strings your
indexes will be wrong.
If you're unlucky you might even split up the string in the middle of a
character, and now your string has invalid characters in it. The right
way to do it looks something like this:

    len([]rune("白鵬翔"))  // get the length of the string in characters
    string([]rune("白鵬翔")[0:2])  // get the substring containing the first two characters

It reminds me of working in Python 2.X, except that instead of an actual
unicode type you just have arrays of ints.
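
For anyone who wants to try the byte-versus-rune difference described above, here is a minimal sketch in Go. The helper names runeLen and runePrefix are made up for illustration and are not from the post. utf8.RuneCountInString and utf8.ValidString come from the standard unicode/utf8 package. The sketch assumes the source file is saved as UTF-8, in which case each of the three characters takes three bytes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeLen (hypothetical helper) reports how many runes, not bytes, s holds.
func runeLen(s string) int {
	return utf8.RuneCountInString(s)
}

// runePrefix (hypothetical helper) returns the first n runes of s,
// or all of s if it has fewer than n runes.
func runePrefix(s string, n int) string {
	r := []rune(s)
	if n < 0 {
		n = 0
	}
	if n > len(r) {
		n = len(r)
	}
	return string(r[:n])
}

func main() {
	s := "白鵬翔" // three characters, three UTF-8 bytes each

	fmt.Println(len(s))         // number of bytes
	fmt.Println(runeLen(s))     // number of runes
	fmt.Println(len([]rune(s))) // same count, via a []rune conversion

	// Slicing by byte offsets cuts the first character in half,
	// leaving a string that is no longer valid UTF-8.
	broken := s[0:2]
	fmt.Println(utf8.ValidString(broken))

	// Converting to []rune first keeps whole characters together.
	fmt.Println(runePrefix(s, 2))
}
```

utf8.RuneCountInString counts runes without needing the intermediate []rune slice, so it may be the cheaper choice when only the length is wanted; both approaches give the same answer.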