From: Benjamin Kaplan
Date: Thu, 28 Mar 2013 13:29:23 -0700
Subject: Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]
To: python-list@python.org
Newsgroups: comp.lang.python

On Thu, Mar 28, 2013 at 10:48 AM, jmfauth wrote:
> On 28 mar, 17:33, Ian Kelly wrote:
>> On Thu, Mar 28, 2013 at 7:34 AM, jmfauth wrote:
>> > The flexible string representation takes the problem from the
>> > other side, it attempts to work with the characters by using
>> > their representations and it (can only) fails...
>>
>> This is false. As I've pointed out to you before, the FSR does not
>> divide characters up by representation. It divides them up by
>> codepoint -- more specifically, by the *bit-width* of the codepoint.
>> We call the internal format of the string "ASCII" or "Latin-1" or
>> "UCS-2" for conciseness and a point of reference, but fundamentally
>> all of the FSR formats are simply byte arrays of *codepoints* -- you
>> know, those things you keep harping on. The major optimization
>> performed by the FSR is to consistently truncate the leading zero
>> bytes from each codepoint when it is possible to do so safely. But
>> regardless of to what extent this truncation is applied, the string is
>> *always* internally just an array of codepoints, and the same
>> algorithms apply for all representations.
>
> -----
>
> You know, we can discuss this ad nauseam. What is important
> is Unicode.
>
> You have transformed Python back in an ascii oriented product.
>
> If Python had implemented Unicode correctly, there would
> be no difference in using an "a", "é", "€" or any character,
> what the narrow builds did.
>
> If I am practically the only one, who speaks/discusses about
> this, I can ensure you, this has been noticed.
>
> Now, it's time to prepare the Asparagus, the "jambon cru"
> and a good bottle of dry white wine.
>
> jmf
>

You have yet to explain how Python's string representation is wrong,
only how it isn't optimal for one specific case. Here's how I
understand it:

1) Strings are sequences of stuff. Generally, we talk about strings as
either sequences of bytes or sequences of characters.

2) Unicode is a standard that assigns a code point to each character.
Therefore, Unicode strings are character strings, not byte strings.

3) Encodings are functions that map characters to bytes. They
typically also define an inverse function that converts from bytes
back to characters.

4) UTF-8 IS NOT UNICODE. It is an encoding, one of those functions I
mentioned in the previous point. It happens to be one of the five
standard encodings that are defined for all characters in the Unicode
standard (the others being the little- and big-endian variants of
UTF-16 and UTF-32).

5) The internal representation of a character string DOES NOT MATTER.
All that matters is that the API presents it as a string of
characters, regardless of the representation. We could implement
character strings by storing the Unicode code points in binary-coded
decimal and it would still be a Unicode character string.

6) The String type that .NET and Java (and the unicode type in Python
narrow builds) use is not a character string. It is a string of
shorts, each of which corresponds to a UTF-16 code unit. I know this
is the case because in all of these, the length of a string containing
just U+1F435 (written "\U0001F435" in Python) is 2 even though it
consists of only one character.

7) The new string representation in Python 3.3 can successfully
represent all characters in the Unicode standard. The actual number of
bytes that each character consumes is invisible to the user.
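
For what it's worth, the claims above about encodings and string
length can be checked with a short script. This is only a sketch,
assuming Python 3.3 or later; the helper names utf16_code_units and
round_trips are made up for illustration and are not part of any
library.

```python
# Sketch only: assumes Python 3.3+ (flexible string representation).
# utf16_code_units and round_trips are hypothetical helpers,
# not standard-library functions.

MONKEY = "\U0001F435"  # U+1F435 MONKEY FACE, outside the Basic Multilingual Plane


def utf16_code_units(text):
    """Count the 16-bit units needed to store text as UTF-16."""
    # Use the little-endian variant so no byte-order mark is prepended.
    return len(text.encode("utf-16-le")) // 2


def round_trips(text, codec):
    """Return True if decoding undoes encoding for this codec."""
    return text.encode(codec).decode(codec) == text


if __name__ == "__main__":
    for ch in ("a", "\u00e9", "\u20ac", MONKEY):
        # len() counts code points; the byte counts depend on the encoding.
        print(
            "U+%04X" % ord(ch),
            "len:", len(ch),
            "utf-8 bytes:", len(ch.encode("utf-8")),
            "utf-16 units:", utf16_code_units(ch),
            "round trip ok:", round_trips(ch, "utf-8"),
        )
```

On 3.3 or later, the row for U+1F435 should report a len of 1 but 2
UTF-16 units, and 2 is the number a narrow build, Java or .NET would
give as the string's length.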