From: Antoon Pardon
Date: Sun, 28 Jul 2013 21:55:50 +0200
To: python-list@python.org
Subject: Re: FSR and unicode compliance - was Re: RE Module Performance
Newsgroups: comp.lang.python

On 28-07-13 21:23, wxjmfauth@gmail.com wrote:
> On Sunday, 28 July 2013 17:52:47 UTC+2, Michael Torrie wrote:
>> On 07/27/2013 12:21 PM, wxjmfauth@gmail.com wrote:
>>> Good point. FSR, nice tool for those who wish to teach
>>> Unicode. It is not every day, one has such an opportunity.
>>
>> I had a long e-mail composed, but decided to chop it down, but still too
>> long. so I ditched a lot of the context, which jmf also seems to do.
>> Apologies.
>>
>> 1. FSR *is* UTF-32 so it is as unicode compliant as UTF-32, since UTF-32
>> is an official encoding. FSR only differs from UTF-32 in that the
>> padding zeros are stripped off such that it is stored in the most
>> compact form that can handle all the characters in string, which is
>> always known at string creation time. Now you can argue many things,
>> but to say FSR is not unicode compliant is quite a stretch! What
>> unicode entities or characters cannot be stored in strings using FSR?
>> What sequences of bytes in FSR result in invalid Unicode entities?
>>
>> 2. strings in Python *never change*. They are immutable. The +
>> operator always copies strings character by character into a new string
>> object, even if Python had used UTF-8 internally. If you're doing a lot
>> of string concatenations, perhaps you're using the wrong data type. A
>> byte buffer might be better for you, where you can stuff utf-8 sequences
>> into it to your heart's content.
>>
>> 3. UTF-8 and UTF-16 encodings, being variable width encodings, mean that
>> slicing a string would be very very slow, and that's unacceptable for
>> the use cases of python strings. I'm assuming you understand big O
>> notation, as you talk of experience in many languages over the years.
>> FSR and UTF-32 both are O(1) for slicing and lookups. UTF-8, 16 and any
>> variable-width encoding are always O(n). A lot slower!
>>
>> 4. Unicode is, well, unicode. You seem to hop all over the place from
>> talking about code points to bytes to bits, using them all
>> interchangeably. And now you seem to be claiming that a particular byte
>> encoding standard is by definition unicode (UTF-8). Or at least that's
>> how it sounds. And also claim FSR is not compliant with unicode
>> standards, which appears to me to be completely false.
>>
>> Is my understanding of these things wrong?
>
> ------
>
> Compare these (a BDFL example, where I'm using a non-ascii char)
>
> Py 3.2 (narrow build)
>
> >>> timeit.timeit("a = 'hundred'; 'x' in a")
> 0.09897159682121348
> >>> timeit.timeit("a = 'hundre€'; 'x' in a")
> 0.09079501961732461
> >>> sys.getsizeof('d')
> 32
> >>> sys.getsizeof('€')
> 32
> >>> sys.getsizeof('dd')
> 34
> >>> sys.getsizeof('d€')
> 34
>
>
> Py3.3
>
> >>> timeit.timeit("a = 'hundred'; 'x' in a")
> 0.12183182740848858
> >>> timeit.timeit("a = 'hundre€'; 'x' in a")
> 0.2365732969632326
> >>> sys.getsizeof('d')
> 26
> >>> sys.getsizeof('€')
> 40
> >>> sys.getsizeof('dd')
> 27
> >>> sys.getsizeof('d€')
> 42
>
> Tell me which one seems to be more "unicode compliant"?

I can't tell. You give no relevant information on which one could
decide this question.

> The goal of Unicode is to handle every char "equally".

Not at this level of detail. You are looking at implementation
details that are irrelevant to compliance.

> Now, the problem: memory. Do not forget that à la "FSR"
> mechanism for a non-ascii user is *irrelevant*.
> As soon as one uses one single non-ascii, your ascii feature
> is lost. (That's why we have all these dedicated coding
> schemes, utfs included).

So? Why should that trouble me? As far as I understand, whether I
have an ascii string or not is totally irrelevant to the application
programmer. Within the application I just process strings and let the
programming environment keep track of these details in a transparent
way. That only stops being true once you start looking at things like
getsizeof, which gives you implementation details that are mostly
irrelevant in deciding whether the behaviour is compliant or not.

> >>> sys.getsizeof('abc' * 1000 + 'z')
> 3026
> >>> sys.getsizeof('abc' * 1000 + '\U00010010')
> 12044
>
> A bit secret. The larger a repertoire of characters
> is, the more bits you need.
> Secret #2. You can not escape from this.

And totally unimportant for deciding compliance.

--
Antoon Pardon
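
To make the distinction between storage and behaviour concrete, here is a minimal sketch, assuming CPython 3.3 or later (where strings use the flexible representation discussed in this thread). The helper names bytes_per_char and behaves_like_text are made up for illustration and do not come from the thread. The first helper measures how many extra bytes one more copy of a character costs, which is an implementation detail. The second checks the things an application actually relies on: length, indexing, slicing and membership.

```python
import sys


def bytes_per_char(ch: str) -> int:
    """Extra bytes needed for one more copy of `ch`.

    This is a CPython implementation detail (assumed: CPython >= 3.3),
    not something the language or the Unicode standard promises.
    """
    return sys.getsizeof(ch * 1001) - sys.getsizeof(ch * 1000)


def behaves_like_text(ch: str) -> bool:
    """Check the observable string behaviour an application depends on."""
    s = "abc" * 1000 + ch
    return (
        len(s) == 3001        # length counts code points, not bytes
        and s[-1] == ch       # indexing returns the code point
        and s[:3] == "abc"    # slicing is unaffected by the last char
        and ch in s           # membership works the same way
    )


# One ASCII char, one BMP char (euro sign), one astral char.
samples = {"ascii": "d", "bmp": "\u20ac", "astral": "\U00010010"}

widths = {name: bytes_per_char(ch) for name, ch in samples.items()}
semantics_ok = all(behaves_like_text(ch) for ch in samples.values())

print(widths)
print(semantics_ok)
```

On such a build the per-character cost should differ between the three samples, while the behavioural checks come out the same for all of them. The first is what getsizeof shows; only the second has any bearing on whether strings behave correctly.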