


Re: Chardet, file, ... and the Flexible String Representation

Date Mon, 09 Sep 2013 12:38:26 -0400
From Ned Batchelder <ned@nedbatchelder.com>
Newsgroups comp.lang.python
Message-ID <mailman.184.1378744709.5461.python-list@python.org> (permalink)



On 9/9/13 10:28 AM, wxjmfauth@gmail.com wrote:
> Le vendredi 6 septembre 2013 17:46:14 UTC+2, Piet van Oostrum a écrit :
>> wxjmfauth@gmail.com writes:
>>
>>> The Flexible String Representation has conceptually to
>>> face the same problem. It splits "unicode" in chunks and
>>> it has to solve two problems at the same time, the coding
>>> and the handling of multiple "char sets". The problem?
>>> It fails.
>>> "This poor Flexible String Representation does not succeed
>>> in solving the problem it creates itself."
>>
>> The FSR does not split Unicode into chunks. It does not create problems and therefore it doesn't have to solve this.
>>
>> The FSR simply stores a Unicode string as an array[*] of ints (the Unicode code points of the characters of the string). That's it. Then it uses a memory-efficient way to store this array of ints. But that has nothing to do with character sets. The same principle could be used for any array of ints.
>>
>> So you are seeking problems where there are none. And you would have a lot more peace of mind if you stopped doing this.
>>
>> [*] array in the C sense.
>>
>> -- 
>> Piet van Oostrum <piet@vanoostrum.org>
>> WWW: http://pietvanoostrum.com/
>> PGP key: [8DAE142BE17999C4]
> ----------
>
>
> Due to its nature, a character can't be handled in the
> same way as another type. That's the purpose of the UTF.
>
> -----
>
> Chunk latin-1, performance
>
> ref:
> >>> timeit.timeit("a = 'hundred'; 'x' in a")
> 0.13144639994075646
>
> >>> timeit.timeit("a = 'hundrez'; 'x' in a")
> 0.13780295544393084
>
> Chunk ucs2, performance
>
> >>> timeit.timeit("a = 'hundre€'; 'x' in a")
> 0.23505392241617074
>
> Chunk ucs4, performance
>
> >>> timeit.timeit("a = 'hundre\U0001d11e'; 'x' in a")
> 0.26266673650735584
>
> Comment: Such differences never happen with utf.
>
> -----
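
The timings quoted above build the string inside the timed statement, so they measure the assignment as well as the search. A minimal sketch that times only the `in` test, assuming CPython 3.3 or later (the names `haystacks` and `time_search` are made up for illustration, and the numbers will differ from machine to machine):

```python
import timeit

# One sample string per internal width discussed above.
haystacks = {
    "latin-1": "hundred",
    "ucs2": "hundre\u20ac",
    "ucs4": "hundre\U0001d11e",
}

def time_search(haystack, number=100000):
    # Hypothetical helper: the string is built once in setup,
    # so only the substring test itself is timed.
    return timeit.timeit("'x' in a", setup="a = %r" % haystack, number=number)

timings = {kind: time_search(text) for kind, text in haystacks.items()}
for kind, seconds in timings.items():
    print(kind, round(seconds, 4))
```

Whatever the numbers turn out to be, they describe one implementation's speed, not its correctness.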
>
> Chunk latin-1, memory
>
> >>> sys.getsizeof('a')
> 26
>
> Chunk ucs2, memory
>
> >>> sys.getsizeof('€')
> 40
>
> Comment: 14 bytes more than latin-1
>
> Chunk ucs4, memory
>
> >>> sys.getsizeof('\U0001d11e')
> 44
>
> Comment: 18 bytes more than latin-1
>
> Comment: With utf, a char (in string or not) never exceeds 4
> bytes.
>
> -----
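
The per-character cost of each representation can be measured directly instead of being read off one-character strings. A minimal sketch, assuming CPython 3.3 or later (the helper name `bytes_per_char` is made up for illustration; absolute sizes differ between builds and versions, so only differences are compared):

```python
import sys

def bytes_per_char(ch, n=100):
    # Hypothetical helper: grow a string by one character and see how
    # many bytes the object gains. This is CPython-specific behaviour.
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

samples = [
    ("ASCII", "a"),
    ("Latin-1", "\xe9"),
    ("BMP", "\u20ac"),
    ("astral", "\U0001d11e"),
]
widths = {label: bytes_per_char(ch) for label, ch in samples}
for label, width in widths.items():
    print(label, width)
```

On the CPython versions I am aware of, this gives 1, 1, 2 and 4 bytes per character. So the 4-byte ceiling holds here too. The extra bytes in the one-character measurements come mostly from fixed per-object overhead.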
>
> 'a' + '€' in utf, conceptually
>
> Concatenate the *unicode transformation units*.
> Some kind of a real direct 'a' + '€'.
>
>
> 'a' + '€' in FSR, conceptually
>
> 1) Check the "internal coding" of 'a'
> 2) Check the "internal coding" of '€'
> 3) Compare these codings
>
> 4a) If they match, concatenate the bytes
>
> 4b) If they do not match
> 	5) Reencode the string which has to
> 	6) Concatenate
> 	7) Set the "internal coding" status for
> 	further processing
>
> -----
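
Whatever steps are taken internally, the visible result of the concatenation is the same sequence of code points. A small sketch that checks this, and also checks that concatenating the UTF-8 encoded parts gives the same bytes (the function name `concat_code_points` is made up for illustration):

```python
def concat_code_points(left, right):
    # Concatenate two strings and list the code points of the result.
    return [ord(c) for c in left + right]

mixed = concat_code_points("a", "\u20ac")
print(mixed)

# Encoding after concatenating and concatenating after encoding agree.
joined_then_encoded = ("a" + "\u20ac").encode("utf-8")
encoded_then_joined = "a".encode("utf-8") + "\u20ac".encode("utf-8")
print(joined_then_encoded == encoded_then_joined)
```

Nothing in the result reveals which internal width either operand used.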
>
> Complicated and full of side effects, e.g.:
>
> >>> sys.getsizeof('a')
> 26
> >>> sys.getsizeof('aé')
> 39
>
> Isn't a latin-1 "é" supposed to count the same as a latin-1 "a"?
>
> ----
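
One way to check whether "é" really costs more than "a" is to compare strings of equal length. A minimal sketch, assuming CPython 3.3 or later (the helper name `header_gap` is made up for illustration, and the size of the gap depends on the build):

```python
import sys

def header_gap(n):
    # Hypothetical helper: size difference between n non-ASCII Latin-1
    # characters and n ASCII characters.
    return sys.getsizeof("\xe9" * n) - sys.getsizeof("a" * n)

# Start at 2 to stay clear of CPython's cached one-character strings.
gaps = {header_gap(n) for n in range(2, 200)}
print(gaps)
```

If the set holds a single value, the gap does not grow with length, so each "é" costs the same one byte as each "a". As I understand PEP 393, the constant part is the larger header CPython uses for non-ASCII strings.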
>
> I picked random methods; there may be variations, but basically
> this general behaviour is always to be expected.
>
>
> jmf
>

jmf, thanks for your reply.  You've calmed my fears that there is 
something wrong with the Flexible String Representation.  None of the 
examples you show demonstrate any behavior contrary to the Unicode spec.
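
For comparison, the encoding forms defined by Unicode also spend different numbers of bytes on different characters. A minimal sketch (the names `samples` and `encoded_lengths` are made up for illustration):

```python
samples = ["a", "\xe9", "\u20ac", "\U0001d11e"]

def encoded_lengths(ch):
    # Bytes needed for one character in each Unicode encoding form.
    # Little-endian variants are used so no byte order mark is added.
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

table = {ch: encoded_lengths(ch) for ch in samples}
for ch, lengths in table.items():
    print("U+%04X" % ord(ch), lengths)
```

UTF-8 uses 1, 2, 3 and 4 bytes for these four characters, UTF-16 uses 2, 2, 2 and 4, and UTF-32 uses 4 for each. Size varying with content is not peculiar to the FSR.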

--Ned.
