From: random832@fastmail.us
Newsgroups: comp.lang.python
Subject: Re: Chardet, file, ... and the Flexible String Representation
Date: Tue, 10 Sep 2013 11:36:33 -0400

On Mon, Sep 9, 2013, at 10:28, wxjmfauth@gmail.com wrote:
*time performance differences*
>
> Comment: Such differences never happen with utf.

Why is this bad?
Keeping in mind that otherwise they would all be almost as slow as the
UCS-4 case.

> >>> sys.getsizeof('a')
> 26
> >>> sys.getsizeof('€')
> 40
> >>> sys.getsizeof('\U0001d11e')
> 44
>
> Comment: 18 bytes more than latin-1
>
> Comment: With utf, a char (in string or not) never exceed 4

A string is an object and needs to store the length, along with any
overhead relating to object headers. I believe there is also an appended
null character. Also, ASCII strings are stored differently from Latin-1
strings.

>>> sys.getsizeof('a'*999)
1048
= 49 bytes overhead, 1 byte per character.
>>> sys.getsizeof('\xa4'*999)
1072
= 73 bytes overhead, 1 byte per character.
>>> sys.getsizeof('\u20ac'*999)
2072
= 74 bytes overhead, 2 bytes per character.
>>> sys.getsizeof('\U0001d11e'*999)
4072
= 76 bytes overhead, 4 bytes per character.

(I bet sys.getsizeof('\xa4') will return 38 on your system, so 44 is
only six bytes more, not 18)

If we did not have the FSR, everything would be 4 bytes per character.
We might have less overhead, but a Latin-1 string only has to be 25
characters long before the savings from the shorter representation
outweigh even having _no_ overhead (73 + n < 4n once n >= 25), and every
three bytes of overhead in the fixed-width string reduces that number by
one. And you have a 32-bit python build, which has less overhead than
mine - in yours, ASCII strings only have to be nine characters long for
the FSR to be worth it (25 + n < 4n once n >= 9).

Assume the minimum possible overhead is two words for the object header,
a size, and a pointer - i.e. sixteen bytes, compared to the 25 you've
demonstrated for ASCII, and strings only need to be _four_ characters
long for the FSR to be a better deal than always using UCS4 strings
(25 + n < 16 + 4n once n >= 4).

The need for four-byte-per-character strings would not go away by
eliminating the FSR, so you're basically saying that everything should
be constrained to the worst-case performance scenario.
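
For anyone who wants to redo the arithmetic on their own build, here is a rough sketch. The helper names per_char_and_overhead and break_even are made up for this post, and the overhead figures depend on the Python version and on whether the build is 32-bit or 64-bit, so no measured values are promised. It assumes CPython 3.3 or later, and it counts the FSR as "worth it" once the total size is strictly smaller than a fixed 4-bytes-per-character string.

```python
import sys


def per_char_and_overhead(ch, n=999):
    """Split sys.getsizeof(ch * n) into (bytes per character, fixed overhead).

    Hypothetical helper: the overhead it reports includes the object
    header and whatever terminator the implementation appends.
    """
    # The size difference between two adjacent lengths is the per-character cost.
    per_char = sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)
    overhead = sys.getsizeof(ch * n) - per_char * n
    return per_char, overhead


def break_even(fsr_overhead, fixed_overhead=0, fsr_width=1, fixed_width=4):
    """Smallest n with fsr_overhead + fsr_width*n < fixed_overhead + fixed_width*n."""
    gap = fsr_overhead - fixed_overhead
    if gap < 0:
        return 0  # the compact form is already smaller for the empty string
    return gap // (fixed_width - fsr_width) + 1


# Measured on whatever interpreter runs this; values vary by build.
for label, ch in [("ASCII", "a"), ("Latin-1", "\xa4"),
                  ("BMP", "\u20ac"), ("astral", "\U0001d11e")]:
    width, overhead = per_char_and_overhead(ch)
    print(label, "bytes/char:", width, "overhead:", overhead)

# Break-even lengths for the overhead figures discussed in this thread.
print(break_even(73))       # 25: 73 bytes of overhead vs. none at all
print(break_even(25))       # 9:  25 bytes of overhead vs. none at all
print(break_even(25, 16))   # 4:  25 bytes of overhead vs. a 16-byte minimum
```

The divisor is three rather than four because the compact string still pays one byte per character, so each extra character only saves three bytes relative to UCS-4.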