From: Chris Angelico
Date: Fri, 3 Jun 2011 13:52:03 +1000
Subject: Re: how to avoid leading white spaces
Newsgroups: comp.lang.python

On Fri, Jun 3, 2011 at 1:44 PM, Roy Smith wrote:
> In article ,
>  Chris Torek wrote:
>
>> Python might be penalized by its use of Unicode here, since a
>> Boyer-Moore table for a full 16-bit Unicode string would need
>> 65536 entries (one per possible ord() value).
>
> I'm not sure what you mean by "full 16-bit Unicode string"?  Isn't
> unicode inherently 32 bit?  Or at least 20-something bit?  Things like
> UTF-16 are just one way to encode it.

The size of a Unicode character is like the size of a number. It's not
defined in terms of a maximum. However, Unicode planes 0-2 have all
the defined printable characters, and there are only 17 planes in
total (0-16), so (since each plane is 2^16 characters) that kinda makes
Unicode 18-bit or 21-bit. UTF-16, therefore, uses two 16-bit numbers
(a surrogate pair) to store a 20-bit number for anything outside
plane 0; UCS-2 has no pairs and stops at plane 0. Why do I get the
feeling I've met that before...

Chris Angelico
136E:0100 CD 20         INT 20
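
For anyone who wants to check the plane and bit arithmetic, here is a minimal sketch in Python. It assumes the code space runs from U+0000 to U+10FFFF (which gives planes 0 through 16), and the names plane_count, to_surrogates and from_surrogates are invented for this example; they are not stdlib functions.

```python
# Sketch only: the helper names below are made up for this example.
MAX_CODE_POINT = 0x10FFFF      # assumed top of the Unicode code space
PLANE_SIZE = 1 << 16           # 2^16 code points per plane


def plane_count():
    """Number of planes needed to cover 0..MAX_CODE_POINT."""
    return (MAX_CODE_POINT + 1) // PLANE_SIZE


def to_surrogates(cp):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= MAX_CODE_POINT:
        raise ValueError("only code points outside plane 0 need a pair")
    offset = cp - 0x10000                 # what is left fits in 20 bits
    high = 0xD800 | (offset >> 10)        # top 10 bits
    low = 0xDC00 | (offset & 0x3FF)       # bottom 10 bits
    return high, low


def from_surrogates(high, low):
    """Rebuild the code point from a surrogate pair."""
    return 0x10000 + (((high - 0xD800) << 10) | (low - 0xDC00))


if __name__ == "__main__":
    print(plane_count())                             # number of planes
    print(MAX_CODE_POINT.bit_length())               # bits for a raw code point
    print((MAX_CODE_POINT - 0x10000).bit_length())   # bits carried by a pair
    hi, lo = to_surrogates(0x1F600)
    print(hex(hi), hex(lo))
```

The pair carries the code point minus 0x10000, ten bits in each half, which is where the "two 16-bit numbers for a 20-bit number" comes from.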