From: Nobody
Subject: Re: how to avoid leading white spaces
Date: Sat, 04 Jun 2011 20:44:56 +0100
Newsgroups: comp.lang.python

>> Python might be penalized by its use of Unicode here, since a
>> Boyer-Moore table for a full 16-bit Unicode string would need
>> 65536 entries
>
> But is there any need for the Boyer-Moore algorithm to
> operate on characters?
>
> Seems to me you could just as well chop the UTF-16 up
> into bytes and apply Boyer-Moore to them, and it would
> work about as well.

No, because that won't care about alignment. E.g. on a big-endian
architecture, if you search for '\u2345' in the string '\u0123\u4567', it
will find a match (at an offset of 1 byte).
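
The misaligned match described above can be reproduced in a few lines. This is only a sketch: it uses Python 3 syntax, plain bytes.find() stands in for a byte-wise Boyer-Moore search, and find_aligned is a hypothetical helper name, not anything from the standard library. It shows one way a byte-level search could reject hits that do not start on a 2-byte boundary.

```python
# Encode both strings as big-endian UTF-16, without a BOM.
haystack = '\u0123\u4567'.encode('utf-16-be')   # bytes 01 23 45 67
needle = '\u2345'.encode('utf-16-be')           # bytes 23 45

# A plain byte-wise search finds the needle straddling two code units.
byte_offset = haystack.find(needle)

def find_aligned(haystack, needle, unit=2):
    """Byte-wise search that skips hits not aligned to `unit` bytes.

    Hypothetical helper for illustration; returns -1 if there is
    no aligned match.
    """
    start = 0
    while True:
        pos = haystack.find(needle, start)
        if pos == -1 or pos % unit == 0:
            return pos
        start = pos + 1   # misaligned hit, keep looking

aligned_offset = find_aligned(haystack, needle)

# Searching at the character level gives the expected answer.
char_offset = '\u0123\u4567'.find('\u2345')

print(byte_offset, aligned_offset, char_offset)
```

So searching bytes can still work, but only if every candidate match is checked for alignment before it is accepted.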