Path: csiph.com!x330-a1.tempe.blueboxinc.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!border3.nntp.dca.giganews.com!border1.nntp.dca.giganews.com!nntp.giganews.com!transit3.readnews.com!news-out.readnews.com!news-xxxfer.readnews.com!panix!roy From: Roy Smith Newsgroups: comp.lang.python Subject: Re: how to avoid leading white spaces Date: Thu, 02 Jun 2011 23:44:40 -0400 Organization: PANIX Public Access Internet and UNIX, NYC Lines: 24 Message-ID: References: <9e861b0e-e768-401b-b5ca-190f20830a08@s9g2000yqm.googlegroups.com> <94ph22FrhvU5@mid.individual.net> NNTP-Posting-Host: localhost X-Trace: reader1.panix.com 1307072682 19278 127.0.0.1 (3 Jun 2011 03:44:42 GMT) X-Complaints-To: abuse@panix.com NNTP-Posting-Date: Fri, 3 Jun 2011 03:44:42 +0000 (UTC) User-Agent: MT-NewsWatcher/3.5.3b3 (Intel Mac OS X) Xref: x330-a1.tempe.blueboxinc.net comp.lang.python:6907 In article , Chris Torek wrote: > Python might be penalized by its use of Unicode here, since a > Boyer-Moore table for a full 16-bit Unicode string would need > 65536 entries (one per possible ord() value). I'm not sure what you mean by "full 16-bit Unicode string"? Isn't unicode inherently 32 bit? Or at least 20-something bit? Things like UTF-16 are just one way to encode it. In any case, while I could imagine building a 2^16 entry jump table, clearly it's infeasible (with today's hardware) to build a 2^32 entry table. But, there's nothing that really requires you to build a table at all. If I understand the algorithm right, all that's really required is that you can map a character to a shift value. For an 8 bit character set, an indexed jump table makes sense. For a larger character set, I would imagine you would do some heuristic pre-processing to see if your search string consisted only of characters in one unicode plane and use that fact to build a table which only indexes that plane. Or, maybe use a hash table instead of a regular indexed table. Not as fast, but only slower by a small constant factor, which is not a horrendous price to pay in a fully i18n world :-)