From: Chris Torek
Newsgroups: comp.lang.python
Subject: Re: how to avoid leading white spaces
Date: 9 Jun 2011 02:32:08 GMT
Organization: None of the Above

>On 03/06/2011 03:58, Chris Torek wrote:
>>> -------------------------------------------------
>> This is a bit surprising, since both "s1 in s2" and re.search()
>> could use a Boyer-Moore-based algorithm for a sufficiently-long
>> fixed string, and the time required should be proportional to that
>> needed to set up the skip table.  The re.compile() gets to re-use
>> the table every time.

In article Ian wrote:
>Is that true?  My immediate thought is that Boyer-Moore would quickly give
>the number of characters to skip, but skipping them would be slow because
>UTF8 encoded characters are variable sized, and the string would have to be
>walked anyway.

As I understand it, strings in python 3 are Unicode internally and
(apparently) use wchar_t.  Byte strings in python 3 are of course
byte strings, not UTF-8 encoded.

>Or am I misunderstanding something.

Here's python 2.7 on a Linux box:

    >>> import sys
    >>> print sys.getsizeof('a'), sys.getsizeof('ab'), sys.getsizeof('abc')
    38 39 40
    >>> print sys.getsizeof(u'a'), sys.getsizeof(u'ab'), sys.getsizeof(u'abc')
    56 60 64

This implies that strings in Python 2.x are just byte strings (same
as b"..." in Python 3.x) and never actually contain unicode; and
unicode strings (same as "..." in Python 3.x) use 4-byte "characters"
per that box's wchar_t.
-- 
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W)  +1 801 277 2603
email: gmail (figure it out)      http://web.torek.net/torek/index.html
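
The arithmetic behind the getsizeof comparison above (39 - 38 = 1 byte per
added character for a byte string, 60 - 56 = 4 bytes for a unicode string)
can be repeated on any interpreter. What follows is only a rough sketch,
written for Python 3 rather than the 2.7 session quoted in the post; the
helper name bytes_per_unit and its parameter n are made up for illustration
and are not part of any library.

```python
import sys

def bytes_per_unit(unit, n=8):
    """Estimate how many bytes one more copy of `unit` adds.

    `unit` is a one-element str or bytes object (hypothetical helper,
    for illustration only).  The fixed per-object header is the same
    for both operands, so it cancels out in the subtraction.
    """
    return sys.getsizeof(unit * (n + 1)) - sys.getsizeof(unit * n)

# Byte strings grow by one byte per element.
print("bytes          :", bytes_per_unit(b"a"))
# Text strings: the growth depends on the interpreter's internal layout.
print("str, ASCII     :", bytes_per_unit("a"))
print("str, non-BMP   :", bytes_per_unit("\U0001F600"))
```

The numbers for str depend on the interpreter. On CPython 3.3 and later the
internal width is chosen per string (PEP 393), so an all-ASCII string will
generally not show the fixed 4-byte step seen in the 2.7 output above, while
a string holding a character outside the Basic Multilingual Plane will.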