Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder1.news.weretis.net!feeder.erje.net!eu.feeder.erje.net!newsfeed.freenet.ag!news2.euro.net!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
Date: Thu, 28 Mar 2013 21:56:05 -0700
From: Ethan Furman <ethan@stoneleaf.us>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130106 Thunderbird/17.0.2
MIME-Version: 1.0
To: python-list@python.org
Subject: unicode and the FSR [was: Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]]
References: <mailman.3703.1364248275.2939.python-list@python.org> <a52fbe9d-db14-4ed2-bb49-adfb4b56f973@k4g2000yqn.googlegroups.com> <mailman.3771.1364324590.2939.python-list@python.org> <0b779c80-4f50-4716-8c30-47755c15f304@m12g2000yqp.googlegroups.com> <kit1kg$g2u$1@ger.gmane.org> <nad-98F0A4.17004226032013@news.gmane.org> <kitdqr$4m4$2@ger.gmane.org> <nad-8CB9C0.18315026032013@news.gmane.org> <mailman.3805.1364385073.2939.python-list@python.org> <5153a12d$0$29998$c3e8da3$5496439d@news.astraweb.com> <mailman.3845.1364441182.2939.python-list@python.org> <d2cc443a-e049-42ed-abc6-66b5ea600fe7@j1g2000pbq.googlegroups.com> <mailman.3860.1364451682.2939.python-list@python.org> <987c4bd9-0e5e-4387-9c78-1075a77d3c47@c6g2000yqh.googlegroups.com> <mailman.3863.1364463394.2939.python-list@python.org> <rOednY4OeOjbqcnMnZ2dnUVZ_oWdnZ2d@westnet.com.au> <51543f45$0$29998$c3e8da3$5496439d@news.astraweb.com> <-LGdnWTpyKcdkcjMnZ2dnUVZ_jCdnZ2d@westnet.com.au>
In-Reply-To: <-LGdnWTpyKcdkcjMnZ2dnUVZ_jCdnZ2d@westnet.com.au>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.3936.1364533284.2939.python-list@python.org>
Lines: 30
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:42222

On 03/28/2013 08:34 PM, Neil Hodgson wrote:
> Steven D'Aprano:
>
>> Any string method that takes a starting offset requires the method to
>> walk the string byte-by-byte. I've even seen languages put responsibility
>> for dealing with that onto the programmer: the "start offset" is given in
>> *bytes*, not characters. I don't remember what language this was... it
>> might have been Haskell? Whatever it was, it horrified me.
>
>     It doesn't horrify me - I've been working this way for over 10 years and it seems completely natural.

Horrifying or not, I am willing to give up a small amount of speed for correctness.  Heck, I'm willing to give up a lot 
of speed for correctness.  Once I have my slow but correct prototype going I can recode in a faster language (if needed) 
and compare its blazingly fast output with my slowly-generated but known-good output.

>  You can wrap
> access in iterators that hide the byte offsets if you like. This then ensures that all operations on those iterators are
> safe only allowing the iterator to point at the start/end of valid characters.

Sure.  Or I can let Python handle it for me.
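
A rough sketch of what I mean, assuming Python 3 and UTF-8 as the byte encoding being compared against. The helper name char_and_byte_offset is made up for illustration; it is not part of any library.

```python
def char_and_byte_offset(text, needle):
    """Return (code-point index, UTF-8 byte offset) of needle in text."""
    char_index = text.index(needle)
    byte_offset = text.encode("utf-8").index(needle.encode("utf-8"))
    return char_index, byte_offset

text = "café: 10 €"
char_index, byte_offset = char_and_byte_offset(text, "€")

# str indexing counts characters, so this always lands on a whole character.
assert text[char_index] == "€"

# The byte offset is larger because "é" takes two bytes in UTF-8.
assert byte_offset > char_index

# Slicing the raw bytes at an arbitrary offset can split a character in half.
try:
    text.encode("utf-8")[:4].decode("utf-8")
except UnicodeDecodeError:
    print("byte offset 4 falls inside the encoding of 'é'")
```

With str I never have to think about that last case.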


>     The counter-problem is that a French document that needs to include one mathematical symbol (or emoji) outside
> Latin-1 will double in size as a Python string.

True.  But how often do you have the entire document as a single string?  Use readlines() instead of read().  Besides, 
memory is cheap.
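
For the curious, here is a quick way to see the effect. This is only a sketch: it assumes CPython 3.3 or later (where the FSR lives), the sample sentences are invented, and sys.getsizeof numbers are implementation details that vary between versions, so I compare sizes rather than quote them.

```python
import sys

def measure_sizes(line, special_line, count):
    """Compare one big string against per-line strings (CPython-specific)."""
    lines = [line] * count + [special_line]
    whole = "".join(lines)              # one string holding the whole document
    latin1_only = line * count          # same text without the special line
    return {
        "whole": sys.getsizeof(whole),
        "latin1_only": sys.getsizeof(latin1_only),
        # The list repeats one object; summing per entry models what
        # separate line strings of the same length would cost.
        "per_line": sum(sys.getsizeof(s) for s in lines),
    }

# Every character here fits in Latin-1.
line = "Voilà une ligne de texte en français. " * 5 + "\n"
# U+2211 (the summation sign) does not fit in Latin-1.
special_line = "La somme \u2211 sort de Latin-1.\n"

sizes = measure_sizes(line, special_line, 1000)
for name, size in sizes.items():
    print(name, size)
```

On CPython the whole-document string should come out at roughly twice the Latin-1-only one, while the per-line total stays well below it because only the one line is widened. Each string carries some fixed overhead, so with very short lines the per-line total can end up larger instead.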

--
~Ethan~