MIME-Version: 1.0
In-Reply-To: <4DDD7A27.60602@mrabarnett.plus.com>
References: <3d81e2a0-6c86-4f12-a1c4-ce4c736172b6@y31g2000vbp.googlegroups.com> <4DDD5FD2.8040607@mrabarnett.plus.com> <BANLkTik1NyMO8vEfb-+oO_7jLD9B=+ZMRA@mail.gmail.com> <4DDD7A27.60602@mrabarnett.plus.com>
From: Ian Kelly <ian.g.kelly@gmail.com>
Date: Wed, 25 May 2011 19:06:04 -0600
Subject: Re: Python 3.2 bug? Reading the last line of a file
To: python-list@python.org
Content-Type: text/plain; charset=ISO-8859-1
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.2103.1306371996.9059.python-list@python.org>

On Wed, May 25, 2011 at 3:52 PM, MRAB <python@mrabarnett.plus.com> wrote:
> What do you mean by "may include the decoder state in its return value"?
>
> It does make sense that the values returned from tell() won't be in the
> middle of an encoded sequence of bytes.

If you take a look at the source code, tell() returns an integer that
packs decoder state into the bits above the byte offset.  For example:

>>> data = b' ' + '\u0302a'.encode('utf-16')
>>> data
b' \xff\xfe\x02\x03a\x00'
>>> f = open('test.txt', 'wb')
>>> f.write(data)
7
>>> f.close()
>>> f = open('test.txt', 'r', encoding='utf-16')
>>> f.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python32\lib\codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "c:\python32\lib\encodings\utf_16.py", line 61, in _buffer_decode
    codecs.utf_16_ex_decode(input, errors, 0, final)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 6-6: truncated data

The problem, of course, is the initial space: it hides the BOM from the
decoder and misaligns the two-byte code units that follow.  We can try
to seek past it:

>>> f.seek(1)
1
>>> f.read()
'\ufeff\u0302a'

But notice that since we're not reading from the beginning of the
file, the BOM has now been interpreted as data.  However:

>>> f.seek(1 + (2 << 65))
73786976294838206465
>>> f.read()
'\u0302a'

As you can see, instead of trying to read from position
73786976294838206465, it has read from byte position 1 with the decoder
in its "BOM not yet read" state, so the BOM is consumed rather than
returned as data.  Note that I wouldn't recommend doing anything remotely
like this in production code, not least because the value that I
passed into seek() is platform-dependent.  This is just a
demonstration of how the seek() value can include decoder state.
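
To make the packing concrete, here is a minimal sketch.  It assumes the
byte offset occupies the low 64 bits of the value, which matches the
seek() call above but is an implementation detail that can differ by
platform.  split_cookie and POSITION_BITS are names made up for
illustration; they are not part of the io module.  The second half shows
the supported pattern, which is to pass seek() only values that tell()
returned.

```python
import io

# Assumption: the byte offset lives in the low 64 bits of the cookie.
POSITION_BITS = 64


def split_cookie(cookie):
    """Split a seek()/tell() value into (byte_position, state_bits)."""
    position = cookie & ((1 << POSITION_BITS) - 1)
    state_bits = cookie >> POSITION_BITS
    return position, state_bits


# The hand-built value from the seek() call above:
# byte offset 1 plus nonzero decoder-state bits.
print(split_cookie(1 + (2 << 65)))  # (1, 4)

# Supported usage: treat the value from tell() as opaque and round-trip it.
raw = io.BytesIO('\u0302a'.encode('utf-16'))
f = io.TextIOWrapper(raw, encoding='utf-16')
first = f.read(1)    # consumes the BOM and returns the first character
cookie = f.tell()    # opaque; may or may not carry state bits
rest = f.read()
f.seek(cookie)       # restores both the position and the decoder state
assert f.read() == rest
```

Whatever tell() returns here, seeking back to it yields the same
remaining text, which is all the API promises.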

Cheers,
Ian