


Re: Python 3.2 bug? Reading the last line of a file

References <3d81e2a0-6c86-4f12-a1c4-ce4c736172b6@y31g2000vbp.googlegroups.com> <4DDD5FD2.8040607@mrabarnett.plus.com> <BANLkTik1NyMO8vEfb-+oO_7jLD9B=+ZMRA@mail.gmail.com> <4DDD7A27.60602@mrabarnett.plus.com>
From Ian Kelly <ian.g.kelly@gmail.com>
Date 2011-05-25 19:06 -0600
Subject Re: Python 3.2 bug? Reading the last line of a file
Newsgroups comp.lang.python
Message-ID <mailman.2103.1306371996.9059.python-list@python.org> (permalink)



On Wed, May 25, 2011 at 3:52 PM, MRAB <python@mrabarnett.plus.com> wrote:
> What do you mean by "may include the decoder state in its return value"?
>
> It does make sense that the values returned from tell() won't be in the
> middle of an encoded sequence of bytes.

If you take a look at the source code, tell() returns an integer that
includes decoder state data in the upper bytes.  For example:

>>> data = b' ' + '\u0302a'.encode('utf-16')
>>> data
b' \xff\xfe\x02\x03a\x00'
>>> f = open('test.txt', 'wb')
>>> f.write(data)
7
>>> f.close()
>>> f = open('test.txt', 'r', encoding='utf-16')
>>> f.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python32\lib\codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "c:\python32\lib\encodings\utf_16.py", line 61, in _buffer_decode
    codecs.utf_16_ex_decode(input, errors, 0, final)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 6-6: truncated data

The problem, of course, is the initial space: it shifts every UTF-16
code unit by one byte, so the decoder never sees the BOM where it
expects it.  We can try to seek past it:

>>> f.seek(1)
1
>>> f.read()
'\ufeff\u0302a'

But notice that since we're not reading from the beginning of the
file, the BOM has now been interpreted as data.  However:

>>> f.seek(1 + (2 << 65))
73786976294838206465
>>> f.read()
'\u0302a'

And you can see that instead of reading from position
73786976294838206465 it has read from position 1 starting in the "read
a BOM" state.  Note that I wouldn't recommend doing anything remotely
like this in production code, not least because the value that I
passed into seek() is platform-dependent.  This is just a
demonstration of how the seek() value can include decoder state.
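
For anyone who wants to poke at this, here is a rough sketch of both
halves of the point.  split_cookie() and reread_from_saved_cookie() are
names I made up for illustration, not part of the io module.  The bit
layout in split_cookie() is an assumption based on my reading of the
CPython source (byte position in the low 64 bits, decoder flags just
above); it is an implementation detail and may not hold on every
platform or version.  The second function shows the supported usage:
treat whatever tell() returns as opaque and only ever hand it back to
seek().

```python
import os
import tempfile

# Assumption (CPython implementation detail, not a documented API):
# the low 64 bits hold the byte position, the decoder flags sit above them.
_POS_BITS = 64


def split_cookie(cookie):
    """Hypothetical helper: split a seek()/tell() value into
    (byte position, decoder flags) under the assumed layout."""
    position = cookie & ((1 << _POS_BITS) - 1)
    dec_flags = (cookie >> _POS_BITS) & 0xFFFFFFFF
    return position, dec_flags


def reread_from_saved_cookie(text, encoding="utf-16"):
    """Write text to a temporary file, read one line, remember tell(),
    read the rest, then seek back to the remembered value and read again."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.txt")
        # newline="" keeps the written bytes identical across platforms.
        with open(path, "w", encoding=encoding, newline="") as f:
            f.write(text)
        with open(path, "r", encoding=encoding) as f:
            first = f.readline()
            cookie = f.tell()      # opaque: do not do arithmetic on this
            rest = f.read()
            f.seek(cookie)         # restores position and decoder state
            rest_again = f.read()
    return first, rest, rest_again


if __name__ == "__main__":
    # The value passed to seek() above: position 1 plus some decoder flags.
    print(split_cookie(1 + (2 << 65)))  # (1, 4)
    print(reread_from_saved_cookie("one\ntwo\n"))
```

The arithmetic in split_cookie() only explains why the huge number
above still ends up reading from byte 1.  Code that needs to return to
a place in a text file should do what reread_from_saved_cookie() does
and never build the seek() argument by hand.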

Cheers,
Ian
