Python 3.2 bug? Reading the last line of a file

Started by: "tkpmep@hotmail.com" <tkpmep@hotmail.com>
First post: 2011-05-25 12:33 -0700
Last post: 2011-05-25 19:06 -0600
Articles: 11 (5 participants)


Contents

  Python 3.2 bug? Reading the last line of a file "tkpmep@hotmail.com" <tkpmep@hotmail.com> - 2011-05-25 12:33 -0700
    Re: Python 3.2 bug? Reading the last line of a file MRAB <python@mrabarnett.plus.com> - 2011-05-25 21:00 +0100
    Re: Python 3.2 bug? Reading the last line of a file Ian Kelly <ian.g.kelly@gmail.com> - 2011-05-25 14:54 -0600
    Re: Python 3.2 bug? Reading the last line of a file MRAB <python@mrabarnett.plus.com> - 2011-05-25 22:52 +0100
      Re: Python 3.2 bug? Reading the last line of a file "tkpmep@hotmail.com" <tkpmep@hotmail.com> - 2011-05-25 16:25 -0700
        Re: Python 3.2 bug? Reading the last line of a file Ethan Furman <ethan@stoneleaf.us> - 2011-05-25 16:58 -0700
        Re: Python 3.2 bug? Reading the last line of a file MRAB <python@mrabarnett.plus.com> - 2011-05-26 00:56 +0100
        Re: Python 3.2 bug? Reading the last line of a file Ethan Furman <ethan@stoneleaf.us> - 2011-05-25 17:32 -0700
        Re: Python 3.2 bug? Reading the last line of a file Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2011-05-26 08:09 +0300
          Re: Python 3.2 bug? Reading the last line of a file "tkpmep@hotmail.com" <tkpmep@hotmail.com> - 2011-05-27 12:21 -0700
    Re: Python 3.2 bug? Reading the last line of a file Ian Kelly <ian.g.kelly@gmail.com> - 2011-05-25 19:06 -0600

#6259 — Python 3.2 bug? Reading the last line of a file

From: "tkpmep@hotmail.com" <tkpmep@hotmail.com>
Date: 2011-05-25 12:33 -0700
Subject: Python 3.2 bug? Reading the last line of a file
Message-ID: <3d81e2a0-6c86-4f12-a1c4-ce4c736172b6@y31g2000vbp.googlegroups.com>

The following function, which returns the last line of a file, works
perfectly well under Python 2.7.1 but fails reliably under Python 3.2.
Is this a bug, or am I doing something wrong? Any help would be
greatly appreciated.


import os

def lastLine(filename):
    '''
        Returns the last line of a file
        file.seek takes an optional 'whence' argument which allows you to
        start looking at the end, so you can just work back from there till
        you hit the first newline that has anything after it
        Works perfectly under Python 2.7, but not under 3.2!
    '''
    offset = -50
    with open(filename) as f:
        while offset > -1024:
            offset *= 2
            f.seek(offset, os.SEEK_END)
            lines = f.readlines()
            if len(lines) > 1:
                return lines[-1]

If I execute this with a valid filename fn, I get the following error
message:

>>> lastLine(fn)
Traceback (most recent call last):
  File "<pyshell#12>", line 1, in <module>
    lastLine(fn)
  File "<pyshell#11>", line 13, in lastLine
    f.seek(offset, os.SEEK_END)
io.UnsupportedOperation: can't do nonzero end-relative seeks

Sincerely

Thomas Philips



#6261

From: MRAB <python@mrabarnett.plus.com>
Date: 2011-05-25 21:00 +0100
Message-ID: <mailman.2091.1306353619.9059.python-list@python.org>
In reply to: #6259

On 25/05/2011 20:33, tkpmep@hotmail.com wrote:
> The following function that returns the last line of a file works
> perfectly well under Python 2.71. but fails reliably under Python 3.2.
> Is this a bug, or am I doing something wrong? Any help would be
> greatly appreciated.
>
>
> import os
>
> def lastLine(filename):
>      '''
>          Returns the last line of a file
>          file.seek takes an optional 'whence' argument which allows you to
>          start looking at the end, so you can just work back from there till
>          you hit the first newline that has anything after it
>          Works perfectly under Python 2.7, but not under 3.2!
>      '''
>      offset = -50
>      with open(filename) as f:
>          while offset > -1024:
>              offset *= 2
>              f.seek(offset, os.SEEK_END)
>              lines = f.readlines()
>              if len(lines) > 1:
>                  return lines[-1]
>
> If I execute this with a valid filename fn. I get the following error
> message:
>
>>>> lastLine(fn)
> Traceback (most recent call last):
>    File "<pyshell#12>", line 1, in <module>
>      lastLine(fn)
>    File "<pyshell#11>", line 13, in lastLine
>      f.seek(offset, os.SEEK_END)
> io.UnsupportedOperation: can't do nonzero end-relative seeks
>
You're opening the file in text mode, and seeking relative to the end
of the file is not allowed in text mode, presumably because the file
contents have to be decoded, and, in general, seeking to an arbitrary
position within a sequence of encoded bytes can have undefined results
when you attempt to decode to Unicode starting from that position.
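A minimal sketch of that failure mode, assuming UTF-8 data. The helper name decodes_cleanly and the sample text are made up for illustration; the point is only that a byte offset landing inside a multi-byte character cannot be decoded from.

```python
def decodes_cleanly(data, start, encoding="utf-8"):
    """Return True if data[start:] can be decoded, False otherwise."""
    try:
        data[start:].decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# In UTF-8 the character 'ï' occupies two bytes (0xC3 0xAF),
# stored here at byte offsets 2 and 3.
sample = "naïve café\n".encode("utf-8")

starts_on_character = decodes_cleanly(sample, 2)   # offset 2 is the lead byte
starts_mid_character = decodes_cleanly(sample, 3)  # offset 3 is a continuation byte
```

Offset 2 sits on a character boundary and decodes; offset 3 sits in the middle of 'ï' and does not.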

The strange thing is that you _are_ allowed to seek relative to the
start of the file.

Try opening the file in binary mode and do the decoding yourself,
catching the UnicodeDecodeError exceptions if/when they occur.
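One way this suggestion might look, as a sketch rather than a definitive fix. It assumes the file is UTF-8 (or another encoding in which the newline byte never occurs inside a multi-byte character, which rules out UTF-16), and the names last_line, chunk and the sample file are illustrative.

```python
import os
import tempfile


def last_line(filename, encoding="utf-8", chunk=100):
    """Return the last line of a file, reading backwards in growing chunks."""
    with open(filename, "rb") as f:          # binary mode: end-relative seeks are allowed
        f.seek(0, os.SEEK_END)
        size = f.tell()
        if size == 0:
            return ""
        while True:
            start = max(0, size - chunk)
            f.seek(start)
            lines = f.read().splitlines(keepends=True)
            # Two or more pieces mean the final piece began right after a
            # newline, so it is a complete line. start == 0 means the whole
            # file has been read and there is nothing more to look at.
            if len(lines) > 1 or start == 0:
                return lines[-1].decode(encoding)
            chunk *= 2


# Demo on a throwaway file with Windows-style line endings.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.txt")
    with open(path, "wb") as out:
        out.write(b"a\tb\r\nName\t31/12/2009\t0\t0\t0\r\n")
    result = last_line(path)
    small_chunk_result = last_line(path, chunk=4)  # forces several doublings
```

Only the final line is decoded, so a chunk boundary that falls in the middle of an earlier character does no harm under the stated encoding assumption.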



#6262

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2011-05-25 14:54 -0600
Message-ID: <mailman.2093.1306356898.9059.python-list@python.org>
In reply to: #6259

On Wed, May 25, 2011 at 2:00 PM, MRAB <python@mrabarnett.plus.com> wrote:
> You're opening the file in text mode, and seeking relative to the end
> of the file is not allowed in text mode, presumably because the file
> contents have to be decoded, and, in general, seeking to an arbitrary
> position within a sequence of encoded bytes can have undefined results
> when you attempt to decode to Unicode starting from that position.
>
> The strange thing is that you _are_ allowed to seek relative to the
> start of the file.

I think that with text files seek() is only really meant to be called
with values returned from tell(), which may include the decoder state
in its return value.
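A small sketch of that usage pattern, assuming a UTF-8 text file created just for the demonstration. The variable names are illustrative. The value from tell() is treated as an opaque bookmark and handed back to seek() unchanged, while a nonzero end-relative seek is expected to be refused.

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.txt")  # throwaway file for the demo
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("first\nsecond\nthird\n")

    with open(path, "r", encoding="utf-8") as f:
        f.readline()
        bookmark = f.tell()      # opaque value: do not do arithmetic on it
        rest = f.read()

        f.seek(bookmark)         # going back to a tell() value is supported
        again = f.read()

        try:
            f.seek(-5, os.SEEK_END)   # nonzero end-relative seek in text mode
            end_seek_failed = False
        except io.UnsupportedOperation:
            end_seek_failed = True

        f.seek(0, os.SEEK_END)   # a zero offset from the end is still allowed
        at_end = f.read()
```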



#6268

From: MRAB <python@mrabarnett.plus.com>
Date: 2011-05-25 22:52 +0100
Message-ID: <mailman.2096.1306360361.9059.python-list@python.org>
In reply to: #6259

On 25/05/2011 21:54, Ian Kelly wrote:
> On Wed, May 25, 2011 at 2:00 PM, MRAB <python@mrabarnett.plus.com> wrote:
>> You're opening the file in text mode, and seeking relative to the end
>> of the file is not allowed in text mode, presumably because the file
>> contents have to be decoded, and, in general, seeking to an arbitrary
>> position within a sequence of encoded bytes can have undefined results
>> when you attempt to decode to Unicode starting from that position.
>>
>> The strange thing is that you _are_ allowed to seek relative to the
>> start of the file.
>
> I think that with text files seek() is only really meant to be called
> with values returned from tell(), which may include the decoder state
> in its return value.

What do you mean by "may include the decoder state in its return value"?

It does make sense that the values returned from tell() won't be in the
middle of an encoded sequence of bytes.



#6273

From: "tkpmep@hotmail.com" <tkpmep@hotmail.com>
Date: 2011-05-25 16:25 -0700
Message-ID: <55262a36-ca53-48dd-b563-1847f9442bae@dn9g2000vbb.googlegroups.com>
In reply to: #6268

Thanks for the guidance - it was indeed an issue with reading in
binary vs. text, and I do now succeed in reading the last line,
except that I now seem unable to split it, as I demonstrate below.
Here's what I get when I read the last line in text mode using 2.7.1
and in binary mode using 3.2 respectively under IDLE:

2.7.1
Name	31/12/2009	0	0	0

3.2
b'Name\t31/12/2009\t0\t0\t0\r\n'

If, under 2.7.1, I read the file in text mode and write
>>> x = lastLine(fn)
I can then cleanly split the line to get its contents
>>> x.split('\t')
['Name', '31/12/2009', '0', '0', '0\n']

but under 3.2, with its binary read, I get
>>> x.split('\t')
Traceback (most recent call last):
  File "<pyshell#26>", line 1, in <module>
    x.split('\t')
TypeError: Type str doesn't support the buffer API

If I remove the '\t', the split now works and I get a list of bytes
literals
>>> x.split()
[b'Name', b'31/12/2009', b'0', b'0', b'0']

Looking through the docs did not clarify my understanding of the
issue. Why can I not split on '\t' when reading in binary mode?

Sincerely

Thomas Philips



#6277

From: Ethan Furman <ethan@stoneleaf.us>
Date: 2011-05-25 16:58 -0700
Message-ID: <mailman.2098.1306367183.9059.python-list@python.org>
In reply to: #6273

tkpmep@hotmail.com wrote:
> Thanks for the guidance - it was indeed an issue with reading in
> binary vs. text., and I do now succeed in reading the last line,
> except that I now seem unable to split it, as I demonstrate below.
> Here's what I get when I read the last line in text mode using 2.7.1
> and in binary mode using 3.2 respectively under IDLE:
> 
> 3.2
> b'Name\t31/12/2009\t0\t0\t0\r\n'
> 
> under 3.2, with its binary read, I get
>--> x.split('\t')
> Traceback (most recent call last):
>   File "<pyshell#26>", line 1, in <module>
>     x.split('\t')
> TypeError: Type str doesn't support the buffer API

You are trying to split a bytes object with a str object -- the two are 
not compatible.  Try splitting with the bytes object b'\t'.

~Ethan~



#6280

From: MRAB <python@mrabarnett.plus.com>
Date: 2011-05-26 00:56 +0100
Message-ID: <mailman.2100.1306367807.9059.python-list@python.org>
In reply to: #6273

On 26/05/2011 00:25, tkpmep@hotmail.com wrote:
> Thanks for the guidance - it was indeed an issue with reading in
> binary vs. text., and I do now succeed in reading the last line,
> except that I now seem unable to split it, as I demonstrate below.
> Here's what I get when I read the last line in text mode using 2.7.1
> and in binary mode using 3.2 respectively under IDLE:
>
> 2.7.1
> Name	31/12/2009	0	0	0
>
> 3.2
> b'Name\t31/12/2009\t0\t0\t0\r\n'
>
> if, under 2.7.1 I read the file in text mode and write
>>>> x = lastLine(fn)
> I can then cleanly split the line to get its contents
>>>> x.split('\t')
> ['Name', '31/12/2009', '0', '0', '0\n']
>
> but under 3.2, with its binary read, I get
>>>> x.split('\t')
> Traceback (most recent call last):
>    File "<pyshell#26>", line 1, in <module>
>      x.split('\t')
> TypeError: Type str doesn't support the buffer API
>
> If I remove the '\t', the split now works and I get a list of bytes
> literals
>>>> x.split()
> [b'Name', b'31/12/2009', b'0', b'0', b'0']
>
> Looking through the docs did not clarify my understanding of the
> issue. Why can I not split on '\t' when reading in binary mode?
>
x.split('\t') tries to split on '\t', a string (str), but x is a
bytestring (bytes).

Do x.split(b'\t') instead.



#6282

From: Ethan Furman <ethan@stoneleaf.us>
Date: 2011-05-25 17:32 -0700
Message-ID: <mailman.2101.1306369224.9059.python-list@python.org>
In reply to: #6273

MRAB wrote:
> On 26/05/2011 00:25, tkpmep@hotmail.com wrote:
>> Thanks for the guidance - it was indeed an issue with reading in
>> binary vs. text., and I do now succeed in reading the last line,
>> except that I now seem unable to split it, as I demonstrate below.
>> Here's what I get when I read the last line in text mode using 2.7.1
>> and in binary mode using 3.2 respectively under IDLE:
>>
>> 2.7.1
>> Name    31/12/2009    0    0    0
>>
>> 3.2
>> b'Name\t31/12/2009\t0\t0\t0\r\n'
>>
>> if, under 2.7.1 I read the file in text mode and write
>>>>> x = lastLine(fn)
>> I can then cleanly split the line to get its contents
>>>>> x.split('\t')
>> ['Name', '31/12/2009', '0', '0', '0\n']
>>
>> but under 3.2, with its binary read, I get
>>>>> x.split('\t')
>> Traceback (most recent call last):
>>    File "<pyshell#26>", line 1, in <module>
>>      x.split('\t')
>> TypeError: Type str doesn't support the buffer API
>>
>> If I remove the '\t', the split now works and I get a list of bytes
>> literals
>>>>> x.split()
>> [b'Name', b'31/12/2009', b'0', b'0', b'0']
>>
>> Looking through the docs did not clarify my understanding of the
>> issue. Why can I not split on '\t' when reading in binary mode?
>>
> x.split('\t') tries to split on '\t', a string (str), but x is a
> bytestring (bytes).
> 
> Do x.split(b'\t') instead.

<nitpick>
Instances of the bytes class are more appropriately called 'bytes
objects' rather than 'bytestrings', as they are really immutable
sequences of integers. Accessing a single element of a bytes object
does not return a bytes object, but rather the integer at that
location; i.e.

--> b'xyz'[1]
121

Contrast that with the str type where

--> 'xyz'[1]
'y'
</nitpick>
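A short sketch extending the comparison above for Python 3. The variable names are illustrative. Indexing a bytes object gives an integer, whereas slicing and bytes() give bytes objects back.

```python
data = b"xyz"

item = data[1]           # indexing yields an int: the byte value of 'y'
piece = data[1:2]        # slicing yields a one-byte bytes object
values = list(data)      # iterating yields ints as well
rebuilt = bytes([item])  # bytes() turns a list of ints back into bytes

text_item = "xyz"[1]     # str indexing yields a one-character str
```

A one-byte slice is therefore the closest bytes equivalent of indexing a str.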

~Ethan~



#6296

From: Jussi Piitulainen <jpiitula@ling.helsinki.fi>
Date: 2011-05-26 08:09 +0300
Message-ID: <qot7h9eqc8h.fsf@ruuvi.it.helsinki.fi>
In reply to: #6273

tkpmep@hotmail.com writes:

> Looking through the docs did not clarify my understanding of the
> issue. Why can I not split on '\t' when reading in binary mode?

You can split on b'\t' to get a list of bytes objects, which you can
then decode if you want them as strings.

You can decode the bytes to get a string and then split on '\t' to get
strings.

 >>> b'tic\ttac\ttoe'.split(b'\t')
 [b'tic', b'tac', b'toe']
 >>> b'tic\ttac\ttoe'.decode('utf-8').split('\t')
 ['tic', 'tac', 'toe']
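Applied to the line shown earlier in the thread, the decode-then-split route might look like the sketch below. It assumes the file is UTF-8, that the date column is day/month/year, and that the remaining columns are integers. None of that is confirmed in the thread, and the variable names are illustrative.

```python
from datetime import datetime

raw = b"Name\t31/12/2009\t0\t0\t0\r\n"  # last line as read in binary mode

# Decode once, drop the line ending, then split on a str separator.
fields = raw.decode("utf-8").rstrip("\r\n").split("\t")

name, date_text, *numbers = fields
day = datetime.strptime(date_text, "%d/%m/%Y").date()  # assumed dd/mm/yyyy
counts = [int(n) for n in numbers]                     # assumed integer columns
```

Stripping the line ending before splitting keeps the trailing '\r\n' out of the last field.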



#6427

From: "tkpmep@hotmail.com" <tkpmep@hotmail.com>
Date: 2011-05-27 12:21 -0700
Message-ID: <ae81fd19-eef7-4937-9003-75b6129906d5@32g2000vbe.googlegroups.com>
In reply to: #6296

This is exactly what I want to do - I can then pick up various
elements of the list and turn them into floats, ints, etc. I have not
ever used decode, and will look it up in the docs to better understand
it. I can't thank everyone enough for the generous serving of help and
guidance - I certainly would not have discovered all this on my own.

Sincerely


Thomas Philips



#6285

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2011-05-25 19:06 -0600
Message-ID: <mailman.2103.1306371996.9059.python-list@python.org>
In reply to: #6259

On Wed, May 25, 2011 at 3:52 PM, MRAB <python@mrabarnett.plus.com> wrote:
> What do you mean by "may include the decoder state in its return value"?
>
> It does make sense that the values returned from tell() won't be in the
> middle of an encoded sequence of bytes.

If you take a look at the source code, tell() returns a long that
includes decoder state data in the upper bytes.  For example:

>>> data = b' ' + '\u0302a'.encode('utf-16')
>>> data
b' \xff\xfe\x02\x03a\x00'
>>> f = open('test.txt', 'wb')
>>> f.write(data)
7
>>> f.close()
>>> f = open('test.txt', 'r', encoding='utf-16')
>>> f.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python32\lib\codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "c:\python32\lib\encodings\utf_16.py", line 61, in _buffer_decode
    codecs.utf_16_ex_decode(input, errors, 0, final)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 6-6: truncated data

The problem of course is the initial space, throwing off the decoder.
We can try to seek past it:

>>> f.seek(1)
1
>>> f.read()
'\ufeff\u0302a'

But notice that since we're not reading from the beginning of the
file, the BOM has now been interpreted as data.  However:

>>> f.seek(1 + (2 << 65))
73786976294838206465
>>> f.read()
'\u0302a'

And you can see that instead of reading from position
73786976294838206465 it has read from position 1 starting in the "read
a BOM" state.  Note that I wouldn't recommend doing anything remotely
like this in production code, not least because the value that I
passed into seek() is platform-dependent.  This is just a
demonstration of how the seek() value can include decoder state.

Cheers,
Ian
