Groups > comp.lang.python > #6259 > unrolled thread
| Started by | "tkpmep@hotmail.com" <tkpmep@hotmail.com> |
|---|---|
| First post | 2011-05-25 12:33 -0700 |
| Last post | 2011-05-25 19:06 -0600 |
| Articles | 11 — 5 participants |
Python 3.2 bug? Reading the last line of a file "tkpmep@hotmail.com" <tkpmep@hotmail.com> - 2011-05-25 12:33 -0700
Re: Python 3.2 bug? Reading the last line of a file MRAB <python@mrabarnett.plus.com> - 2011-05-25 21:00 +0100
Re: Python 3.2 bug? Reading the last line of a file Ian Kelly <ian.g.kelly@gmail.com> - 2011-05-25 14:54 -0600
Re: Python 3.2 bug? Reading the last line of a file MRAB <python@mrabarnett.plus.com> - 2011-05-25 22:52 +0100
Re: Python 3.2 bug? Reading the last line of a file "tkpmep@hotmail.com" <tkpmep@hotmail.com> - 2011-05-25 16:25 -0700
Re: Python 3.2 bug? Reading the last line of a file Ethan Furman <ethan@stoneleaf.us> - 2011-05-25 16:58 -0700
Re: Python 3.2 bug? Reading the last line of a file MRAB <python@mrabarnett.plus.com> - 2011-05-26 00:56 +0100
Re: Python 3.2 bug? Reading the last line of a file Ethan Furman <ethan@stoneleaf.us> - 2011-05-25 17:32 -0700
Re: Python 3.2 bug? Reading the last line of a file Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2011-05-26 08:09 +0300
Re: Python 3.2 bug? Reading the last line of a file "tkpmep@hotmail.com" <tkpmep@hotmail.com> - 2011-05-27 12:21 -0700
Re: Python 3.2 bug? Reading the last line of a file Ian Kelly <ian.g.kelly@gmail.com> - 2011-05-25 19:06 -0600
| From | "tkpmep@hotmail.com" <tkpmep@hotmail.com> |
|---|---|
| Date | 2011-05-25 12:33 -0700 |
| Subject | Python 3.2 bug? Reading the last line of a file |
| Message-ID | <3d81e2a0-6c86-4f12-a1c4-ce4c736172b6@y31g2000vbp.googlegroups.com> |
The following function that returns the last line of a file works
perfectly well under Python 2.7.1, but fails reliably under Python 3.2.
Is this a bug, or am I doing something wrong? Any help would be
greatly appreciated.
import os

def lastLine(filename):
    '''
    Returns the last line of a file
    file.seek takes an optional 'whence' argument which allows you to
    start looking at the end, so you can just work back from there till
    you hit the first newline that has anything after it
    Works perfectly under Python 2.7, but not under 3.2!
    '''
    offset = -50
    with open(filename) as f:
        while offset > -1024:
            offset *= 2
            f.seek(offset, os.SEEK_END)
            lines = f.readlines()
            if len(lines) > 1:
                return lines[-1]

If I execute this with a valid filename fn, I get the following error
message:
>>> lastLine(fn)
Traceback (most recent call last):
  File "<pyshell#12>", line 1, in <module>
    lastLine(fn)
  File "<pyshell#11>", line 13, in lastLine
    f.seek(offset, os.SEEK_END)
io.UnsupportedOperation: can't do nonzero end-relative seeks

Sincerely
Thomas Philips
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2011-05-25 21:00 +0100 |
| Message-ID | <mailman.2091.1306353619.9059.python-list@python.org> |
| In reply to | #6259 |
On 25/05/2011 20:33, tkpmep@hotmail.com wrote:
> The following function that returns the last line of a file works
> perfectly well under Python 2.7.1, but fails reliably under Python 3.2.
> Is this a bug, or am I doing something wrong? Any help would be
> greatly appreciated.
>
> import os
>
> def lastLine(filename):
>     '''
>     Returns the last line of a file
>     file.seek takes an optional 'whence' argument which allows you to
>     start looking at the end, so you can just work back from there till
>     you hit the first newline that has anything after it
>     Works perfectly under Python 2.7, but not under 3.2!
>     '''
>     offset = -50
>     with open(filename) as f:
>         while offset > -1024:
>             offset *= 2
>             f.seek(offset, os.SEEK_END)
>             lines = f.readlines()
>             if len(lines) > 1:
>                 return lines[-1]
>
> If I execute this with a valid filename fn, I get the following error
> message:
>
> >>> lastLine(fn)
> Traceback (most recent call last):
>   File "<pyshell#12>", line 1, in <module>
>     lastLine(fn)
>   File "<pyshell#11>", line 13, in lastLine
>     f.seek(offset, os.SEEK_END)
> io.UnsupportedOperation: can't do nonzero end-relative seeks
>
You're opening the file in text mode, and seeking relative to the end
of the file is not allowed in text mode, presumably because the file
contents have to be decoded, and, in general, seeking to an arbitrary
position within a sequence of encoded bytes can have undefined results
when you attempt to decode to Unicode starting from that position.

The strange thing is that you _are_ allowed to seek relative to the
start of the file.

Try opening the file in binary mode and do the decoding yourself,
catching the UnicodeDecodeError exceptions if/when they occur.
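
For anyone landing on this thread later, here is a minimal sketch of the binary-mode approach suggested above. It is not the original poster's final code: the name last_line, the chunk parameter and the default encoding are assumptions made for illustration, and it presumes an ASCII-compatible encoding such as UTF-8, where a newline byte never occurs inside a multi-byte character.

```python
import os

def last_line(filename, encoding="utf-8", chunk=100):
    """Return the last line of a file as str, without its line ending."""
    with open(filename, "rb") as f:              # binary mode permits end-relative seeks
        size = f.seek(0, os.SEEK_END)            # seek() returns the new absolute position
        while True:
            start = max(0, size - chunk)         # never seek before the start of the file
            f.seek(start)
            tail = f.read().rstrip(b"\r\n")      # ignore the file's trailing newline(s)
            cut = tail.rfind(b"\n")              # last newline before the final line
            if cut != -1 or start == 0:
                # Either a complete final line is in the window,
                # or the whole file has been read.
                return tail[cut + 1:].decode(encoding)
            chunk *= 2                           # window too small, widen and retry
```

Decoding happens only after the bytes have been cut at a newline, so the decoder never starts in the middle of a character. A caller who expects damaged input could wrap the call in try/except UnicodeDecodeError.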
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2011-05-25 14:54 -0600 |
| Message-ID | <mailman.2093.1306356898.9059.python-list@python.org> |
| In reply to | #6259 |
On Wed, May 25, 2011 at 2:00 PM, MRAB <python@mrabarnett.plus.com> wrote:
> You're opening the file in text mode, and seeking relative to the end
> of the file is not allowed in text mode, presumably because the file
> contents have to be decoded, and, in general, seeking to an arbitrary
> position within a sequence of encoded bytes can have undefined results
> when you attempt to decode to Unicode starting from that position.
>
> The strange thing is that you _are_ allowed to seek relative to the
> start of the file.

I think that with text files seek() is only really meant to be called
with values returned from tell(), which may include the decoder state
in its return value.
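
A small sketch of that usage pattern, for illustration only: the scratch file demo.txt and the variable names are made up here, and the value returned by tell() is treated as an opaque cookie rather than as a byte offset.

```python
import io
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("first\nsecond\nthird\n")

with open(path, "r", encoding="utf-8") as f:
    f.readline()                 # consume "first\n"
    mark = f.tell()              # opaque cookie, only meaningful to seek()
    rest = f.read()              # everything after the first line
    f.seek(mark)                 # return to the remembered position
    same_rest = f.read()         # reads the same text again

    f.seek(0, os.SEEK_END)       # a zero offset from the end is allowed
    try:
        f.seek(-5, os.SEEK_END)  # a nonzero end-relative seek is refused in text mode
        refused = False
    except io.UnsupportedOperation:
        refused = True
```

Here rest and same_rest hold the same text, and refused ends up True, which matches the error in the original post.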
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2011-05-25 22:52 +0100 |
| Message-ID | <mailman.2096.1306360361.9059.python-list@python.org> |
| In reply to | #6259 |
On 25/05/2011 21:54, Ian Kelly wrote:
> On Wed, May 25, 2011 at 2:00 PM, MRAB <python@mrabarnett.plus.com> wrote:
>> You're opening the file in text mode, and seeking relative to the end
>> of the file is not allowed in text mode, presumably because the file
>> contents have to be decoded, and, in general, seeking to an arbitrary
>> position within a sequence of encoded bytes can have undefined results
>> when you attempt to decode to Unicode starting from that position.
>>
>> The strange thing is that you _are_ allowed to seek relative to the
>> start of the file.
>
> I think that with text files seek() is only really meant to be called
> with values returned from tell(), which may include the decoder state
> in its return value.

What do you mean by "may include the decoder state in its return value"?

It does make sense that the values returned from tell() won't be in the
middle of an encoded sequence of bytes.
| From | "tkpmep@hotmail.com" <tkpmep@hotmail.com> |
|---|---|
| Date | 2011-05-25 16:25 -0700 |
| Message-ID | <55262a36-ca53-48dd-b563-1847f9442bae@dn9g2000vbb.googlegroups.com> |
| In reply to | #6268 |
Thanks for the guidance - it was indeed an issue with reading in
binary vs. text, and I do now succeed in reading the last line,
except that I now seem unable to split it, as I demonstrate below.
Here's what I get when I read the last line in text mode using 2.7.1
and in binary mode using 3.2 respectively under IDLE:
2.7.1
Name 31/12/2009 0 0 0
3.2
b'Name\t31/12/2009\t0\t0\t0\r\n'
If, under 2.7.1, I read the file in text mode and write
>>> x = lastLine(fn)
I can then cleanly split the line to get its contents
>>> x.split('\t')
['Name', '31/12/2009', '0', '0', '0\n']
but under 3.2, with its binary read, I get
>>> x.split('\t')
Traceback (most recent call last):
  File "<pyshell#26>", line 1, in <module>
    x.split('\t')
TypeError: Type str doesn't support the buffer API
If I remove the '\t', the split now works and I get a list of bytes
literals
>>> x.split()
[b'Name', b'31/12/2009', b'0', b'0', b'0']
Looking through the docs did not clarify my understanding of the
issue. Why can I not split on '\t' when reading in binary mode?
Sincerely
Thomas Philips
| From | Ethan Furman <ethan@stoneleaf.us> |
|---|---|
| Date | 2011-05-25 16:58 -0700 |
| Message-ID | <mailman.2098.1306367183.9059.python-list@python.org> |
| In reply to | #6273 |
tkpmep@hotmail.com wrote:
> Thanks for the guidance - it was indeed an issue with reading in
> binary vs. text., and I do now succeed in reading the last line,
> except that I now seem unable to split it, as I demonstrate below.
> Here's what I get when I read the last line in text mode using 2.7.1
> and in binary mode using 3.2 respectively under IDLE:
>
> 3.2
> b'Name\t31/12/2009\t0\t0\t0\r\n'
>
> under 3.2, with its binary read, I get
>--> x.split('\t')
> Traceback (most recent call last):
> File "<pyshell#26>", line 1, in <module>
> x.split('\t')
> TypeError: Type str doesn't support the buffer API
You are trying to split a bytes object with a str object -- the two are
not compatible. Try splitting with the bytes object b'\t'.
~Ethan~
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2011-05-26 00:56 +0100 |
| Message-ID | <mailman.2100.1306367807.9059.python-list@python.org> |
| In reply to | #6273 |
On 26/05/2011 00:25, tkpmep@hotmail.com wrote:
> Thanks for the guidance - it was indeed an issue with reading in
> binary vs. text., and I do now succeed in reading the last line,
> except that I now seem unable to split it, as I demonstrate below.
> Here's what I get when I read the last line in text mode using 2.7.1
> and in binary mode using 3.2 respectively under IDLE:
>
> 2.7.1
> Name 31/12/2009 0 0 0
>
> 3.2
> b'Name\t31/12/2009\t0\t0\t0\r\n'
>
> if, under 2.7.1 I read the file in text mode and write
>>>> x = lastLine(fn)
> I can then cleanly split the line to get its contents
>>>> x.split('\t')
> ['Name', '31/12/2009', '0', '0', '0\n']
>
> but under 3.2, with its binary read, I get
>>>> x.split('\t')
> Traceback (most recent call last):
> File "<pyshell#26>", line 1, in<module>
> x.split('\t')
> TypeError: Type str doesn't support the buffer API
>
> If I remove the '\t', the split now works and I get a list of bytes
> literals
>>>> x.split()
> [b'Name', b'31/12/2009', b'0', b'0', b'0']
>
> Looking through the docs did not clarify my understanding of the
> issue. Why can I not split on '\t' when reading in binary mode?
>
x.split('\t') tries to split on '\t', a string (str), but x is a
bytestring (bytes).
Do x.split(b'\t') instead.
| From | Ethan Furman <ethan@stoneleaf.us> |
|---|---|
| Date | 2011-05-25 17:32 -0700 |
| Message-ID | <mailman.2101.1306369224.9059.python-list@python.org> |
| In reply to | #6273 |
MRAB wrote:
> On 26/05/2011 00:25, tkpmep@hotmail.com wrote:
>> Thanks for the guidance - it was indeed an issue with reading in
>> binary vs. text., and I do now succeed in reading the last line,
>> except that I now seem unable to split it, as I demonstrate below.
>> Here's what I get when I read the last line in text mode using 2.7.1
>> and in binary mode using 3.2 respectively under IDLE:
>>
>> 2.7.1
>> Name 31/12/2009 0 0 0
>>
>> 3.2
>> b'Name\t31/12/2009\t0\t0\t0\r\n'
>>
>> if, under 2.7.1 I read the file in text mode and write
>>>>> x = lastLine(fn)
>> I can then cleanly split the line to get its contents
>>>>> x.split('\t')
>> ['Name', '31/12/2009', '0', '0', '0\n']
>>
>> but under 3.2, with its binary read, I get
>>>>> x.split('\t')
>> Traceback (most recent call last):
>> File "<pyshell#26>", line 1, in<module>
>> x.split('\t')
>> TypeError: Type str doesn't support the buffer API
>>
>> If I remove the '\t', the split now works and I get a list of bytes
>> literals
>>>>> x.split()
>> [b'Name', b'31/12/2009', b'0', b'0', b'0']
>>
>> Looking through the docs did not clarify my understanding of the
>> issue. Why can I not split on '\t' when reading in binary mode?
>>
> x.split('\t') tries to split on '\t', a string (str), but x is a
> bytestring (bytes).
>
> Do x.split(b'\t') instead.
<nitpick>
Instances of the bytes class are more appropriately called 'bytes
objects' rather than 'bytestrings' as they are really sequences of integers.
Accessing a single element of a bytes object does not return a bytes
object, but rather the integer at that location; i.e.
--> b'xyz'[1]
121
Contrast that with the str type where
--> 'xyz'[1]
'y'
</nitpick>
~Ethan~
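
A short illustration of that distinction in Python 3. The variable names are arbitrary; the point is only that indexing and slicing a bytes object behave differently.

```python
data = b"xyz"

one = data[1]            # indexing a bytes object yields an int
piece = data[1:2]        # slicing yields a bytes object of length 1
codes = list(data)       # iterating yields the integer value of each byte
rebuilt = bytes([one])   # an iterable of ints converts back to bytes

text = "xyz"
char = text[1]           # indexing a str yields a str of length 1
```

Slicing is therefore the way to get a one-byte bytes object out of a larger one.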
| From | Jussi Piitulainen <jpiitula@ling.helsinki.fi> |
|---|---|
| Date | 2011-05-26 08:09 +0300 |
| Message-ID | <qot7h9eqc8h.fsf@ruuvi.it.helsinki.fi> |
| In reply to | #6273 |
tkpmep@hotmail.com writes:
> Looking through the docs did not clarify my understanding of the
> issue. Why can I not split on '\t' when reading in binary mode?
You can split on b'\t' to get a list of byteses, which you can then
decode if you want them as strings.
You can decode the bytes to get a string and then split on '\t' to get
strings.
>>> b'tic\ttac\ttoe'.split(b'\t')
[b'tic', b'tac', b'toe']
>>> b'tic\ttac\ttoe'.decode('utf-8').split('\t')
['tic', 'tac', 'toe']
| From | "tkpmep@hotmail.com" <tkpmep@hotmail.com> |
|---|---|
| Date | 2011-05-27 12:21 -0700 |
| Message-ID | <ae81fd19-eef7-4937-9003-75b6129906d5@32g2000vbe.googlegroups.com> |
| In reply to | #6296 |
This is exactly what I want to do - I can then pick up various elements
of the list and turn them into floats, ints, etc. I have not ever used
decode, and will look it up in the docs to better understand it.

I can't thank everyone enough for the generous serving of help and
guidance - I certainly would not have discovered all this on my own.

Sincerely
Thomas Philips
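
As a rough sketch of that last step, the sample line from earlier in the thread can be decoded, split and converted as below. The function name parse_record, the meaning assigned to each column, the UTF-8 encoding and the day/month/year date format are all assumptions, since the thread does not say what the fields represent.

```python
from datetime import datetime

raw = b"Name\t31/12/2009\t0\t0\t0\r\n"   # the sample line shown earlier in the thread

def parse_record(raw_line, encoding="utf-8"):
    """Decode one tab-separated line and convert its fields (hypothetical layout)."""
    # Decode first, then strip the line ending and split on a str separator.
    fields = raw_line.decode(encoding).rstrip("\r\n").split("\t")
    name, date_text, *numbers = fields
    # Assumed format: 31/12/2009 can only be day/month/year.
    date = datetime.strptime(date_text, "%d/%m/%Y").date()
    return name, date, [float(n) for n in numbers]

record = parse_record(raw)
```

Stripping "\r\n" before the split avoids the stray line ending that was attached to the last field in the 2.7.1 output.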
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2011-05-25 19:06 -0600 |
| Message-ID | <mailman.2103.1306371996.9059.python-list@python.org> |
| In reply to | #6259 |
On Wed, May 25, 2011 at 3:52 PM, MRAB <python@mrabarnett.plus.com> wrote:
> What do you mean by "may include the decoder state in its return value"?
>
> It does make sense that the values returned from tell() won't be in the
> middle of an encoded sequence of bytes.
If you take a look at the source code, tell() returns a long that
includes decoder state data in the upper bytes. For example:
>>> data = b' ' + '\u0302a'.encode('utf-16')
>>> data
b' \xff\xfe\x02\x03a\x00'
>>> f = open('test.txt', 'wb')
>>> f.write(data)
7
>>> f.close()
>>> f = open('test.txt', 'r', encoding='utf-16')
>>> f.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python32\lib\codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "c:\python32\lib\encodings\utf_16.py", line 61, in _buffer_decode
    codecs.utf_16_ex_decode(input, errors, 0, final)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 6-6: truncated data
The problem of course is the initial space, throwing off the decoder.
We can try to seek past it:
>>> f.seek(1)
1
>>> f.read()
'\ufeff\u0302a'
But notice that since we're not reading from the beginning of the
file, the BOM has now been interpreted as data. However:
>>> f.seek(1 + (2 << 65))
73786976294838206465
>>> f.read()
'\u0302a'
And you can see that instead of reading from position
73786976294838206465 it has read from position 1 starting in the "read
a BOM" state. Note that I wouldn't recommend doing anything remotely
like this in production code, not least because the value that I
passed into seek() is platform-dependent. This is just a
demonstration of how the seek() value can include decoder state.
Cheers,
Ian