Re: What happens when python seeks a text file

Started by: dieter <dieter@handshake.de>
First post: 2015-07-29 07:52 +0200
Last post: 2015-07-29 07:52 +0200
Articles: 1 (1 participant)

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

#94720 — Re: What happens when python seeks a text file

From: dieter <dieter@handshake.de>
Date: 2015-07-29 07:52 +0200
Subject: Re: What happens when python seeks a text file
Message-ID: <mailman.1058.1438149164.3674.python-list@python.org>

"=?GBK?B?wO68zsX0?=" <lijpbasin@126.com> writes:

> Hi, I tried using seek to reverse a text file after reading about the
> subject in the documentation:
>
> https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects
>
> https://docs.python.org/3/library/io.html#io.TextIOBase.seek
> ...
> However, an exception is raised if a file with the same content encoded in
> GBK is provided:
>
>     $ ./reverse_text_by_seek3.py Moon-gbk.txt
>     [0, 7, 8, 19, 21, 32, 42, 53, 64]
>     低头思故乡
>     举头望明月
>     Traceback (most recent call last):
>       File "./reverse_text_by_seek3.py", line 21, in <module>
>         print(f.readline(), end="")
>     UnicodeDecodeError: 'gbk' codec can't decode byte 0xaa in position 8: illegal multibyte sequence

The "seek" works at the byte level, while decoding works at the character
level, where a single character can be composed of several bytes.

The error you observe indicates that you have "seeked" somewhere
inside a character, not at a legal character beginning.
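
As a rough illustration of the point above, here is a minimal sketch that
decodes the same bytes from a character beginning and from inside a
character. The sample line and the helper name `try_decode` are assumptions
made for this example; they are not taken from the original script.

```python
# Hypothetical sample line; any text with multi-byte characters would do.
LINE = "低头思故乡\n"


def try_decode(data, offset, encoding):
    """Decode data[offset:], returning None if the bytes are not valid text."""
    try:
        return data[offset:].decode(encoding)
    except UnicodeDecodeError:
        return None


for encoding in ("gbk", "utf-8"):
    data = LINE.encode(encoding)
    print(encoding, "uses", len(data), "bytes for", len(LINE), "characters")
    # Offset 0 is a character beginning; offset 1 is inside the first character.
    for offset in (0, 1):
        print("  offset", offset, "->", repr(try_decode(data, offset, encoding)))
```

Depending on the codec and on the bytes that follow, starting inside a
character can either raise an error or decode to the wrong characters, so the
absence of an exception does not prove the position was valid.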

That you get an error for "gbk" and not for "utf-8" is a bit of
an "accident". The same problem can happen for "utf-8", but the probability
might be slightly lower.


Seek only to byte positions that you know are also
character beginnings -- e.g. line beginnings.
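
One possible way to follow this advice, sketched under some assumptions: open
the file in binary mode, record the byte offset of every line beginning, then
seek only to those offsets and decode each line separately. The function
names and the sample file are hypothetical, and the approach assumes an
encoding in which the byte 0x0A occurs only as a newline (true for GBK and
UTF-8, not for UTF-16).

```python
import os
import tempfile


def line_offsets(path):
    """Return the byte offset at which each line of the file begins."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            if not f.readline():  # empty bytes object means end of file
                break
            offsets.append(pos)
    return offsets


def reversed_lines(path, encoding):
    """Return the decoded lines of the file, last line first."""
    result = []
    with open(path, "rb") as f:
        for pos in reversed(line_offsets(path)):
            f.seek(pos)  # always a line beginning, hence a character beginning
            result.append(f.readline().decode(encoding))
    return result


# Build a small GBK-encoded sample file (hypothetical content) and reverse it.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with open(sample_path, "wb") as f:
        f.write("举头望明月\n低头思故乡\n".encode("gbk"))
    print(line_offsets(sample_path))
    for line in reversed_lines(sample_path, "gbk"):
        print(line, end="")
finally:
    os.remove(sample_path)
```

Decoding each line only after it has been read as bytes keeps the seek
arithmetic entirely at the byte level, so no offset can land inside a
character.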