Proper use of the codecs module.

Started by	Andrew <andrew@invalid.invalid>
First post	2013-08-16 10:02 -0400
Last post	2013-08-16 23:14 +0100
Articles	4 — 3 participants

  Proper use of the codecs module. Andrew <andrew@invalid.invalid> - 2013-08-16 10:02 -0400
    Re: Proper use of the codecs module. Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-08-16 19:12 +0000
      Re: Proper use of the codecs module. Andrew <andrew@invalid.invalid> - 2013-08-16 16:16 -0400
    Re: Proper use of the codecs module. Chris Angelico <rosuav@gmail.com> - 2013-08-16 23:14 +0100

#52590 — Proper use of the codecs module.

From	Andrew <andrew@invalid.invalid>
Date	2013-08-16 10:02 -0400
Subject	Proper use of the codecs module.
Message-ID	<1efhl8i0dmr9b.15q8opn6p0cj3.dlg@40tude.net>

I have a mixed binary/text file[0], and the text portions use a radically
nonstandard character set. I want to read them easily given information
about the character encoding and an offset for the beginning of a string. 

The descriptions of the codecs module and codecs.register() in particular
seem to suggest that this is already supported in the standard library.
However, I can't find any examples of its proper use. Most people who use
the module seem to want to read utf files in python 2.x.[1] I would like to
know how to correctly set up a new codec for reading files that have
nonstandard encodings. 

I have two other related questions: 

How does seek() work on a file opened in text mode? Does it seek to a
character offset or to a byte offset? I need the latter behavior. If I
can't get it I will have to find a different approach. 

The files I'm working with use a nonstandard end-of-string character in the
same fashion as C null-terminated strings. Is there a builtin function that
will read a file "from seek position until seeing EOS character X"? The
methods I see for this online seem to amount to reading one character at a
time and checking manually, which seems nonoptimal to me. 


[0] The file is an SNES ROM dump, but I don't think that matters. 
[1] I'm using Python 3, if it's relevant. 

-- 

Andrew

#52604

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-08-16 19:12 +0000
Message-ID	<520e7982$0$30000$c3e8da3$5496439d@news.astraweb.com>
In reply to	#52590

On Fri, 16 Aug 2013 10:02:08 -0400, Andrew wrote:

> I have a mixed binary/text file[0], and the text portions use a
> radically nonstandard character set. I want to read them easily given
> information about the character encoding and an offset for the beginning
> of a string.

"Mixed binary/text" is not a helpful model to use. You are better off 
thinking of the file as "binary", where some of the fields happen to 
contain text encoded with some custom codec.

If you try opening the file in text mode, you'll very likely break the 
binary parts (e.g. converting the two bytes 0x0D0A to a single byte 
0x0A). So best to stick to binary only, extract the "text" portions of 
the file, then explicitly decode them.
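
A minimal sketch of that binary-only approach, for illustration. The payload, the offset, the field length, and the use of latin-1 are all made-up stand-ins; the real file would need its own offsets and its own codec.

```python
import os
import tempfile

# Stand-in for the real file: arbitrary binary data (including 0x0D 0x0A)
# with a short text field at a known byte offset. All values are hypothetical.
payload = b"\x00\x01\x0d\x0a\xff" + b"caf\xe9" + b"\x02\x03"
TEXT_OFFSET = 5  # byte offset of the text field (assumed known)
TEXT_LENGTH = 4  # length of the text field in bytes (assumed known)

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # "rb" means no newline translation and no implicit decoding,
    # so seek() and read() work in plain byte offsets.
    with open(path, "rb") as f:
        f.seek(TEXT_OFFSET)
        raw = f.read(TEXT_LENGTH)
finally:
    os.remove(path)

# Decode only the extracted bytes, and do it explicitly.
text = raw.decode("latin-1")
print(repr(raw), repr(text))
```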

> The descriptions of the codecs module and codecs.register() in
> particular seem to suggest that this is already supported in the
> standard library. However, I can't find any examples of its proper use.
> Most people who use the module seem to want to read utf files in python
> 2.x.[1] I would like to know how to correctly set up a new codec for
> reading files that have nonstandard encodings.

I suggest you look at the source code for the dozens of codecs in the 
standard library. E.g. /usr/local/lib/python3.3/encodings/palmos.py

(Adjust for your installation location as required.)
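
For reference, a rough sketch of what a small table-driven codec registered through codecs.register() might look like. The codec name "romdemo" and the byte-to-character table are invented for the example, and the stdlib modules mentioned above also define incremental and stream classes that are left out here.

```python
import codecs

# Hypothetical table: bytes 0x00..0x19 map to 'A'..'Z', 0x1A maps to a space.
DECODING_MAP = {i: ord("A") + i for i in range(26)}
DECODING_MAP[0x1A] = ord(" ")
# Invert the table so the codec can also encode.
ENCODING_MAP = codecs.make_encoding_map(DECODING_MAP)


def _decode(data, errors="strict"):
    # Returns a (str, bytes_consumed) tuple, as the codec machinery expects.
    return codecs.charmap_decode(data, errors, DECODING_MAP)


def _encode(text, errors="strict"):
    # Returns a (bytes, chars_consumed) tuple.
    return codecs.charmap_encode(text, errors, ENCODING_MAP)


def _search(name):
    # Python calls this with the normalised (lower-cased) encoding name.
    if name == "romdemo":
        return codecs.CodecInfo(name="romdemo", encode=_encode, decode=_decode)
    return None


codecs.register(_search)

decoded = bytes([7, 4, 11, 11, 14]).decode("romdemo")
encoded = "HELLO".encode("romdemo")
print(decoded, list(encoded))
```

Once the search function is registered, the name works anywhere an encoding name is accepted, such as bytes.decode() and str.encode(). Bytes missing from the table raise UnicodeDecodeError under the default "strict" error handler.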

> I have two other related questions:
> 
> How does seek() work on a file opened in text mode? Does it seek to a
> character offset or to a byte offset? I need the latter behavior. If I
> can't get it I will have to find a different approach.

For text files, seek() is only legal for offsets that tell() can return, 
but this is not enforced, so you can get nasty rubbish like this:

py> f = open('/tmp/t', 'w', encoding='utf-32')
py> f.write('hello world')
11
py> f.close()
py> f = open('/tmp/t', 'r', encoding='utf-32')
py> f.read(1)
'h'
py> f.tell()
8
py> f.seek(3)
3
py> f.read(1)
'栀'

So I prefer not to seek in text files if I can help it.

> The files I'm working with use a nonstandard end-of-string character in
> the same fashion as C null-terminated strings. Is there a builtin
> function that will read a file "from seek position until seeing EOS
> character X"? The methods I see for this online seem to amount to
> reading one character at a time and checking manually, which seems
> nonoptimal to me.

How do you think such a built-in function would work, if not inspect each 
character until the EOS character is seen? :-)

There is no such built-in function though. By default, Python files are 
buffered, so it won't literally read one character from disk at a time. 
The actual disk IO will read a bunch of bytes into a memory buffer, and 
then read from the buffer.
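
One way to avoid calling read(1) in a loop is to read a chunk at a time and search each chunk for the terminator. The helper below is only a sketch: read_until is a made-up name, it assumes a seekable file opened in binary mode, and it assumes the terminator is a single byte, so it can never straddle two chunks.

```python
import io


def read_until(f, terminator, chunk_size=4096):
    """Return the bytes from f's current position up to the terminator.

    The terminator itself is not returned. The file position is left just
    after the terminator, or at end of file if it was never found.
    """
    if len(terminator) != 1:
        raise ValueError("this sketch only handles a one-byte terminator")
    start = f.tell()
    parts = []
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            # End of file without a terminator: return what was read.
            return b"".join(parts)
        idx = chunk.find(terminator)
        if idx != -1:
            parts.append(chunk[:idx])
            result = b"".join(parts)
            # Step back over whatever was read past the terminator.
            f.seek(start + len(result) + 1)
            return result
        parts.append(chunk)


# BytesIO stands in for a file opened with open(path, "rb").
f = io.BytesIO(b"\x10\x11ABC\xffDEF\xff")
f.seek(2)
first = read_until(f, b"\xff", chunk_size=2)
second = read_until(f, b"\xff", chunk_size=2)
print(first, second)
```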

-- 
Steven

#52608

From	Andrew <andrew@invalid.invalid>
Date	2013-08-16 16:16 -0400
Message-ID	<v7msphss82ld.xtldrmq8awtj$.dlg@40tude.net>
In reply to	#52604

On 16 Aug 2013 19:12:02 GMT, Steven D'Aprano wrote:

> If you try opening the file in text mode, you'll very likely break the 
> binary parts (e.g. converting the two bytes 0x0D0A to a single byte 
> 0x0A). So best to stick to binary only, extract the "text" portions of 
> the file, then explicitly decode them.

Okay, I'll do that. Given what you said about seek() and text mode below, I
have no choice anyway. 

>> I would like to know how to correctly set up a new codec for
>> reading files that have nonstandard encodings.
> 
> I suggest you look at the source code for the dozens of codecs in the 
> standard library. E.g. /usr/local/lib/python3.3/encodings/palmos.py

I'll do that too. My thanks for the pointer.

>> How does seek() work on a file opened in text mode? Does it seek to a
>> character offset or to a byte offset? I need the latter behavior. If I
>> can't get it I will have to find a different approach.
> 
> For text files, seek() is only legal for offsets that tell() can return, 
> but this is not enforced, so you can get nasty rubbish like this:
> 
> <snip evil>
> 
> So I prefer not to seek in text files if I can help it.

If I'm understanding the above right, it seeks to a byte offset but the
behavior is undocumented, not guaranteed, shouldn't be used, etc. That
would actually work for me in theory (because I have exact byte offsets to
work with) but I think I'll avoid it anyway, on the grounds that relying on
undocumented behavior is bad. 

>> The files I'm working with use a nonstandard end-of-string character in
>> the same fashion as C null-terminated strings. Is there a builtin
>> function that will read a file "from seek position until seeing EOS
>> character X"? The methods I see for this online seem to amount to
>> reading one character at a time and checking manually, which seems
>> nonoptimal to me.
> 
> How do you think such a built-in function would work, if not inspect each 
> character until the EOS character is seen? :-)

I don't know, but I'm assuming it wouldn't involve a function call to
file.read(1) for each character, and that's what Google keeps handing me.
Such an approach fills me with horror. :-) I suppose there's nothing
stopping me from reading some educated guess at the length of the string
and then stepping through the result. Or I'll look at the readline() source
and see how it does its thing.

> There is no such built-in function though. By default, Python files are 
> buffered, so it won't literally read one character from disk at a time. 
> The actual disk IO will read a bunch of bytes into a memory buffer, and 
> then read from the buffer.

I'd guessed as much, but assumed there was still ridiculous function call
overhead involved in the repeated read(1) method above. Of course, trying
to avoid said overhead is premature optimization; my interest in doing so
is more aesthetic than anything else. 

Thanks for the help.

-- 

Andrew

#52610

From	Chris Angelico <rosuav@gmail.com>
Date	2013-08-16 23:14 +0100
Message-ID	<mailman.7.1376691263.23369.python-list@python.org>
In reply to	#52590

On Fri, Aug 16, 2013 at 3:02 PM, Andrew <andrew@invalid.invalid> wrote:
> I have a mixed binary/text file[0], and the text portions use a radically
> nonstandard character set. I want to read them easily given information
> about the character encoding and an offset for the beginning of a string.

To add to all the information already given: Is the file small enough
to comfortably fit into memory? If so, you'll find it a LOT easier to
play with strings in RAM than files on disk. Even if not, you may find
a lot of tasks simplified by just reading a kilobyte or a megabyte in
and then working within that. That spares you the fiddliness of read(1)
all the time, at the expense of potentially reading more than you need.
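
As a sketch of the in-memory approach: once the whole file is held in a bytes object, finding the end-of-string byte is a single find() call. The function name extract_string, the sample data, and the 0xFF terminator are all illustrative, and "ascii" stands in for whatever custom codec the file really uses.

```python
def extract_string(data, offset, terminator, encoding):
    """Decode the bytes from offset up to the next terminator byte."""
    end = data.find(terminator, offset)
    if end == -1:
        # No terminator found: take everything to the end of the data.
        end = len(data)
    return data[offset:end].decode(encoding)


# Stand-in for the result of open(path, "rb").read().
rom = b"\x00\x01\x02Hello\xffWorld\xff\x03"

first = extract_string(rom, 3, b"\xff", "ascii")
second = extract_string(rom, 9, b"\xff", "ascii")
print(first, second)
```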

ChrisA
