| Started by | Andrew <andrew@invalid.invalid> |
|---|---|
| First post | 2013-08-16 10:02 -0400 |
| Last post | 2013-08-16 23:14 +0100 |
| Articles | 4 — 3 participants |
Proper use of the codecs module. Andrew <andrew@invalid.invalid> - 2013-08-16 10:02 -0400
Re: Proper use of the codecs module. Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-08-16 19:12 +0000
Re: Proper use of the codecs module. Andrew <andrew@invalid.invalid> - 2013-08-16 16:16 -0400
Re: Proper use of the codecs module. Chris Angelico <rosuav@gmail.com> - 2013-08-16 23:14 +0100
| From | Andrew <andrew@invalid.invalid> |
|---|---|
| Date | 2013-08-16 10:02 -0400 |
| Subject | Proper use of the codecs module. |
| Message-ID | <1efhl8i0dmr9b.15q8opn6p0cj3.dlg@40tude.net> |
I have a mixed binary/text file[0], and the text portions use a radically nonstandard character set. I want to read them easily given information about the character encoding and an offset for the beginning of a string.

The descriptions of the codecs module and codecs.register() in particular seem to suggest that this is already supported in the standard library. However, I can't find any examples of its proper use. Most people who use the module seem to want to read utf files in python 2.x.[1] I would like to know how to correctly set up a new codec for reading files that have nonstandard encodings.

I have two other related questions:

How does seek() work on a file opened in text mode? Does it seek to a character offset or to a byte offset? I need the latter behavior. If I can't get it I will have to find a different approach.

The files I'm working with use a nonstandard end-of-string character in the same fashion as C null-terminated strings. Is there a builtin function that will read a file "from seek position until seeing EOS character X"? The methods I see for this online seem to amount to reading one character at a time and checking manually, which seems nonoptimal to me.

[0] The file is an SNES ROM dump, but I don't think that matters.
[1] I'm using Python 3, if it's relevant.

--
Andrew
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-08-16 19:12 +0000 |
| Message-ID | <520e7982$0$30000$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #52590 |
On Fri, 16 Aug 2013 10:02:08 -0400, Andrew wrote:
> I have a mixed binary/text file[0], and the text portions use a
> radically nonstandard character set. I want to read them easily given
> information about the character encoding and an offset for the beginning
> of a string.
"Mixed binary/text" is not a helpful model to use. You are better off
thinking of the file as "binary", where some of the fields happen to
contain text encoded with some custom codec.
If you try opening the file in text mode, you'll very likely break the
binary parts (e.g. converting the two bytes 0x0D0A to a single byte
0x0A). So best to stick to binary only, extract the "text" portions of
the file, then explicitly decode them.
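A minimal sketch of the binary-only approach described above, assuming a one-byte-per-character encoding. The table `TABLE` and the helper names `decode_field` and `read_text_field` are hypothetical, invented for illustration; they are not a real SNES character table or a standard-library API.

```python
import io

# Hypothetical decoding table: byte value -> character.
TABLE = {0x00: "A", 0x01: "B", 0x02: "C"}

def decode_field(raw, table=TABLE):
    """Translate each byte through the table; unknown bytes become U+FFFD."""
    return "".join(table.get(b, "\ufffd") for b in raw)

def read_text_field(f, offset, length):
    """Seek to a byte offset in a binary file object and decode `length` bytes."""
    f.seek(offset)
    return decode_field(f.read(length))

# io.BytesIO stands in for a file opened with open(path, "rb").
rom = io.BytesIO(b"\xff\xff\x02\x00\x01\xff")
text = read_text_field(rom, 2, 3)
```

Because the file object is binary, nothing touches the non-text bytes, and the only decoding that happens is the explicit call on the extracted field.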
> The descriptions of the codecs module and codecs.register() in
> particular seem to suggest that this is already supported in the
> standard library. However, I can't find any examples of its proper use.
> Most people who use the module seem to want to read utf files in python
> 2.x.[1] I would like to know how to correctly set up a new codec for
> reading files that have nonstandard encodings.
I suggest you look at the source code for the dozens of codecs in the
standard library. E.g. /usr/local/lib/python3.3/encodings/palmos.py
(Adjust for your installation location as required.)
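For orientation, here is one possible way to register a custom codec with `codecs.register()`. It is a sketch under assumptions, not a copy of how the standard-library codec modules are written: the codec name `snesdemo` and the four-entry table are made up, and only the `strict`, `replace` (decode only) and `ignore` error modes are handled.

```python
import codecs

# Hypothetical byte <-> character table, invented for illustration only.
_DECODE = {0x00: "A", 0x01: "B", 0x02: "C", 0x03: " "}
_ENCODE = {ch: b for b, ch in _DECODE.items()}
CODEC_NAME = "snesdemo"  # made-up codec name

def _decode(data, errors="strict"):
    data = bytes(data)  # the codec machinery may pass a memoryview
    out = []
    for i, b in enumerate(data):
        if b in _DECODE:
            out.append(_DECODE[b])
        elif errors == "replace":
            out.append("\ufffd")
        elif errors != "ignore":
            raise UnicodeDecodeError(CODEC_NAME, data, i, i + 1, "byte not in table")
    return "".join(out), len(data)  # (decoded text, bytes consumed)

def _encode(text, errors="strict"):
    out = bytearray()
    for i, ch in enumerate(text):
        if ch in _ENCODE:
            out.append(_ENCODE[ch])
        elif errors != "ignore":  # every other mode is treated as strict here
            raise UnicodeEncodeError(CODEC_NAME, text, i, i + 1, "character not in table")
    return bytes(out), len(text)  # (encoded bytes, characters consumed)

def _search(name):
    """Search function: return a CodecInfo for our name, None for anything else."""
    if name == CODEC_NAME:
        return codecs.CodecInfo(name=CODEC_NAME, encode=_encode, decode=_decode)
    return None

codecs.register(_search)

decoded = b"\x02\x00\x01".decode("snesdemo")
encoded = "AB C".encode("snesdemo")
```

Once the search function is registered, the name works anywhere an encoding name is accepted for whole-buffer conversion, such as `bytes.decode()` and `str.encode()`. Stream use through `open(..., encoding=...)` would additionally need incremental decoder classes, which this sketch leaves out.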
> I have two other related questions:
>
> How does seek() work on a file opened in text mode? Does it seek to a
> character offset or to a byte offset? I need the latter behavior. If I
> can't get it I will have to find a different approach.
For text files, seek() is only legal for offsets that tell() can return,
but this is not enforced, so you can get nasty rubbish like this:
py> f = open('/tmp/t', 'w', encoding='utf-32')
py> f.write('hello world')
11
py> f.close()
py> f = open('/tmp/t', 'r', encoding='utf-32')
py> f.read(1)
'h'
py> f.tell()
8
py> f.seek(3)
3
py> f.read(1)
'栀'
So I prefer not to seek in text files if I can help it.
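By contrast, a file opened in binary mode seeks and tells in plain byte offsets. The sketch below shows this with a temporary file; the data is encoded as UTF-32-LE (no byte-order mark) purely so that every character occupies exactly four bytes, which is an assumption of this example and not something from the original question.

```python
import tempfile

data = "hello world".encode("utf-32-le")  # 4 bytes per character, no BOM

with tempfile.TemporaryFile() as f:  # opened in binary mode ("w+b") by default
    f.write(data)
    f.seek(4)            # byte offset 4 is the start of the second character
    chunk = f.read(4)    # exactly four bytes
    position = f.tell()  # a plain byte count in binary mode

second_char = chunk.decode("utf-32-le")
```

Decoding is done explicitly on the bytes that were read, so the seek offset never has to be interpreted by a text layer.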
> The files I'm working with use a nonstandard end-of-string character in
> the same fashion as C null-terminated strings. Is there a builtin
> function that will read a file "from seek position until seeing EOS
> character X"? The methods I see for this online seem to amount to
> reading one character at a time and checking manually, which seems
> nonoptimal to me.
How do you think such a built-in function would work, if not inspect each
character until the EOS character is seen? :-)
There is no such built-in function though. By default, Python files are
buffered, so it won't literally read one character from disk at a time.
The actual disk IO will read a bunch of bytes into a memory buffer, and
then read from the buffer.
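One way to avoid a `read(1)` call per character is to read in chunks and search each chunk with `bytes.find()`. The helper `read_until` below is a hypothetical name, not a built-in, and it assumes the terminator is a single byte, so it can never straddle a chunk boundary.

```python
import io

def read_until(f, terminator, chunk_size=4096):
    """Read from the current position of binary file `f` up to, but not
    including, the single-byte `terminator`. The file position is left just
    past the terminator, or at end of file if no terminator was found."""
    start = f.tell()
    parts = []
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break  # end of file without a terminator: return what we have
        idx = chunk.find(terminator)
        if idx != -1:
            parts.append(chunk[:idx])
            consumed = sum(len(p) for p in parts) + len(terminator)
            f.seek(start + consumed)  # rewind past any over-read bytes
            break
        parts.append(chunk)
    return b"".join(parts)

# io.BytesIO stands in for a file opened with open(path, "rb").
rom = io.BytesIO(b"\x10\x11\x12\xff\x20\x21\xff")
rom.seek(1)
first = read_until(rom, b"\xff", chunk_size=2)
second = read_until(rom, b"\xff", chunk_size=2)
```

The tiny `chunk_size` in the demo is only there to exercise the multi-chunk path; the result is raw bytes, to be decoded afterwards with whatever table or codec applies.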
--
Steven
| From | Andrew <andrew@invalid.invalid> |
|---|---|
| Date | 2013-08-16 16:16 -0400 |
| Message-ID | <v7msphss82ld.xtldrmq8awtj$.dlg@40tude.net> |
| In reply to | #52604 |
On 16 Aug 2013 19:12:02 GMT, Steven D'Aprano wrote:
> If you try opening the file in text mode, you'll very likely break the
> binary parts (e.g. converting the two bytes 0x0D0A to a single byte
> 0x0A). So best to stick to binary only, extract the "text" portions of
> the file, then explicitly decode them.

Okay, I'll do that. Given what you said about seek() and text mode below, I have no choice anyway.

>> I would like to know how to correctly set up a new codec for
>> reading files that have nonstandard encodings.
>
> I suggest you look at the source code for the dozens of codecs in the
> standard library. E.g. /usr/local/lib/python3.3/encodings/palmos.py

I'll do that too. My thanks for the pointer.

>> How does seek() work on a file opened in text mode? Does it seek to a
>> character offset or to a byte offset? I need the latter behavior. If I
>> can't get it I will have to find a different approach.
>
> For text files, seek() is only legal for offsets that tell() can return,
> but this is not enforced, so you can get nasty rubbish like this:
>
> <snip evil>
>
> So I prefer not to seek in text files if I can help it.

If I'm understanding the above right, it seeks to a byte offset but the behavior is undocumented, not guaranteed, shouldn't be used, etc. That would actually work for me in theory (because I have exact byte offsets to work with) but I think I'll avoid it anyway, on the grounds that relying on undocumented behavior is bad.

>> The files I'm working with use a nonstandard end-of-string character in
>> the same fashion as C null-terminated strings. Is there a builtin
>> function that will read a file "from seek position until seeing EOS
>> character X"? The methods I see for this online seem to amount to
>> reading one character at a time and checking manually, which seems
>> nonoptimal to me.
>
> How do you think such a built-in function would work, if not inspect each
> character until the EOS character is seen? :-)

I don't know, but I'm assuming it wouldn't involve a function call to file.read(1) for each character, and that's what Google keeps handing me. Such an approach fills me with horror. :-)

I suppose there's nothing stopping me from reading some educated guess at the length of the string and then stepping through the result. Or I'll look at the readline() source and see how it does its thing.

> There is no such built-in function though. By default, Python files are
> buffered, so it won't literally read one character from disk at a time.
> The actual disk IO will read a bunch of bytes into a memory buffer, and
> then read from the buffer.

I'd guessed as much, but assumed there was still ridiculous function call overhead involved in the repeated read(1) method above. Of course, trying to avoid said overhead is premature optimization; my interest in doing so is more aesthetic than anything else.

Thanks for the help.

--
Andrew
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-08-16 23:14 +0100 |
| Message-ID | <mailman.7.1376691263.23369.python-list@python.org> |
| In reply to | #52590 |
On Fri, Aug 16, 2013 at 3:02 PM, Andrew <andrew@invalid.invalid> wrote:
> I have a mixed binary/text file[0], and the text portions use a radically
> nonstandard character set. I want to read them easily given information
> about the character encoding and an offset for the beginning of a string.

To add to all the information already given: Is the file small enough to comfortably fit into memory? If so, you'll find it a LOT easier to play with strings in RAM than files on disk. Even if not, you may find a lot of tasks simplified by just reading a kay or a meg in and then working within that. That spares you the fiddliness of read(1) all the time, at the expense of potentially reading more than you need.

ChrisA
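A rough sketch of the in-memory approach suggested here, assuming the whole file has already been read into a `bytes` object (for a real file, something like `open(path, "rb").read()`). The helper `string_at` and the end-of-string value `0xFF` are hypothetical, chosen only for the demonstration.

```python
def string_at(data, offset, eos=0xFF):
    """Return the raw bytes from `offset` up to, but not including, the next
    `eos` byte. If no `eos` byte follows, return everything to the end."""
    end = data.find(bytes([eos]), offset)
    if end == -1:
        end = len(data)
    return data[offset:end]

# Stand-in for the contents of a file read fully into memory.
rom = bytes([0x10, 0x00, 0x01, 0x02, 0xFF, 0x30])
raw = string_at(rom, 1)
```

With the data in memory, `bytes.find()` does the scan for the terminator in a single call, and slicing by byte offset replaces all seeking.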