| From | Stefan Behnel <stefan_ml@behnel.de> |
|---|---|
| Subject | Re: catch UnicodeDecodeError |
| Date | 2012-07-26 10:28 +0200 |
| References | <04f7ff8d-9881-4a04-ab2e-b5573b5f3cd1@googlegroups.com> <mailman.2570.1343216119.4697.python-list@python.org> <b8723e64-12fa-4e53-8914-8f2b8e9c0f1d@googlegroups.com> <mailman.2581.1343242258.4697.python-list@python.org> <38f5cdaf-c021-4ccd-8fcb-c68b21d3aeb2@w24g2000vby.googlegroups.com> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.2593.1343291337.4697.python-list@python.org> (permalink) |
Jaroslav Dobrek, 26.07.2012 09:46:
> My problem is solved. What I need to do is explicitly decode text when
> reading it. Then I can catch exceptions. I might do this in future
> programs.
Yes, that's the standard procedure. Decode on the way in, encode on the way
out, use Unicode everywhere in between.
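As a minimal sketch of that rule (the sample bytes and variable names here are made up for illustration, not taken from your program):

```python
# Hypothetical raw input: UTF-8 bytes as they might arrive from a file or socket.
raw = b"caf\xc3\xa9\n"

text = raw.decode("utf-8")       # decode on the way in
result = text.strip().upper()    # work on Unicode text in between
out = result.encode("utf-8")     # encode on the way out
```

Only the first and last step know about bytes; everything in between deals with text.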
> I dislike about this solution that it complicates most programs
> unnecessarily. In programs that open, read and process many files I
> don't want to explicitly decode and encode characters all the time. I
> just want to write:
>
> for line in f:
And the cool thing is: you can! :)
In Python 2.6 and later, the new Py3 open() function is a bit more hidden,
but it's still available:

    from io import open

    filename = "somefile.txt"
    try:
        with open(filename, encoding="utf-8") as f:
            for line in f:
                process_line(line)  # actually, I'd use "process_file(f)"
    except IOError as e:
        print("Reading file %s failed: %s" % (filename, e))
    except UnicodeDecodeError as e:
        print("Some error occurred decoding file %s: %s" % (filename, e))

Ok, maybe with a better way to handle the errors than "print" ...
For older Python versions, you'd use "codecs.open()" instead. That's a bit
messy, but only because it was finally cleaned up for Python 3.
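A rough sketch of the codecs.open() variant, assuming a small UTF-8 sample file that the snippet writes itself (the path and contents are invented for the example):

```python
import codecs
import os
import tempfile

# Write a small UTF-8 sample file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with codecs.open(path, "w", encoding="utf-8") as f:
    f.write(u"gr\xfc\xdfe\n")

# codecs.open() returns a wrapper that decodes while reading,
# so iterating over it yields Unicode strings.
with codecs.open(path, "r", encoding="utf-8") as f:
    lines = [line for line in f]
```

Where io.open() is available, I would prefer it over this.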
> or something like that. Yet, writing this means to *implicitly* decode
> text. And, because the decoding is implicit, you cannot say
>
> try:
>     for line in f: # here text is decoded implicitly
>         do_something()
> except UnicodeDecodeError():
>     do_something_different()
>
> This isn't possible for syntactic reasons.
Well, you'd normally want to leave out the parentheses after the exception
type, but otherwise, that's perfectly valid Python code. That's how these
things work.
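To illustrate, here is a small sketch that catches the error raised during implicit decoding. The sample file and the names "seen" and "failed" are assumptions made up for the example:

```python
from io import open
import os
import tempfile

# Sample file containing one byte (0xff) that is not valid UTF-8.
path = os.path.join(tempfile.mkdtemp(), "bad.txt")
with open(path, "wb") as f:
    f.write(b"good line\nbad \xff line\n")

seen = []
failed = False
try:
    with open(path, encoding="utf-8") as f:
        for line in f:           # text is decoded implicitly here
            seen.append(line)
except UnicodeDecodeError:       # the exception type, without parentheses
    failed = True
```

How many lines arrive before the exception is raised depends on buffering, so don't rely on "seen" being complete up to the bad line.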
> The problem is that vast majority of the thousands of files that I
> process are correctly encoded. But then, suddenly, there is a bad
> character in a new file. (This is so because most files today are
> generated by people who don't know that there is such a thing as
> encodings.) And then I need to rewrite my very complex program just
> because of one single character in one single file.
Why would that be the case? The places to change should be very local in
your code.
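One way this could look, as a sketch only: keep the opening and decoding in a single helper, and handle the failure once in the loop over files. The names read_lines() and process_files() are hypothetical, and skipping a bad file is just one possible policy:

```python
from io import open
import os
import tempfile

def read_lines(filename, encoding="utf-8"):
    """Hypothetical helper: the only place that opens and decodes files."""
    with open(filename, encoding=encoding) as f:
        for line in f:
            yield line

def process_files(filenames):
    """Process each file; return the names of files that failed to decode."""
    undecodable = []
    for name in filenames:
        try:
            for line in read_lines(name):
                pass  # real per-line processing would go here
        except UnicodeDecodeError:
            undecodable.append(name)  # skip this file, keep going
    return undecodable

# Small demonstration with one clean and one broken sample file.
tmpdir = tempfile.mkdtemp()
good_path = os.path.join(tmpdir, "good.txt")
bad_path = os.path.join(tmpdir, "bad.txt")
with open(good_path, "wb") as f:
    f.write(b"all fine\n")
with open(bad_path, "wb") as f:
    f.write(b"broken \xff byte\n")

failed = process_files([good_path, bad_path])
```

The rest of the program only ever sees decoded lines, so a bad character in one file touches these two functions and nothing else.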
Stefan