Re: catch UnicodeDecodeError

From	Stefan Behnel <stefan_ml@behnel.de>
Subject	Re: catch UnicodeDecodeError
Date	2012-07-26 10:28 +0200
References	<04f7ff8d-9881-4a04-ab2e-b5573b5f3cd1@googlegroups.com> <mailman.2570.1343216119.4697.python-list@python.org> <b8723e64-12fa-4e53-8914-8f2b8e9c0f1d@googlegroups.com> <mailman.2581.1343242258.4697.python-list@python.org> <38f5cdaf-c021-4ccd-8fcb-c68b21d3aeb2@w24g2000vby.googlegroups.com>
Newsgroups	comp.lang.python
Message-ID	<mailman.2593.1343291337.4697.python-list@python.org> (permalink)

Show all headers | View raw

Jaroslav Dobrek, 26.07.2012 09:46:
> My problem is solved. What I need to do is explicitly decode text when
> reading it. Then I can catch exceptions. I might do this in future
> programs.

Yes, that's the standard procedure. Decode on the way in, encode on the way
out, use Unicode everywhere in between.


> I dislike about this solution that it complicates most programs
> unnecessarily. In programs that open, read and process many files I
> don't want to explicitly decode and encode characters all the time. I
> just want to write:
> 
> for line in f:

And the cool thing is: you can! :)

In Python 2.6 and later, the new Py3 open() function is a bit more hidden,
but it's still available:

    from io import open

    filename = "somefile.txt"
    try:
        with open(filename, encoding="utf-8") as f:
            for line in f:
                process_line(line)  # actually, I'd use "process_file(f)"
    except IOError, e:
        print("Reading file %s failed: %s" % (filename, e))
    except UnicodeDecodeError, e:
        print("Some error occurred decoding file %s: %s" % (filename, e))


Ok, maybe with a better way to handle the errors than "print" ...

For older Python versions, you'd use "codecs.open()" instead. That's a bit
messy, but only because it was finally cleaned up for Python 3.


> or something like that. Yet, writing this means to *implicitly* decode
> text. And, because the decoding is implicit, you cannot say
> 
> try:
>     for line in f: # here text is decoded implicitly
>        do_something()
> except UnicodeDecodeError():
>     do_something_different()
> 
> This isn't possible for syntactic reasons.

Well, you'd normally want to leave out the parentheses after the exception
type, but otherwise, that's perfectly valid Python code. That's how these
things work.


> The problem is that vast majority of the thousands of files that I
> process are correctly encoded. But then, suddenly, there is a bad
> character in a new file. (This is so because most files today are
> generated by people who don't know that there is such a thing as
> encodings.) And then I need to rewrite my very complex program just
> because of one single character in one single file.

Why would that be the case? The places to change should be very local in
your code.

Stefan

Thread

catch UnicodeDecodeError jaroslav.dobrek@gmail.com - 2012-07-25 04:05 -0700
  Re: catch UnicodeDecodeError Andrew Berg <bahamutzero8825@gmail.com> - 2012-07-25 06:34 -0500
  Re: catch UnicodeDecodeError Philipp Hagemeister <phihag@phihag.de> - 2012-07-25 13:35 +0200
    Re: catch UnicodeDecodeError jaroslav.dobrek@gmail.com - 2012-07-25 05:09 -0700
    Re: catch UnicodeDecodeError jaroslav.dobrek@gmail.com - 2012-07-25 05:09 -0700
      Re: catch UnicodeDecodeError Dave Angel <d@davea.name> - 2012-07-25 14:50 -0400
        Re: catch UnicodeDecodeError Jaroslav Dobrek <jaroslav.dobrek@gmail.com> - 2012-07-26 00:46 -0700
          Re: catch UnicodeDecodeError Stefan Behnel <stefan_ml@behnel.de> - 2012-07-26 10:28 +0200
            Re: catch UnicodeDecodeError Jaroslav Dobrek <jaroslav.dobrek@gmail.com> - 2012-07-26 03:51 -0700
              Re: catch UnicodeDecodeError Stefan Behnel <stefan_ml@behnel.de> - 2012-07-26 13:15 +0200
                Re: catch UnicodeDecodeError jaroslav.dobrek@gmail.com - 2012-07-26 04:58 -0700
                Re: catch UnicodeDecodeError jaroslav.dobrek@gmail.com - 2012-07-26 04:58 -0700
              Re: catch UnicodeDecodeError Philipp Hagemeister <phihag@phihag.de> - 2012-07-26 14:17 +0200
              Re: catch UnicodeDecodeError Stefan Behnel <stefan_ml@behnel.de> - 2012-07-26 14:24 +0200
          Re: catch UnicodeDecodeError Chris Angelico <rosuav@gmail.com> - 2012-07-26 19:46 +1000
          Re: catch UnicodeDecodeError wxjmfauth@gmail.com - 2012-07-26 03:19 -0700
      Re: catch UnicodeDecodeError Philipp Hagemeister <phihag@phihag.de> - 2012-07-26 14:43 +0200

csiph-web