
Re: python-list@python.org

Started by	Florian Lindner <mailinglists@xgm.de>
First post	2014-01-15 02:25 +0100
Last post	2014-01-16 11:52 +1100
Articles	3 — 3 participants


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: python-list@python.org Florian Lindner <mailinglists@xgm.de> - 2014-01-15 02:25 +0100
    Re: python-list@python.org Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-16 00:38 +0000
      Re: python-list@python.org Ben Finney <ben+python@benfinney.id.au> - 2014-01-16 11:52 +1100

#63954 — Re: python-list@python.org

From	Florian Lindner <mailinglists@xgm.de>
Date	2014-01-15 02:25 +0100
Subject	Re: python-list@python.org
Message-ID	<mailman.5488.1389749137.18130.python-list@python.org>

Am Dienstag, 14. Januar 2014, 17:00:48 schrieb MRAB:
> On 2014-01-14 16:37, Florian Lindner wrote:
> > Hello!
> >
> > I'm using python 3.2.3 on debian wheezy. My script is called from my mail delivery agent (MDA) maildrop (like procmail) through its xfilter directive.
> >
> > Script works fine when used interactively, e.g. ./script.py < testmail but when called from maildrop it's producing an infamous UnicodeDecodeError:
> >
> > File "/home/flindner/flofify.py", line 171, in main
> >       mail = sys.stdin.read()
> > File "/usr/lib/python3.2/encodings/ascii.py", line 26, in decode
> >       return codecs.ascii_decode(input, self.errors)[0]
> >
> > The exception is always something like
> >
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0x82 in position 869: ordinal not in range(128)
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1176: ordinal not in range(128)
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0x8c in position 846: ordinal not in range(128)
> >
> > I read mail from stdin "mail = sys.stdin.read()"
> >
> > Environment when called is:
> >
> > locale.getpreferredencoding(): ANSI_X3.4-1968
> > environ["LANG"]: C
> >
> > System environment when using shell is:
> >
> > ~ % echo $LANG
> > en_US.UTF-8
> >
> > As far as I know, when reading from stdin I don't need a decode(...) call, since stdin has a decoding. I also tried some decoding/encoding stuff, but that changed nothing.
> >
> > Any ideas to help me?
> >
> When run from maildrop it thinks that the encoding of stdin is ASCII.

Well, true. But what encoding does maildrop actually give me? It obviously does not inherit LANG, or it is called from the MTA that way. I also tried:

        inData = codecs.getreader('utf-8')(sys.stdin)
        mail = inData.read()

Failed also. But I'm not exactly an encoding expert.

Regards,
Florian
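
A possible reason the codecs.getreader attempt fails, sketched under the assumption that the script runs on Python 3: sys.stdin is already a text stream there, so the reader asks it for text and the ASCII decoding error is raised before the UTF-8 reader sees any bytes. Wrapping the underlying binary stream, sys.stdin.buffer, avoids that. The helper name read_utf8 is made up for illustration, and errors="replace" is an assumption, not something from the post.

```python
import codecs
import io


def read_utf8(binary_stream, errors="replace"):
    """Decode everything from a *binary* stream as UTF-8.

    read_utf8 is a hypothetical helper; errors="replace" is an assumed
    policy that turns undecodable bytes into U+FFFD instead of raising.
    """
    reader = codecs.getreader("utf-8")(binary_stream, errors=errors)
    return reader.read()


# In the mail filter this would be called on the raw byte stream:
#     mail = read_utf8(sys.stdin.buffer)
```

Passing sys.stdin itself rather than sys.stdin.buffer would reproduce the original failure, because the text layer decodes with the locale's encoding first.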


#64028

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-01-16 00:38 +0000
Message-ID	<52d729fd$0$29970$c3e8da3$5496439d@news.astraweb.com>
In reply to	#63954

On Wed, 15 Jan 2014 02:25:34 +0100, Florian Lindner wrote:

> Am Dienstag, 14. Januar 2014, 17:00:48 schrieb MRAB:
>> On 2014-01-14 16:37, Florian Lindner wrote:
>> > Hello!
>> >
>> > I'm using python 3.2.3 on debian wheezy. My script is called from my
>> > mail delivery agent (MDA) maildrop (like procmail) through its
>> > xfilter directive.
>> >
>> > Script works fine when used interactively, e.g. ./script.py <
>> > testmail but when called from maildrop it's producing an infamous
>> > UnicodeDecodeError:

What's maildrop? When using third party libraries, it's often helpful to 
give some detail on what they are and where they are from.

>> > File "/home/flindner/flofify.py", line 171, in main
>> >       mail = sys.stdin.read()

What's the value of sys.stdin? If you call this from your script:

print(sys.stdin)

what do you get? Is it possible that the mysterious maildrop is messing 
stdin up?
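
A slightly fuller diagnostic than a bare print(sys.stdin), as a minimal sketch. The helper name describe_stream is made up for illustration, and sending the report to stderr is an assumption: when the script runs as a filter, its stdout is presumably consumed by the MDA.

```python
import io
import locale
import sys


def describe_stream(stream):
    """Collect the settings that decide how a text stream decodes bytes.

    describe_stream is a hypothetical helper, not part of any library.
    """
    return {
        "type": type(stream).__name__,
        "encoding": getattr(stream, "encoding", None),
        "errors": getattr(stream, "errors", None),
        "preferred": locale.getpreferredencoding(False),
    }


# Inside the filter script, report on stderr so stdout stays untouched:
#     print(describe_stream(sys.stdin), file=sys.stderr)
```

Comparing the report from an interactive shell with the one produced under maildrop should show whether only the locale differs or whether stdin itself is a different kind of object.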

>> > File "/usr/lib/python3.2/encodings/ascii.py", line 26, in decode
>> >       return codecs.ascii_decode(input, self.errors)[0]
>> >
>> > The exception is always something like
>> >
>> > UnicodeDecodeError: 'ascii' codec can't decode byte 0x82 in position
>> > 869: ordinal not in range(128) 

That makes perfect sense: byte 0x82 is not in the ASCII range. ASCII is 
limited to byte values 0 through 127, and 0x82 is hex for 130. So the 
error message is telling you *exactly* what the problem is: your email 
contains a non-ASCII byte, with value 0x82.
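
The failure is easy to reproduce outside the mail pipeline. A minimal sketch, with a made-up helper name (try_ascii) and a made-up sample byte string:

```python
def try_ascii(data):
    """Return the decoded text, or a short description of the failure.

    try_ascii is a hypothetical helper used only for this illustration.
    """
    try:
        return data.decode("ascii")
    except UnicodeDecodeError as exc:
        return "{} at position {}".format(exc.reason, exc.start)


assert 0x82 == 130           # outside ASCII's 0..127 range
print(try_ascii(b"abc"))     # plain ASCII decodes fine
print(try_ascii(b"abc\x82"))  # reports the offending position
```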

How can you deal with this?

(1) "Oh gods, I can't deal with this, I wish the whole world was America 
in 1965 (except even back then, there were English characters in common 
use that can't be represented in ASCII)! I'm going to just drop anything 
that isn't ASCII and hope it doesn't mangle the message *too* badly!"

You need to set the error handler to 'ignore'. How you do that may depend 
on whether or not maildrop is monkeypatching stdin.

(2) "Likewise, but instead of dropping the offending bytes, I'll replace 
them with something that makes it obvious that an error has occurred."

Set the error handler to "replace". You'll still mangle the email, but it 
will be more obvious that you mangled it.

(3) "ASCII? Why am I trying to read email as ASCII? That's not right. 
Email can contain arbitrary bytes, and is not limited to pure ASCII. I 
need to work out which encoding the email is using, but even that is not 
enough, since emails sometimes contain the wrong encoding information or 
invalid bytes. Especially spam, that's particularly poor. (What a 
surprise, that spammers don't bother to spend the time to get their code 
right?) Hmmm... maybe I ought to use an email library that actually gets 
these issues *right*?"
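
The three options can be compared on a small sample. This is only a sketch: the sample bytes are made up, decode_lossy is a hypothetical helper, and the third call assumes the bytes really are UTF-8, which a real email does not guarantee.

```python
def decode_lossy(raw, errors):
    """Decode as ASCII with the given error handler ('ignore' or 'replace')."""
    return raw.decode("ascii", errors=errors)


sample = b"na\xc3\xafve"  # a word with one non-ASCII letter, encoded as UTF-8

dropped = decode_lossy(sample, "ignore")    # option 1: bad bytes vanish
marked = decode_lossy(sample, "replace")    # option 2: bad bytes become U+FFFD
intact = sample.decode("utf-8")             # option 3: use the right codec
```

Options 1 and 2 both lose information; they differ only in whether the loss is visible afterwards.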

What does the maildrop documentation say about encodings and/or malformed 
email?

>> > I read mail from stdin "mail = sys.stdin.read()"
>> >
>> > Environment when called is:
>> >
>> > locale.getpreferredencoding(): ANSI_X3.4-1968
>> > environ["LANG"]: C

For a modern Linux system to be using the C encoding is not a good sign. 
It's not 1970 anymore. I would expect it should be using UTF-8. But I 
don't think that's relevant to your problem (although a mis-configured 
system may make it worse).
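
For readers puzzled by the name ANSI_X3.4-1968: it is simply another name for ASCII, which is why the traceback mentions the 'ascii' codec. A short check, with a made-up helper name (canonical_codec):

```python
import codecs


def canonical_codec(name):
    """Return Python's canonical name for an encoding label."""
    return codecs.lookup(name).name


print(canonical_codec("ANSI_X3.4-1968"))
```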

>> > System environment when using shell is:
>> >
>> > ~ % echo $LANG
>> > en_US.UTF-8

That's looking more promising.

>> > As far as I know when reading from stdin I don't need a decode(...)
>> > call, since stdin has a decoding. 

That depends on what stdin actually is. Please print it and show us.

Also, can you do a visual inspection of the email that is failing? If 
it's spam, perhaps you can just drop it from the queue and deal with this 
issue later.

>> > I also tried some decoding/encoding
>> > stuff but that changed nothing.

Ah, but did you try the right stuff? (Randomly perturbing your code in 
the hope that the error will go away is not a winning strategy.)

>> > Any ideas to help me?
>> >
>> When run from maildrop it thinks that the encoding of stdin is ASCII.
> 
> Well, true. But what encoding does maildrop actually give me? It
> obviously does not inherit LANG or is called from the MTA that way.

Who knows? What's maildrop? What does its documentation say about 
encodings? The fact that it is using ASCII apparently by default does not 
give me confidence that it knows how to deal with 8-bit emails, but I 
might be completely wrong.

> I also tried:
> 
>         inData = codecs.getreader('utf-8')(sys.stdin) 
>         mail = inData.read()
> 
> Failed also. But I'm not exactly an encoding expert.

Failed how? Please copy and paste your exact exception traceback, in full.

Ultimately, dealing with email is a hard problem. So long as you only 
receive 7-bit ASCII mail, you don't realise how hard it is. But the 
people who write the mail libraries -- at least the good ones -- know 
just how hard it really is. You can have 8-bit emails with no encoding 
set, or the wrong encoding, or the right encoding but the contents then 
includes invalid bytes. It's not just spammers who get it wrong, 
legitimate programmers sending email also screw up.

Email is worse than the 90/10 rule. 90% of the effort is needed to deal 
with 1% of the emails. (More if you have a really bad spam problem.) You 
should look at a good email library, like the one in the std lib which I 
believe gets most of these issues right.
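
A minimal sketch of what using the standard library's email package could look like here, assuming the script reads the raw message as bytes (for example from sys.stdin.buffer). The helper name first_text_body, the UTF-8 fallback, and the choice of errors="replace" are all assumptions for illustration, not anything the thread prescribes.

```python
import email


def first_text_body(raw_bytes, fallback_charset="utf-8"):
    """Return the first text/plain part of a message as str.

    first_text_body is a hypothetical helper. The charset declared in
    the message is tried first; fallback_charset is an assumed default
    for messages that declare none or declare an unknown one.
    """
    msg = email.message_from_bytes(raw_bytes)
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)  # bytes, transfer encoding undone
        if payload is None:
            continue
        charset = part.get_content_charset() or fallback_charset
        try:
            return payload.decode(charset, errors="replace")
        except LookupError:  # the declared charset is not a known codec
            return payload.decode(fallback_charset, errors="replace")
    return ""


# In the filter script:
#     body = first_text_body(sys.stdin.buffer.read())
```

The point of the design is that the message is never decoded as a whole: the parser works on bytes, and each part is decoded with its own declared charset.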

-- 
Steven


#64031

From	Ben Finney <ben+python@benfinney.id.au>
Date	2014-01-16 11:52 +1100
Message-ID	<mailman.5553.1389833576.18130.python-list@python.org>
In reply to	#64028

Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:

> On Wed, 15 Jan 2014 02:25:34 +0100, Florian Lindner wrote:
> >> On 2014-01-14 16:37, Florian Lindner wrote:
> >> > I'm using python 3.2.3 on debian wheezy. My script is called from
> >> > my mail delivery agent (MDA) maildrop (like procmail) through
> >> > its xfilter directive.
> >> >
> >> > Script works fine when used interactively, e.g. ./script.py <
> >> > testmail but when called from maildrop it's producing an infamous
> >> > UnicodeDecodeError:
>
> What's maildrop? When using third party libraries, it's often helpful to 
> give some detail on what they are and where they are from.

It's not a library; as he says, it's an MDA program. It is from the
Courier mail application <URL:http://www.courier-mta.org/maildrop/>.

From that, I understand Florian to be saying his Python program is
invoked via command-line from some configuration directive for Maildrop.

> What does the maildrop documentation say about encodings and/or
> malformed email?

I think this is the more likely line of enquiry to diagnose the problem.

> For a modern Linux system to be using the C encoding is not a good
> sign.

That's true, but it's likely a configuration problem: the encoding needs
to be set *and* obeyed at an administrative and user-profile level.

> It's not 1970 anymore. I would expect it should be using UTF-8. But I 
> don't think that's relevant to your problem (although a mis-configured 
> system may make it worse).

Since the MDA usually runs not as a system service, but rather at a
user-specific level, I would expect some interaction of the host locale
and the user-specific locale is the problem.

> Who knows? What's maildrop? What does its documentation say about 
> encodings?

I hope the original poster enjoys manpages, since that's how the program
is documented <URL:http://www.courier-mta.org/maildrop/documentation.html>.

> The fact that it is using ASCII apparently by default does not give me
> confidence that it knows how to deal with 8-bit emails, but I might be
> completely wrong.

I've found that the problem is often that Python is the party assuming
that stdin and stdout are ASCII, largely because it hasn't been told
otherwise.

-- 
 \        “The greatest tragedy in mankind's entire history may be the |
  `\       hijacking of morality by religion.” —Arthur C. Clarke, 1991 |
_o__)                                                                  |
Ben Finney
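
One way of telling Python what encoding to use, as a minimal sketch: re-wrap the binary stream underneath stdin with an explicit encoding. The helper name text_view is made up, and UTF-8 with errors="replace" is an assumption about what suits the incoming mail. Setting the PYTHONIOENCODING environment variable for the invoked command is another route, if the maildrop configuration allows it.

```python
import io


def text_view(binary_stream, encoding="utf-8", errors="replace"):
    """Wrap a binary stream in a text layer with an explicit encoding.

    text_view is a hypothetical helper; the defaults are assumptions.
    """
    return io.TextIOWrapper(binary_stream, encoding=encoding, errors=errors)


# Early in the filter script, before anything reads from stdin:
#     sys.stdin = text_view(sys.stdin.buffer)
#     mail = sys.stdin.read()
```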

