Groups > comp.lang.python > #196919 > unrolled thread

Printing UTF-8 mail to terminal

Started by	"Loris Bennett" <loris.bennett@fu-berlin.de>
First post	2024-10-31 16:33 +0100
Last post	2024-11-02 08:44 +1100
Articles	17 — 7 participants


  Printing UTF-8 mail to terminal "Loris Bennett" <loris.bennett@fu-berlin.de> - 2024-10-31 16:33 +0100
    Re: Printing UTF-8 mail to terminal Left Right <olegsivokon@gmail.com> - 2024-10-31 17:38 +0100
      Re: Printing UTF-8 mail to terminal "Loris Bennett" <loris.bennett@fu-berlin.de> - 2024-11-01 07:52 +0100
        Re: Printing UTF-8 mail to terminal Inada Naoki <songofacandy@gmail.com> - 2024-11-03 12:08 +0900
          Re: Printing UTF-8 mail to terminal "Loris Bennett" <loris.bennett@fu-berlin.de> - 2024-11-04 11:48 +0100
    Re: Printing UTF-8 mail to terminal (Posting On Python-List Prohibited) Lawrence D'Oliveiro <ldo@nz.invalid> - 2024-10-31 19:35 +0000
    Re: Printing UTF-8 mail to terminal Cameron Simpson <cs@cskk.id.au> - 2024-11-01 07:50 +1100
      Re: Printing UTF-8 mail to terminal "Loris Bennett" <loris.bennett@fu-berlin.de> - 2024-11-01 08:11 +0100
        Re: Printing UTF-8 mail to terminal "Loris Bennett" <loris.bennett@fu-berlin.de> - 2024-11-01 10:10 +0100
          Re: Printing UTF-8 mail to terminal dieter.maurer@online.de - 2024-11-01 17:38 +0100
          Re: Printing UTF-8 mail to terminal Cameron Simpson <cs@cskk.id.au> - 2024-11-02 08:47 +1100
            Re: Printing UTF-8 mail to terminal "Loris Bennett" <loris.bennett@fu-berlin.de> - 2024-11-04 11:44 +0100
              Re: Printing UTF-8 mail to terminal "Loris Bennett" <loris.bennett@fu-berlin.de> - 2024-11-04 11:57 +0100
                Re: Printing UTF-8 mail to terminal "Loris Bennett" <loris.bennett@fu-berlin.de> - 2024-11-04 13:02 +0100
                  Re: Printing UTF-8 mail to terminal "Peter J. Holzer" <hjp-python@hjp.at> - 2024-11-05 21:39 +0100
                  Re: Printing UTF-8 mail to terminal Cameron Simpson <cs@cskk.id.au> - 2024-11-06 08:20 +1100
        Re: Printing UTF-8 mail to terminal Cameron Simpson <cs@cskk.id.au> - 2024-11-02 08:44 +1100

#196919 — Printing UTF-8 mail to terminal

From	"Loris Bennett" <loris.bennett@fu-berlin.de>
Date	2024-10-31 16:33 +0100
Subject	Printing UTF-8 mail to terminal
Message-ID	<878qu49tii.fsf@zedat.fu-berlin.de>

Hi,

I have a command-line program which creates an email containing German
umlauts.  On receiving the mail, my mail client displays the subject and
body correctly:

  Subject: Übung

  Sehr geehrter Herr Dr. Bennett,

  Dies ist eine Übung.

So far, so good.  However, when I use the --verbose option to print
the mail to the terminal via

  if args.verbose:
      print(mail)

I get:

  Subject: Übungsbetreff

  Sehr geehrter Herr Dr. Bennett,

  Dies ist eine =C3=9Cbung.

What do I need to do to prevent the body from getting mangled?

I seem to remember that I had issues in the past with a Perl version of
a similar program.  As far as I recall there was an issue with the fact
that the greeting is generated by querying a server, whereas the body is
being read from a file, which led to oddities when the two bits were
concatenated.  But that might just have been a Perl thing. 

Cheers,

Loris

-- 
This signature is currently under constuction.


#196922

From	Left Right <olegsivokon@gmail.com>
Date	2024-10-31 17:38 +0100
Message-ID	<mailman.61.1730392745.4695.python-list@python.org>
In reply to	#196919

There's quite a lot of misuse of terminology around terminal / console
/ shell.  Please correct me if I'm wrong, but it looks like you are
printing that on MS Windows, right?  MS Windows doesn't have or use
terminals (that's more of a Unix-related concept). And by "terminal"
I mean terminal emulator (i.e. a program that emulates the behavior of
a physical terminal). You can, of course, find some terminal programs
for Windows (e.g. mintty), but I doubt that that's what you are dealing
with.

What MS Windows users usually end up using is the console.  If you
run, e.g., cmd.exe, it will create a process that displays a graphical
console.  The console uses an encoding scheme to represent the text
output.  I believe that the default on MS Windows is to use some
single-byte encoding. This answer from a Stack Exchange site tells you
how to set the console encoding to UTF-8 permanently:
https://superuser.com/questions/269818/change-default-code-page-of-windows-console-to-utf-8
which, I believe, will solve your problem with how the text is
displayed.



#196928

From	"Loris Bennett" <loris.bennett@fu-berlin.de>
Date	2024-11-01 07:52 +0100
Message-ID	<87v7x7o37z.fsf@zedat.fu-berlin.de>
In reply to	#196922

Left Right <olegsivokon@gmail.com> writes:

> There's quite a lot of misuse of terminology around terminal / console
> / shell.  Please correct me if I'm wrong, but it looks like you are
> printing that on MS Windows, right?  MS Windows doesn't have or use
> terminals (that's more of a Unix-related concept). And by "terminal"
> I mean terminal emulator (i.e. a program that emulates the behavior of
> a physical terminal). You can, of course, find some terminal programs
> for Windows (e.g. mintty), but I doubt that that's what you are dealing
> with.
>
> What MS Windows users usually end up using is the console.  If you
> run, e.g., cmd.exe, it will create a process that displays a graphical
> console.  The console uses an encoding scheme to represent the text
> output.  I believe that the default on MS Windows is to use some
> single-byte encoding. This answer from a Stack Exchange site tells you
> how to set the console encoding to UTF-8 permanently:
> https://superuser.com/questions/269818/change-default-code-page-of-windows-console-to-utf-8
> which, I believe, will solve your problem with how the text is
> displayed.

I'm not using MS Windows.  I am using a Gnome terminal on Debian 12
locally and connecting via SSH to an AlmaLinux 8 server, where I start a
tmux session. 

-- 
Dr. Loris Bennett (Herr/Mr)
FUB-IT, Freie Universität Berlin


#196948

From	Inada Naoki <songofacandy@gmail.com>
Date	2024-11-03 12:08 +0900
Message-ID	<mailman.75.1730603335.4695.python-list@python.org>
In reply to	#196928

Try the PYTHONUTF8=1 environment variable.



#196951

From	"Loris Bennett" <loris.bennett@fu-berlin.de>
Date	2024-11-04 11:48 +0100
Message-ID	<87a5efmg0g.fsf@zedat.fu-berlin.de>
In reply to	#196948

Inada Naoki <songofacandy@gmail.com> writes:

> Try the PYTHONUTF8=1 environment variable.
>

This does not seem to affect the way the email body is printed.
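
A possible explanation, sketched under the assumption that the message
was built with cte="quoted-printable" as shown elsewhere in this thread:
PYTHONUTF8 changes the encodings Python uses for I/O such as stdout, but
the transfer-encoded payload already consists of plain ASCII characters,
so there is nothing left for that setting to change.  The message below
is a made-up stand-in, not the actual mail.

```python
from email.message import EmailMessage

# Hypothetical stand-in for the real mail.
mail = EmailMessage()
mail.set_content("Dies ist eine Übung.", cte="quoted-printable")

# get_payload() without decode=True returns the stored, transfer-encoded text.
payload = mail.get_payload()

# Every character is ASCII, so the terminal encoding cannot alter how it looks.
print(payload.isascii())
print(payload)
```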

Cheers,

Loris

-- 
This signature is currently under constuction.


#196924 — Re: Printing UTF-8 mail to terminal (Posting On Python-List Prohibited)

From	Lawrence D'Oliveiro <ldo@nz.invalid>
Date	2024-10-31 19:35 +0000
Subject	Re: Printing UTF-8 mail to terminal (Posting On Python-List Prohibited)
Message-ID	<vg0m6l$2qq89$2@dont-email.me>
In reply to	#196919

On Thu, 31 Oct 2024 16:33:41 +0100, Loris Bennett wrote:

>   Dies ist eine =C3=9Cbung.
> 
> What do I need to do to prevent the body from getting mangled?

I don’t think that’s actually getting mangled, that is how the actual 
message body looks. What you have there is called “quoted printable” 
encoding, and it’s a standard way to ensure the message body consists only 
of 7-bit ASCII.

If you look at the source of the message, you should see a header line 
like “Content-Transfer-Encoding: quoted-printable”. This is how your email 
client knows how to display the text properly.
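
As a rough illustration of the header mentioned above, the sketch below
builds a throwaway message (an assumption, not the original program) and
reads back the two pieces of metadata a mail client would need: the
transfer encoding and the character set.

```python
from email.message import EmailMessage

# Throwaway message used only for illustration.
mail = EmailMessage()
mail.set_content("Dies ist eine Übung.", cte="quoted-printable")

# The header that tells a client how the body was transfer-encoded.
cte = str(mail["Content-Transfer-Encoding"])

# The charset parameter of Content-Type tells it how to turn bytes into text.
charset = mail.get_content_charset()

print(cte, charset)
```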


#196925

From	Cameron Simpson <cs@cskk.id.au>
Date	2024-11-01 07:50 +1100
Message-ID	<mailman.63.1730408232.4695.python-list@python.org>
In reply to	#196919

On 31Oct2024 16:33, Loris Bennett <loris.bennett@fu-berlin.de> wrote:
>I have a command-line program which creates an email containing German
>umlauts.  On receiving the mail, my mail client displays the subject and
>body correctly:
[...]
>So far, so good.  However, when I use the --verbose option to print
>the mail to the terminal via
>
>  if args.verbose:
>      print(mail)
>
>I get:
>
>  Subject: Übungsbetreff
>
>  Sehr geehrter Herr Dr. Bennett,
>
>  Dies ist eine =C3=9Cbung.
>
>What do I need to do to prevent the body from getting mangled?

That looks to me like quoted-printable. This is an encoding for
transporting text that makes it robust against transports which are
not 8-bit clean.  So your Unicode text is encoded as UTF-8, and then
that is encoded in quoted-printable for transport through the email
system.

Your terminal probably accepts UTF-8 - I imagine other German text
renders correctly?

You need to get the text and undo the quoted-printable encoding.

If you're using the Python email module to parse (or construct) the 
message as a `Message` object I'd expect that to happen automatically.

If you're just dealing with this directly, use the `quopri` stdlib 
module: https://docs.python.org/3/library/quopri.html
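
A minimal sketch of the two encoding layers described above, using only
the `quopri` module; the sample sentence is taken from the original post
and the variable names are illustrative.

```python
import quopri

text = "Dies ist eine Übung."

# Layer 1: Unicode text to UTF-8 bytes.
utf8_bytes = text.encode("utf-8")

# Layer 2: UTF-8 bytes to 7-bit-safe quoted-printable bytes.
qp_bytes = quopri.encodestring(utf8_bytes)
print(qp_bytes)

# Undo both layers in reverse order to get the original string back.
round_trip = quopri.decodestring(qp_bytes).decode("utf-8")
print(round_trip)
```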

Cheers,
Cameron Simpson <cs@cskk.id.au>


#196929

From	"Loris Bennett" <loris.bennett@fu-berlin.de>
Date	2024-11-01 08:11 +0100
Message-ID	<87msijo2cd.fsf@zedat.fu-berlin.de>
In reply to	#196925

Cameron Simpson <cs@cskk.id.au> writes:

> On 31Oct2024 16:33, Loris Bennett <loris.bennett@fu-berlin.de> wrote:
>>I have a command-line program which creates an email containing German
>>umlauts.  On receiving the mail, my mail client displays the subject and
>>body correctly:
> [...]
>>So far, so good.  However, when I use the --verbose option to print
>>the mail to the terminal via
>>
>>  if args.verbose:
>>      print(mail)
>>
>>I get:
>>
>>  Subject: Übungsbetreff
>>
>>  Sehr geehrter Herr Dr. Bennett,
>>
>>  Dies ist eine =C3=9Cbung.
>>
>>What do I need to do to prevent the body from getting mangled?
>
> That looks to me like quoted-printable. This is an encoding for
> transporting text that makes it robust against transports which are
> not 8-bit clean.  So your Unicode text is encoded as UTF-8, and then
> that is encoded in quoted-printable for transport through the email
> system.

As I mentioned, I think the problem is to do with the way the salutation
text provided by the "salutation server" and the mail body from a file
are encoded.  This seems to be different.  

> Your terminal probably accepts UTF-8 - I imagine other German text
> renders correctly?

Yes, it does.

> You need to get the text and undo the quoted-printable encoding.
>
> If you're using the Python email module to parse (or construct) the
> message as a `Message` object I'd expect that to happen automatically.

I am using

  email.message.EmailMessage

as, from the Python documentation

  https://docs.python.org/3/library/email.examples.html

I gathered that that is the standard approach.

And you are right that encoding for the actual mail which is received is
automatically sorted out.  If I display the raw email in my client I get
the following:

  Content-Type: text/plain; charset="utf-8"
  Content-Transfer-Encoding: quoted-printable
  ...
  Subject: =?utf-8?q?=C3=9Cbungsbetreff?=
  ...
  Dies ist eine =C3=9Cbung.

I would interpret that as meaning that the subject and body are encoded
in the same way.

The problem just occurs with the unsent string representation printed to
the terminal.
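
One way to see why the printed subject looks fine while the printed body
does not is to compare `as_string()` with `str()`.  This is a sketch with
a stand-in message; as far as I can tell, `str()` on an `EmailMessage`
serializes with a `utf8=True` copy of the policy, which leaves headers
readable but does not touch the body's transfer encoding.

```python
from email.message import EmailMessage

# Stand-in message mirroring the raw mail quoted above.
mail = EmailMessage()
mail["Subject"] = "Übungsbetreff"
mail.set_content("Dies ist eine Übung.", cte="quoted-printable")

# Default policy: non-ASCII headers become RFC 2047 encoded words.
wire = mail.as_string()

# str() keeps the header as readable UTF-8 ...
shown = str(mail)

print(wire)
print(shown)
# ... but in both cases the body stays quoted-printable.
```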

Cheers,

Loris

-- 
This signature is currently under constuction.


#196930

From	"Loris Bennett" <loris.bennett@fu-berlin.de>
Date	2024-11-01 10:10 +0100
Message-ID	<875xp7nwus.fsf@zedat.fu-berlin.de>
In reply to	#196929

"Loris Bennett" <loris.bennett@fu-berlin.de> writes:

> The problem just occurs with the unsent string representation printed to
> the terminal.

If I log the body like this

  body = f"{salutation},\n\n{text}\n{signature}"
  logger.debug("body: " + body)
 
and look at the log file in my terminal I see 

  2024-11-01 09:59:12,318 - DEBUG - mailer:create_body - body: Sehr geehrter Herr Dr. Bennett,

  Dies ist eine Übung.
 
  ...

as expected.  The non-UTF-8 text occurs when I do

  mail = EmailMessage()
  mail.set_content(body, cte="quoted-printable")
  ...

  if args.verbose:   
      print(mail)

which is presumably also correct.

The question is: What conversion is necessary in order to print the
EmailMessage object to the terminal, such that the quoted-printable
parts are turned (back) into UTF-8?

Cheers,

Loris

-- 
This signature is currently under constuction.


#196933

From	dieter.maurer@online.de
Date	2024-11-01 17:38 +0100
Message-ID	<mailman.67.1730480556.4695.python-list@python.org>
In reply to	#196930

Loris Bennett wrote at 2024-11-1 10:10 +0100:
> ...
>  mail.set_content(body, cte="quoted-printable")

In the line above, you request the content to use
the "cte" (= "Content-Transfer-Encoding") "quoted-printable"
and consequently, the content is encoded with `quoted-printable`.
Maybe you do not need to pass `cte`?
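
To check what difference the `cte` argument makes, a small experiment
along these lines may help.  It is only a sketch: the body text is a
placeholder, and which encoding the library picks when `cte` is omitted
depends on the content, so the loop just reports it.

```python
from email.message import EmailMessage

body = "Dies ist eine Übung."  # placeholder text
results = {}

for cte in (None, "quoted-printable", "base64"):
    mail = EmailMessage()
    mail.set_content(body, cte=cte)  # cte=None lets the library choose
    chosen = str(mail["Content-Transfer-Encoding"])
    results[cte] = (chosen, mail.get_content())
    print(f"requested={cte!r} chosen={chosen!r}")

# Whatever the transfer encoding, get_content() hands back the same text.
```

The transfer encoding only affects the serialized form; the decoded
content is the same in each case.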


#196939

From	Cameron Simpson <cs@cskk.id.au>
Date	2024-11-02 08:47 +1100
Message-ID	<mailman.69.1730497664.4695.python-list@python.org>
In reply to	#196930

On 01Nov2024 10:10, Loris Bennett <loris.bennett@fu-berlin.de> wrote:
>as expected.  The non-UTF-8 text occurs when I do
>
>  mail = EmailMessage()
>  mail.set_content(body, cte="quoted-printable")
>  ...
>
>  if args.verbose:
>      print(mail)
>
>which is presumably also correct.
>
>The question is: What conversion is necessary in order to print the
>EmailMessage object to the terminal, such that the quoted-printable
>parts are turned (back) into UTF-8?

Do you still have access to `body` ? That would be the original message 
text? Otherwise maybe:

     print(mail.get_content())

The objective is to obtain the message body Unicode text (i.e. a regular 
Python string with the original text, unencoded). And to print that.
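
A short sketch of that suggestion, with a made-up message standing in
for the real one: `get_content()` returns the decoded text, while the
string form of the whole message still carries the transfer encoding.

```python
from email.message import EmailMessage

# Made-up message for illustration.
mail = EmailMessage()
mail["Subject"] = "Übung"
mail.set_content("Dies ist eine Übung.", cte="quoted-printable")

# Decoded body as an ordinary Python string (a trailing newline is added).
text = mail.get_content()
print(text)

# The serialized message keeps the quoted-printable body.
serialized = str(mail)
print(serialized)
```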


#196950

From	"Loris Bennett" <loris.bennett@fu-berlin.de>
Date	2024-11-04 11:44 +0100
Message-ID	<87ed3rmg7g.fsf@zedat.fu-berlin.de>
In reply to	#196939

Cameron Simpson <cs@cskk.id.au> writes:

> On 01Nov2024 10:10, Loris Bennett <loris.bennett@fu-berlin.de> wrote:
>>as expected.  The non-UTF-8 text occurs when I do
>>
>>  mail = EmailMessage()
>>  mail.set_content(body, cte="quoted-printable")
>>  ...
>>
>>  if args.verbose:
>>      print(mail)
>>
>>which is presumably also correct.
>>
>>The question is: What conversion is necessary in order to print the
>>EmailMessage object to the terminal, such that the quoted-printable
>>parts are turned (back) into UTF-8?
>
> Do you still have access to `body` ? That would be the original
> message text? Otherwise maybe:
>
>     print(mail.get_content())
>
> The objective is to obtain the message body Unicode text (i.e. a
> regular Python string with the original text, unencoded). And to print
> that.

With the following:

######################################################################

import email.message

m = email.message.EmailMessage()

m['Subject'] = 'Übung'

m.set_content('Dies ist eine Übung')
print('== cte: default == \n')
print(m)

print('-- full mail ---')
print(m)
print('-- just content--')
print(m.get_content())

m.set_content('Dies ist eine Übung', cte='quoted-printable')
print('== cte: quoted-printable ==\n')
print('-- full mail --')
print(m)
print('-- just content --')
print(m.get_content())

######################################################################

I get the following output:

######################################################################

== cte: default == 

Subject: Übung
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
MIME-Version: 1.0

RGllcyBpc3QgZWluZSDDnGJ1bmcK

-- full mail ---
Subject: Übung
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
MIME-Version: 1.0

RGllcyBpc3QgZWluZSDDnGJ1bmcK

-- just content--
Dies ist eine Übung

== cte: quoted-printable ==

-- full mail --
Subject: Übung
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

Dies ist eine =C3=9Cbung

-- just content --
Dies ist eine Übung

######################################################################

So in both cases the subject is fine, but it is unclear to me how to
print the body.  Or rather, I know how to print the body OK, but I don't
know how to print the headers separately - there seems to be nothing
like 'get_headers()'.  I can use get('Subject') etc. and reconstruct the
headers, but that seems a little clunky.  

Cheers,

Loris

-- 
This signature is currently under constuction.


#196952

From	"Loris Bennett" <loris.bennett@fu-berlin.de>
Date	2024-11-04 11:57 +0100
Message-ID	<875xp3mfku.fsf@zedat.fu-berlin.de>
In reply to	#196950

"Loris Bennett" <loris.bennett@fu-berlin.de> writes:

> So in both cases the subject is fine, but it is unclear to me how to
> print the body.  Or rather, I know how to print the body OK, but I don't
> know how to print the headers separately - there seems to be nothing
> like 'get_headers()'.  I can use get('Subject') etc. and reconstruct the
> headers, but that seems a little clunky.  

Sorry, I am confusing the terminology here.  The 'body' seems to be the
headers plus the 'content'.  So I can print the *content* without the
headers OK, but I can't easily print all the headers separately.  If I
just print the body, i.e. headers plus content, the umlauts in the
content are not resolved.

-- 
This signature is currently under constuction.


#196953

From	"Loris Bennett" <loris.bennett@fu-berlin.de>
Date	2024-11-04 13:02 +0100
Message-ID	<871pzrmcky.fsf@zedat.fu-berlin.de>
In reply to	#196952

"Loris Bennett" <loris.bennett@fu-berlin.de> writes:

> "Loris Bennett" <loris.bennett@fu-berlin.de> writes:
>
>> Cameron Simpson <cs@cskk.id.au> writes:
>>
>>> On 01Nov2024 10:10, Loris Bennett <loris.bennett@fu-berlin.de> wrote:
>>>>as expected.  The non-UTF-8 text occurs when I do
>>>>
>>>>  mail = EmailMessage()
>>>>  mail.set_content(body, cte="quoted-printable")
>>>>  ...
>>>>
>>>>  if args.verbose:
>>>>      print(mail)
>>>>
>>>>which is presumably also correct.
>>>>
>>>>The question is: What conversion is necessary in order to print the
>>>>EmailMessage object to the terminal, such that the quoted-printable
>>>>parts are turned (back) into UTF-8?
>>>
>>> Do you still have access to `body` ? That would be the original
>>> message text? Otherwise maybe:
>>>
>>>     print(mail.get_content())
>>>
>>> The objective is to obtain the message body Unicode text (i.e. a
>>> regular Python string with the original text, unencoded). And to print
>>> that.
>>
>> With the following:
>>
>> ######################################################################
>>
>> import email.message
>>
>> m = email.message.EmailMessage()
>>
>> m['Subject'] = 'Übung'
>>
>> m.set_content('Dies ist eine Übung')
>> print('== cte: default == \n')
>> print(m)
>>
>> print('-- full mail ---')
>> print(m)
>> print('-- just content--')
>> print(m.get_content())
>>
>> m.set_content('Dies ist eine Übung', cte='quoted-printable')
>> print('== cte: quoted-printable ==\n')
>> print('-- full mail --')
>> print(m)
>> print('-- just content --')
>> print(m.get_content())
>>
>> ######################################################################
>>
>> I get the following output:
>>
>> ######################################################################
>>
>> == cte: default == 
>>
>> Subject: Übung
>> Content-Type: text/plain; charset="utf-8"
>> Content-Transfer-Encoding: base64
>> MIME-Version: 1.0
>>
>> RGllcyBpc3QgZWluZSDDnGJ1bmcK
>>
>> -- full mail ---
>> Subject: Übung
>> Content-Type: text/plain; charset="utf-8"
>> Content-Transfer-Encoding: base64
>> MIME-Version: 1.0
>>
>> RGllcyBpc3QgZWluZSDDnGJ1bmcK
>>
>> -- just content--
>> Dies ist eine Übung
>>
>> == cte: quoted-printable ==
>>
>> -- full mail --
>> Subject: Übung
>> MIME-Version: 1.0
>> Content-Type: text/plain; charset="utf-8"
>> Content-Transfer-Encoding: quoted-printable
>>
>> Dies ist eine =C3=9Cbung
>>
>> -- just content --
>> Dies ist eine Übung
>>
>> ######################################################################
>>
>> So in both cases the subject is fine, but it is unclear to me how to
>> print the body.  Or rather, I know how to print the body OK, but I don't
>> know how to print the headers separately - there seems to be nothing
>> like 'get_headers()'.  I can use get('Subject') etc. and reconstruct the
>> headers, but that seems a little clunky.  
>
> Sorry, I am confusing the terminology here.  The 'body' seems to be the
> headers plus the 'content'.  So I can print the *content* without the
> headers OK, but I can't easily print all the headers separately.  If I
> just print the body, i.e. headers plus content, the umlauts in the
> content are not resolved.

OK, so I can do:

######################################################################
if args.verbose:
    for k in mail.keys():
        print(f"{k}: {mail.get(k)}")
    print('')
    print(mail.get_content())
######################################################################

prints what I want and is not wildly clunky, but I am a little surprised
that I can't get a string representation of the whole email in one go.

Cheers,

Loris


-- 
Dr. Loris Bennett (Herr/Mr)
FUB-IT, Freie Universität Berlin

[toc] | [prev] | [next] | [standalone]

#196960

From	"Peter J. Holzer" <hjp-python@hjp.at>
Date	2024-11-05 21:39 +0100
Message-ID	<mailman.81.1730839621.4695.python-list@python.org>
In reply to	#196953

On 2024-11-04 13:02:21 +0100, Loris Bennett via Python-list wrote:
> "Loris Bennett" <loris.bennett@fu-berlin.de> writes:
> > "Loris Bennett" <loris.bennett@fu-berlin.de> writes:
> >> Cameron Simpson <cs@cskk.id.au> writes:
> >>> On 01Nov2024 10:10, Loris Bennett <loris.bennett@fu-berlin.de> wrote:
> >>>>as expected.  The non-UTF-8 text occurs when I do
> >>>>
> >>>>  mail = EmailMessage()
> >>>>  mail.set_content(body, cte="quoted-printable")
> >>>>  ...
> >>>>
> >>>>  if args.verbose:
> >>>>      print(mail)
> >>>>
> >>>>which is presumably also correct.
> >>>>
> >>>>The question is: What conversion is necessary in order to print the
> >>>>EmailMessage object to the terminal, such that the quoted-printable
> >>>>parts are turned (back) into UTF-8?
[...]
> OK, so I can do:
> 
> ######################################################################
> if args.verbose:
>     for k in mail.keys():
>         print(f"{k}: {mail.get(k)}")
>     print('')
>     print(mail.get_content())
> ######################################################################
> 
> prints what I want and is not wildly clunky, but I am a little surprised
> that I can't get a string representation of the whole email in one go.

Mails can contain lots of stuff, so there is in general no suitable
human readable string representation of a whole email. You have to go
through it part by part and decide what you want to do with each. For
example, if you have a multipart/alternative with a text/plain and a
text/html part what should the "string representation" be? For some uses
the text/plain part might be sufficient. For some you might want the
HTML part or some rendering of it. Or what would you do with an image?
Omit it completely? Just use the filename (if any)? Try to convert it to
ASCII-Art? Use an AI to describe it?
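
A minimal sketch of going through a message part by part, assuming an
`EmailMessage` built under the default policy. The helper name `describe`
is hypothetical, and preferring text/plain and reducing attachments to a
placeholder line is only one of the possible answers to the questions
above:

```python
from email.message import EmailMessage


def describe(msg: EmailMessage) -> str:
    """Render a message part by part (hypothetical helper, not a stdlib API)."""
    # Header values come back decoded under the default policy.
    lines = [f"{name}: {value}" for name, value in msg.items()]
    lines.append("")

    # Prefer the text/plain alternative; fall back to text/html if needed.
    body = msg.get_body(preferencelist=("plain", "html"))
    if body is not None:
        lines.append(body.get_content().rstrip("\n"))

    # Arbitrary choice: show each attachment as a placeholder line only.
    for part in msg.iter_attachments():
        lines.append(
            f"[attachment: {part.get_content_type()} {part.get_filename()}]"
        )
    return "\n".join(lines)


msg = EmailMessage()
msg["Subject"] = "Übung"
msg.set_content("Dies ist eine Übung")
msg.add_alternative("<p>Dies ist eine <b>Übung</b></p>", subtype="html")
# Dummy bytes standing in for image data; the file name is made up.
msg.add_attachment(b"not really a PNG", maintype="image", subtype="png",
                   filename="bild.png")

print(describe(msg))
```

A different application might reasonably pick the HTML part or skip
attachments altogether; the point is only that the choice has to be made
per part.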

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

[toc] | [prev] | [next] | [standalone]

#196963

From	Cameron Simpson <cs@cskk.id.au>
Date	2024-11-06 08:20 +1100
Message-ID	<mailman.84.1730841650.4695.python-list@python.org>
In reply to	#196953

On 04Nov2024 13:02, Loris Bennett <loris.bennett@fu-berlin.de> wrote:
>OK, so I can do:
>
>######################################################################
>if args.verbose:
>    for k in mail.keys():
>        print(f"{k}: {mail.get(k)}")
>    print('')
>    print(mail.get_content())
>######################################################################
>
>prints what I want and is not wildly clunky, but I am a little surprised
>that I can't get a string representation of the whole email in one go.

A string representation of the whole message needs to be correctly 
encoded so that its components can be identified mechanically. So it 
needs to be a syntactically valid RFC5322 message. Thus the encoding.

As an example (slightly contrived) of why this is important, multipart 
messages are delimited with distinct lines, and their content must not 
contain such a line (even if it's in the "raw" original data).

So printing a whole message transcribes it in the encoded form so that 
it can be decoded mechanically. And conservatively, this is usually an 
ASCII-compatible encoding so that it can traverse various systems 
undamaged. This means that text requiring UTF-8 encoding gets further 
encoded as quoted-printable to avoid ambiguity about the meaning of 
bytes/octets which have their high bit set.
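
A small sketch of that round trip, assuming an `EmailMessage` with the
default policy (the variable names are illustrative): the serialized form
is pure ASCII, and a parser can still recover the original Unicode text
from it.

```python
import email
from email import policy
from email.message import EmailMessage

original = EmailMessage()
original["Subject"] = "Übung"
original.set_content("Dies ist eine Übung", cte="quoted-printable")

# The transport form: header and body are both encoded down to ASCII.
wire = original.as_bytes()
print(wire.decode("ascii"))

# Because the encoding is unambiguous, parsing recovers the Unicode text.
parsed = email.message_from_bytes(wire, policy=policy.default)
print(parsed["Subject"])
print(parsed.get_content())
```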

BTW, doesn't this:

     for k in mail.keys():
         print(f"{k}: {mail.get(k)}")

print the quoted printable (i.e. not decoded) form of subject lines?
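
A quick check suggests it does not, at least assuming the message is an
`EmailMessage` (which uses `email.policy.default`): header access goes
through the policy and returns decoded text, and the encoded word only
appears in the serialized form. A message parsed under the legacy
`compat32` policy would instead hand back the header as it appears in the
source.

```python
from email.message import EmailMessage

mail = EmailMessage()  # assumption: default policy, as in the thread
mail["Subject"] = "Übung"
mail.set_content("Dies ist eine Übung", cte="quoted-printable")

# Header access returns the decoded Unicode value.
subject = mail.get("Subject")
print(f"Subject: {subject}")

# The RFC 2047 encoded word only shows up when the message is serialized.
serialized = mail.as_string()
for line in serialized.splitlines():
    if line.lower().startswith("subject:"):
        print(line)
```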

Cheers,
Cameron Simpson <cs@cskk.id.au>

[toc] | [prev] | [next] | [standalone]

#196938

From	Cameron Simpson <cs@cskk.id.au>
Date	2024-11-02 08:44 +1100
Message-ID	<mailman.68.1730497471.4695.python-list@python.org>
In reply to	#196929

On 01Nov2024 08:11, Loris Bennett <loris.bennett@fu-berlin.de> wrote:
>Cameron Simpson <cs@cskk.id.au> writes:
>> If you're using the Python email module to parse (or construct) the
>> message as a `Message` object I'd expect that to happen automatically.
>
>I am using
>  email.message.EmailMessage

Noted. That seems like the correct approach to me.

>And you are right that encoding for the actual mail which is received 
>is
>automatically sorted out.  If I display the raw email in my client I get
>the following:
>
>  Content-Type: text/plain; charset="utf-8"
>  Content-Transfer-Encoding: quoted-printable
>  ...
>  Subject: =?utf-8?q?=C3=9Cbungsbetreff?=
>  ...
>  Dies ist eine =C3=9Cbung.

Right. Quoted-printable encoding for the transport.

>I would interpret that as meaning that the subject and body are encoded
>in the same way.

Yes.

>The problem just occurs with the unsent string representation printed to
>the terminal.

Yes, and I was thinking about this yesterday. I suspect that 
`print(some_message_object)` is intended to transcribe it for transport.  
For example, one could write to an mbox file and just print() the 
message into it and get correct transport/storage formatting, which 
includes the qp encoding.

Can you show the code (or example code) which leads to the qp output?  
I suspect there's a straightforward way to get the decoded Unicode, but 
I'd need to see how what you've got was obtained.

[toc] | [prev] | [standalone]
