
Stripping unencodable characters from a string

Started by	Paul Moore <p.f.moore@gmail.com>
First post	2015-05-05 11:19 -0700
Last post	2015-05-08 15:28 +0300
Articles	7 — 6 participants


  Stripping unencodable characters from a string Paul  Moore <p.f.moore@gmail.com> - 2015-05-05 11:19 -0700
    Re: Stripping unencodable characters from a string Dave Angel <davea@davea.name> - 2015-05-05 15:00 -0400
      Re: Stripping unencodable characters from a string Paul  Moore <p.f.moore@gmail.com> - 2015-05-05 12:24 -0700
        Re: Stripping unencodable characters from a string Marko Rauhamaa <marko@pacujo.net> - 2015-05-05 22:55 +0300
    Re: Stripping unencodable characters from a string Jon Ribbens <jon+usenet@unequivocal.co.uk> - 2015-05-05 19:33 +0000
    Re: Stripping unencodable characters from a string Chris Angelico <rosuav@gmail.com> - 2015-05-06 10:02 +1000
    Re: Stripping unencodable characters from a string Serhiy Storchaka <storchaka@gmail.com> - 2015-05-08 15:28 +0300

#89986 — Stripping unencodable characters from a string

From	Paul Moore <p.f.moore@gmail.com>
Date	2015-05-05 11:19 -0700
Subject	Stripping unencodable characters from a string
Message-ID	<24ef6c6d-a47a-4d8c-8651-c581e25161cb@googlegroups.com>

I want to write a string to an already-open file (sys.stdout, typically). However, I *don't* want encoding errors, and the string could be arbitrary Unicode (in theory). The best way I've found is

    data = data.encode(file.encoding, errors='replace').decode(file.encoding)
    file.write(data)

(I'd probably use backslashreplace rather than replace, but that's a minor point).

Is that the best way? The multiple re-encoding dance seems a bit clumsy, but it was the best I could think of.

Thanks,
Paul.
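
For reference, a minimal sketch of the round trip described above, assuming Python 3 and a text-mode file object. The helper name `write_safely` is hypothetical, and the fallback for a missing or `None` encoding is an assumption, not part of the original post.

```python
def write_safely(file, data, errors="backslashreplace"):
    """Write data to a text-mode file, escaping characters it cannot encode."""
    # Not every file-like object has an encoding (StringIO reports None).
    encoding = getattr(file, "encoding", None)
    if encoding:
        # Round trip: unencodable characters become escapes on encode,
        # and the decode turns the result back into a str for write().
        data = data.encode(encoding, errors).decode(encoding)
    file.write(data)
```

Called as `write_safely(sys.stdout, text)`, this leaves sys.stdout itself untouched; only the string is altered before it is written.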


#89991

From	Dave Angel <davea@davea.name>
Date	2015-05-05 15:00 -0400
Message-ID	<mailman.137.1430852451.12865.python-list@python.org>
In reply to	#89986

On 05/05/2015 02:19 PM, Paul Moore wrote:

You need to specify that you're using Python 3.4 (or whichever) when 
starting a new thread.

> I want to write a string to an already-open file (sys.stdout, typically). However, I *don't* want encoding errors, and the string could be arbitrary Unicode (in theory). The best way I've found is
>
>      data = data.encode(file.encoding, errors='replace').decode(file.encoding)
>      file.write(data)
>
> (I'd probably use backslashreplace rather than replace, but that's a minor point).
>
> Is that the best way? The multiple re-encoding dance seems a bit clumsy, but it was the best I could think of.
>
> Thanks,
> Paul.
>

If you're going to take charge of the encoding of the file, why not just 
open the file in binary, and do it all with
     file.write(data.encode( myencoding, errors='replace') )

I can't see the benefit of two encodes and a decode just to write a 
string to the file.

Alternatively, there's probably a way to open the file using 
codecs.open(), and reassign it to sys.stdout.

-- 
DaveA
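
A minimal sketch of the binary-mode suggestion, assuming Python 3 and that the caller controls how the file is opened. The helper name `write_encoded`, the temporary file, and the choice of ASCII are illustrative assumptions only.

```python
import os
import tempfile

def write_encoded(path, data, encoding="ascii", errors="replace"):
    """Open path in binary mode and write data encoded once, leniently."""
    with open(path, "wb") as f:
        f.write(data.encode(encoding, errors))

# Demonstrate on a throwaway file.
fd, path = tempfile.mkstemp()
os.close(fd)
write_encoded(path, "caf\u00e9")
with open(path, "rb") as f:
    result = f.read()
os.remove(path)
print(result)  # b'caf?'
```

This only applies when the code opening the file is also the code writing to it.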


#89993

From	Paul Moore <p.f.moore@gmail.com>
Date	2015-05-05 12:24 -0700
Message-ID	<a75300bf-7a30-464a-84cb-eb3cde4ca40f@googlegroups.com>
In reply to	#89991

On Tuesday, 5 May 2015 20:01:04 UTC+1, Dave Angel  wrote:
> On 05/05/2015 02:19 PM, Paul Moore wrote:
> 
> You need to specify that you're using Python 3.4 (or whichever) when 
> starting a new thread.

Sorry. 2.6, 2.7, and 3.3+. It's for use in a cross-version library.

> If you're going to take charge of the encoding of the file, why not just 
> open the file in binary, and do it all with
>      file.write(data.encode( myencoding, errors='replace') )

I don't have control of the encoding of the file. It's typically sys.stdout, which is already open. I can't replace sys.stdout (because the main program which calls my library code wouldn't like me messing with global state behind its back). And sys.stdout isn't open in binary mode.

> I can't see the benefit of two encodes and a decode just to write a 
> string to the file.

Nor can I - that's my point. But if all I have is an open text-mode file with the "strict" error mode, I have to incur one encode, and I have to make sure that no characters are passed to that encode which can't be encoded.

If there was a codec method to identify un-encodable characters, that might be an alternative (although it's quite possible that the encode/decode dance would be faster anyway, as it's mostly in C - not that performance is key here).

> Alternatively, there's probably a way to open the file using 
> codecs.open(), and reassign it to sys.stdout.

As I said, I have to work with the file (sys.stdout or whatever) that I'm given. I can't reopen or replace it.

Paul
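
There is no dedicated codec method for this as far as the thread establishes, but the `start` and `end` attributes of `UnicodeEncodeError` can be used to locate the offending characters. A rough sketch, assuming Python 3; the function name `unencodable_spans` is hypothetical, and the repeated slicing makes it slow for long strings with many failures.

```python
def unencodable_spans(text, encoding):
    """Return (start, end) index pairs of text that encoding cannot represent."""
    spans = []
    pos = 0
    while pos < len(text):
        try:
            # Strict encoding raises at the first unencodable run.
            text[pos:].encode(encoding)
        except UnicodeEncodeError as exc:
            # exc.start and exc.end are relative to the slice being encoded.
            spans.append((pos + exc.start, pos + exc.end))
            pos += exc.end
        else:
            break
    return spans
```

For example, `unencodable_spans("a\u00e9b\u20acc", "ascii")` reports the positions of the two non-ASCII characters, which could then be dropped or replaced before a single strict encode.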


#89996

From	Marko Rauhamaa <marko@pacujo.net>
Date	2015-05-05 22:55 +0300
Message-ID	<87zj5i6bpc.fsf@elektro.pacujo.net>
In reply to	#89993

Paul  Moore <p.f.moore@gmail.com>:

> Nor can I - that's my point. But if all I have is an open text-mode
> file with the "strict" error mode, I have to incur one encode, and I
> have to make sure that no characters are passed to that encode which
> can't be encoded.

The file-like object you are given carries some baggage. IOW, it's not a
"file" in the sense you are thinking about it. It's some object that
accepts data with its write() method.

Now, Python file-like objects ostensibly implement a common interface.
However, as you are describing here, not all write() methods accept the
same arguments. Text file objects expect str objects while binary file
objects expect bytes objects. Maybe there are yet other file-like
objects that expect some other types of object as their arguments.

Bottom line: Python doesn't fulfill your expectation. Your library can't
operate on generic file-like objects because Python 3 doesn't have
generic file-like objects. Your library must do something else. For
example, you could require a binary file object. The caller must then
possibly wrap their actual object inside a converter, which is
relatively trivial in Python.


Marko
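
One possible shape for such a converter, as a sketch only: the library writes bytes, and the caller wraps a text file object so that those bytes are decoded before being passed on. The class name `BytesToTextWriter` and the UTF-8 default are assumptions, and the wrapped text file can still reject characters its own encoding lacks.

```python
import codecs

class BytesToTextWriter:
    """Accept bytes through write() and forward decoded text to a text file."""

    def __init__(self, text_file, encoding="utf-8"):
        self._text_file = text_file
        # An incremental decoder copes with multi-byte sequences that
        # are split across separate write() calls.
        self._decoder = codecs.getincrementaldecoder(encoding)(errors="replace")

    def write(self, data):
        self._text_file.write(self._decoder.decode(data))
        return len(data)
```

A caller would pass `BytesToTextWriter(sys.stdout)` wherever the library expects a binary file object.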


#89994

From	Jon Ribbens <jon+usenet@unequivocal.co.uk>
Date	2015-05-05 19:33 +0000
Message-ID	<slrnmki6qg.2fu.jon+usenet@frosty.unequivocal.co.uk>
In reply to	#89986

On 2015-05-05, Paul Moore <p.f.moore@gmail.com> wrote:
> I want to write a string to an already-open file (sys.stdout,
> typically). However, I *don't* want encoding errors, and the string
> could be arbitrary Unicode (in theory). The best way I've found is
>
>     data = data.encode(file.encoding, errors='replace').decode(file.encoding)
>     file.write(data)
>
> (I'd probably use backslashreplace rather than replace, but that's a
> minor point).
>
> Is that the best way? The multiple re-encoding dance seems a bit
> clumsy, but it was the best I could think of.

Perhaps something like one of:

  file.buffer.write(data.encode(file.encoding, errors="replace"))

or:

  sys.stdout = io.TextIOWrapper(sys.stdout.detach(),
      encoding=sys.stdout.encoding, errors="replace")

(both of which could go wrong in various ways depending on your
circumstances).
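
A sketch of the first variant, assuming Python 3 and a file object that exposes both `buffer` and `encoding` (true for a standard `io.TextIOWrapper`, not for every file-like object). The helper name `write_via_buffer` and the explicit flushes are assumptions added to keep output in order.

```python
def write_via_buffer(file, data, errors="replace"):
    """Encode data leniently and write the bytes to the underlying buffer."""
    # Flush pending text first so earlier write() calls are not overtaken.
    file.flush()
    file.buffer.write(data.encode(file.encoding, errors))
    file.buffer.flush()
```

Among the ways this can go wrong: it bypasses the text layer's newline translation, and with a codec such as UTF-16 every `encode()` call prepends its own byte order mark.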


#90014

From	Chris Angelico <rosuav@gmail.com>
Date	2015-05-06 10:02 +1000
Message-ID	<mailman.152.1430870587.12865.python-list@python.org>
In reply to	#89986

On Wed, May 6, 2015 at 4:19 AM, Paul  Moore <p.f.moore@gmail.com> wrote:
> I want to write a string to an already-open file (sys.stdout, typically). However, I *don't* want encoding errors, and the string could be arbitrary Unicode (in theory). The best way I've found is
>
>     data = data.encode(file.encoding, errors='replace').decode(file.encoding)
>     file.write(data)
>
> (I'd probably use backslashreplace rather than replace, but that's a minor point).
>
> Is that the best way? The multiple re-encoding dance seems a bit clumsy, but it was the best I could think of.

The simplest solution would be to call ascii() on the string, which
will give you an ASCII-only representation (using backslash escapes).
If your goal is to write Unicode text to a log file in some safe way,
this is what I would be doing.

ChrisA
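
A short illustration of what `ascii()` returns, assuming Python 3 (the built-in does not exist in Python 2). The sample strings are made up.

```python
text = "caf\u00e9 costs 5\u20ac"
safe = ascii(text)
print(safe)  # 'caf\xe9 costs 5\u20ac'

# Control characters such as newlines are escaped as well.
multiline = ascii("a\nb")
print(multiline)  # 'a\nb'
```

Note that the result is a repr-style string: it includes the surrounding quotes and escapes newlines, which suits log files better than user-facing output.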


#90158

From	Serhiy Storchaka <storchaka@gmail.com>
Date	2015-05-08 15:28 +0300
Message-ID	<mailman.243.1431088153.12865.python-list@python.org>
In reply to	#89986

On 05.05.15 21:19, Paul Moore wrote:
> I want to write a string to an already-open file (sys.stdout, typically). However, I *don't* want encoding errors, and the string could be arbitrary Unicode (in theory). The best way I've found is
>
>      data = data.encode(file.encoding, errors='replace').decode(file.encoding)
>      file.write(data)
>
> (I'd probably use backslashreplace rather than replace, but that's a minor point).
>
> Is that the best way? The multiple re-encoding dance seems a bit clumsy, but it was the best I could think of.

There are flaws in this approach.

1) file.encoding can be None (StringIO) or absent (a general file-like 
object that implements only write()).

2) When the encoding is UTF-16, UTF-32, or UTF-8-SIG, the output will 
contain superfluous byte order marks.

This is not an easy problem, and there is no simple solution. In particular 
cases you can create TextIOWrapper(file.buffer, 
encoding=file.encoding, errors='replace', newline=file.newlines, 
write_through=True) and write to it, but be aware of the limitations.
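
A sketch of the temporary-wrapper idea, assuming Python 3. The helper name `write_lenient` is hypothetical, as is the fallback to a plain `write()` when the file has no usable `encoding` or `buffer`. The `newline` argument is left at its default here, and the byte order mark concern in point 2 is not addressed.

```python
import io

def write_lenient(file, data, errors="replace"):
    """Write data through a temporary wrapper with a lenient error handler."""
    encoding = getattr(file, "encoding", None)
    buffer = getattr(file, "buffer", None)
    if not encoding or buffer is None:
        # StringIO or a bare write()-only object: nothing to re-wrap.
        file.write(data)
        return
    # Push out text the original wrapper still holds, to preserve ordering.
    file.flush()
    wrapper = io.TextIOWrapper(buffer, encoding=encoding, errors=errors,
                               write_through=True)
    try:
        wrapper.write(data)
    finally:
        # Detach so that discarding the wrapper does not close the buffer.
        wrapper.detach()
```

Detaching matters because a `TextIOWrapper` that is garbage collected while still attached closes its underlying buffer, which would close the caller's stream too.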

