Groups > comp.lang.python > #89986 > unrolled thread
| Started by | Paul Moore <p.f.moore@gmail.com> |
|---|---|
| First post | 2015-05-05 11:19 -0700 |
| Last post | 2015-05-08 15:28 +0300 |
| Articles | 7 — 6 participants |
Stripping unencodable characters from a string Paul Moore <p.f.moore@gmail.com> - 2015-05-05 11:19 -0700
Re: Stripping unencodable characters from a string Dave Angel <davea@davea.name> - 2015-05-05 15:00 -0400
Re: Stripping unencodable characters from a string Paul Moore <p.f.moore@gmail.com> - 2015-05-05 12:24 -0700
Re: Stripping unencodable characters from a string Marko Rauhamaa <marko@pacujo.net> - 2015-05-05 22:55 +0300
Re: Stripping unencodable characters from a string Jon Ribbens <jon+usenet@unequivocal.co.uk> - 2015-05-05 19:33 +0000
Re: Stripping unencodable characters from a string Chris Angelico <rosuav@gmail.com> - 2015-05-06 10:02 +1000
Re: Stripping unencodable characters from a string Serhiy Storchaka <storchaka@gmail.com> - 2015-05-08 15:28 +0300
| From | Paul Moore <p.f.moore@gmail.com> |
|---|---|
| Date | 2015-05-05 11:19 -0700 |
| Subject | Stripping unencodable characters from a string |
| Message-ID | <24ef6c6d-a47a-4d8c-8651-c581e25161cb@googlegroups.com> |
I want to write a string to an already-open file (sys.stdout, typically). However, I *don't* want encoding errors, and the string could be arbitrary Unicode (in theory). The best way I've found is
data = data.encode(file.encoding, errors='replace').decode(file.encoding)
file.write(data)
(I'd probably use backslashreplace rather than replace, but that's a minor point).
Is that the best way? The multiple re-encoding dance seems a bit clumsy, but it was the best I could think of.
Thanks,
Paul.
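A minimal sketch of the round trip described above, wrapped in a helper. The name `write_lossy` is hypothetical, and the sketch assumes the stream exposes a usable `encoding` attribute (which is not guaranteed for every file-like object).

```python
import io


def write_lossy(stream, text, errors="backslashreplace"):
    """Write text to a text-mode stream, replacing characters that the
    stream's encoding cannot represent (hypothetical helper)."""
    encoding = stream.encoding  # assumption: a valid codec name, not None
    # Encode with a forgiving error handler, then decode back so the result
    # contains only characters the stream can encode under "strict".
    safe = text.encode(encoding, errors).decode(encoding)
    stream.write(safe)
    return safe


# Demo: an ASCII-only text stream standing in for sys.stdout.
raw = io.BytesIO()
out = io.TextIOWrapper(raw, encoding="ascii", errors="strict")
written = write_lossy(out, u"caf\u00e9 \u20ac5")
out.flush()
```

With `errors="ignore"` the same helper would drop the offending characters rather than escape them.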
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2015-05-05 15:00 -0400 |
| Message-ID | <mailman.137.1430852451.12865.python-list@python.org> |
| In reply to | #89986 |
On 05/05/2015 02:19 PM, Paul Moore wrote:
You need to specify that you're using Python 3.4 (or whichever) when
starting a new thread.
> I want to write a string to an already-open file (sys.stdout, typically). However, I *don't* want encoding errors, and the string could be arbitrary Unicode (in theory). The best way I've found is
>
> data = data.encode(file.encoding, errors='replace').decode(file.encoding)
> file.write(data)
>
> (I'd probably use backslashreplace rather than replace, but that's a minor point).
>
> Is that the best way? The multiple re-encoding dance seems a bit clumsy, but it was the best I could think of.
>
> Thanks,
> Paul.
>
If you're going to take charge of the encoding of the file, why not just
open the file in binary, and do it all with
file.write(data.encode( myencoding, errors='replace') )
I can't see the benefit of two encodes and a decode just to write a
string to the file.
Alternatively, there's probably a way to open the file using
codecs.open(), and reassign it to sys.stdout.
--
DaveA
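The binary-mode suggestion can be sketched as follows, assuming the caller is free to open the file itself. The helper name `write_encoded`, the temporary file, and the default `"ascii"` encoding are illustrative choices, not anything from the post.

```python
import os
import tempfile


def write_encoded(path, text, encoding="ascii", errors="replace"):
    """Open the file in binary mode and do the single encode ourselves
    (hypothetical helper)."""
    with open(path, "wb") as f:
        f.write(text.encode(encoding, errors))


# Demo against a throwaway temporary file.
fd, path = tempfile.mkstemp()
os.close(fd)
write_encoded(path, u"na\u00efve")
with open(path, "rb") as f:
    result = f.read()
os.remove(path)
```

This only applies when the code owns the file; it does not help with a stream that is already open in text mode.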
| From | Paul Moore <p.f.moore@gmail.com> |
|---|---|
| Date | 2015-05-05 12:24 -0700 |
| Message-ID | <a75300bf-7a30-464a-84cb-eb3cde4ca40f@googlegroups.com> |
| In reply to | #89991 |
On Tuesday, 5 May 2015 20:01:04 UTC+1, Dave Angel wrote:
> On 05/05/2015 02:19 PM, Paul Moore wrote:
>
> You need to specify that you're using Python 3.4 (or whichever) when
> starting a new thread.
Sorry. 2.6, 2.7, and 3.3+. It's for use in a cross-version library.
> If you're going to take charge of the encoding of the file, why not just
> open the file in binary, and do it all with
> file.write(data.encode( myencoding, errors='replace') )
I don't have control of the encoding of the file. It's typically sys.stdout, which is already open. I can't replace sys.stdout (because the main program which calls my library code wouldn't like me messing with global state behind its back). And sys.stdout isn't open in binary mode.
> I can't see the benefit of two encodes and a decode just to write a
> string to the file.
Nor can I - that's my point. But if all I have is an open text-mode file with the "strict" error mode, I have to incur one encode, and I have to make sure that no characters are passed to that encode which can't be encoded. If there was a codec method to identify un-encodable characters, that might be an alternative (although it's quite possible that the encode/decode dance would be faster anyway, as it's mostly in C - not that performance is key here).
> Alternatively, there's probably a way to open the file using
> codecs.open(), and reassign it to sys.stdout.
As I said, I have to work with the file (sys.stdout or whatever) that I'm given. I can't reopen or replace it.
Paul
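There is no single codec call that lists un-encodable characters, but one can be approximated in pure Python by trying each distinct character in turn. This is only a sketch: both function names are hypothetical, and it is likely slower than the encode/decode round trip for long strings.

```python
def unencodable_chars(text, encoding):
    """Return the distinct characters of text that encoding rejects, sorted
    (hypothetical helper)."""
    bad = []
    for ch in sorted(set(text)):
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


def strip_unencodable(text, encoding):
    """Drop every character that encoding cannot represent."""
    bad = set(unencodable_chars(text, encoding))
    return u"".join(ch for ch in text if ch not in bad)


sample = u"a\u00e9\u20ac"  # 'a', e-acute, euro sign
bad_for_latin1 = unencodable_chars(sample, "latin-1")
stripped = strip_unencodable(sample, "latin-1")
```

For plain stripping, `text.encode(encoding, "ignore").decode(encoding)` should give the same result without the Python-level loop.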
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2015-05-05 22:55 +0300 |
| Message-ID | <87zj5i6bpc.fsf@elektro.pacujo.net> |
| In reply to | #89993 |
Paul Moore <p.f.moore@gmail.com>:
> Nor can I - that's my point. But if all I have is an open text-mode
> file with the "strict" error mode, I have to incur one encode, and I
> have to make sure that no characters are passed to that encode which
> can't be encoded.
The file-like object you are given carries some baggage. IOW, it's not a "file" in the sense you are thinking about it. It's some object that accepts data with its write() method.
Now, Python file-like objects ostensibly implement a common interface. However, as you are describing here, not all write() methods accept the same arguments. Text file objects expect str objects while binary file objects expect bytes objects. Maybe there are yet other file-like objects that expect some other types of object as their arguments.
Bottom line: Python doesn't fulfill your expectation. Your library can't operate on generic file-like objects because Python3 doesn't have generic file-like objects. Your library must do something else.
For example, you could require a binary file object. The caller must then possibly wrap their actual object inside a converter, which is relatively trivial in Python.
Marko
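One possible reading of the "require a binary file object" idea, as a sketch. Both `emit` and `TextToBinaryAdapter` are hypothetical names, and the adapter assumes every `write()` call receives complete characters (true here because `emit` encodes whole strings).

```python
import io


def emit(binary_stream, text, encoding="utf-8", errors="backslashreplace"):
    """Library side: always write bytes, with a forgiving error handler."""
    binary_stream.write(text.encode(encoding, errors))


class TextToBinaryAdapter(object):
    """Caller side: accept bytes and forward the decoded text to a text
    stream (hypothetical converter)."""

    def __init__(self, text_stream, encoding="utf-8"):
        self._stream = text_stream
        self._encoding = encoding

    def write(self, data):
        self._stream.write(data.decode(self._encoding))
        return len(data)


# A caller that only has a text stream wraps it in the converter.
sio = io.StringIO()
emit(TextToBinaryAdapter(sio), u"x\u20ac")

# A caller that has a real binary stream passes it directly.
bio = io.BytesIO()
emit(bio, u"x\u20ac", encoding="ascii")
```

Note that the adapter only moves the problem: if the wrapped text stream uses a narrow encoding with the "strict" handler, its own `write()` can still raise.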
| From | Jon Ribbens <jon+usenet@unequivocal.co.uk> |
|---|---|
| Date | 2015-05-05 19:33 +0000 |
| Message-ID | <slrnmki6qg.2fu.jon+usenet@frosty.unequivocal.co.uk> |
| In reply to | #89986 |
On 2015-05-05, Paul Moore <p.f.moore@gmail.com> wrote:
> I want to write a string to an already-open file (sys.stdout,
> typically). However, I *don't* want encoding errors, and the string
> could be arbitrary Unicode (in theory). The best way I've found is
>
> data = data.encode(file.encoding, errors='replace').decode(file.encoding)
> file.write(data)
>
> (I'd probably use backslashreplace rather than replace, but that's a
> minor point).
>
> Is that the best way? The multiple re-encoding dance seems a bit
> clumsy, but it was the best I could think of.
Perhaps something like one of:
file.buffer.write(data.encode(file.encoding, errors="replace"))
or:
sys.stdout = io.TextIOWrapper(sys.stdout.detach(),
                              encoding=sys.stdout.encoding, errors="replace")
(both of which could go wrong in various ways depending on your
circumstances).
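A sketch of the first variant with one of the possible pitfalls handled: the text layer is flushed before writing to the underlying buffer so that output stays in order. It assumes a Python 3 stream with a `buffer` attribute (Python 2's `sys.stdout` has none), and the helper name `write_via_buffer` is hypothetical.

```python
import io


def write_via_buffer(stream, text, errors="replace"):
    """Bypass the text layer and write pre-encoded bytes to stream.buffer
    (hypothetical helper)."""
    stream.flush()  # push pending text down first to preserve ordering
    stream.buffer.write(text.encode(stream.encoding, errors))
    stream.buffer.flush()


# Demo: an ASCII text stream over an in-memory byte buffer.
raw = io.BytesIO()
out = io.TextIOWrapper(raw, encoding="ascii")
out.write(u"before ")
write_via_buffer(out, u"\u00fcber")
out.write(u" after")
out.flush()
result = raw.getvalue()
```

Writing to the buffer also skips the text layer's newline translation, which may matter on Windows.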
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-05-06 10:02 +1000 |
| Message-ID | <mailman.152.1430870587.12865.python-list@python.org> |
| In reply to | #89986 |
On Wed, May 6, 2015 at 4:19 AM, Paul Moore <p.f.moore@gmail.com> wrote:
> I want to write a string to an already-open file (sys.stdout, typically). However, I *don't* want encoding errors, and the string could be arbitrary Unicode (in theory). The best way I've found is
>
> data = data.encode(file.encoding, errors='replace').decode(file.encoding)
> file.write(data)
>
> (I'd probably use backslashreplace rather than replace, but that's a minor point).
>
> Is that the best way? The multiple re-encoding dance seems a bit clumsy, but it was the best I could think of.
The simplest solution would be to call ascii() on the string, which will give you an ASCII-only representation (using backslash escapes). If your goal is to write Unicode text to a log file in some safe way, this is what I would be doing.
ChrisA
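For comparison, a short sketch of what ascii() produces next to the backslashreplace round trip. ascii() is a Python 3 builtin, so for the 2.6/2.7 targets mentioned earlier in the thread something else would be needed. The variable names are illustrative.

```python
text = u"caf\u00e9 \u20ac5"

# ascii() returns a repr-style string: quotes included, non-ASCII escaped.
quoted = ascii(text)

# The round trip through the "ascii" codec escapes the same characters
# but does not add the surrounding quotes.
unquoted = text.encode("ascii", "backslashreplace").decode("ascii")
```

Both escape every non-ASCII character, even ones the target stream could have encoded.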
| From | Serhiy Storchaka <storchaka@gmail.com> |
|---|---|
| Date | 2015-05-08 15:28 +0300 |
| Message-ID | <mailman.243.1431088153.12865.python-list@python.org> |
| In reply to | #89986 |
On 05.05.15 21:19, Paul Moore wrote:
> I want to write a string to an already-open file (sys.stdout, typically). However, I *don't* want encoding errors, and the string could be arbitrary Unicode (in theory). The best way I've found is
>
> data = data.encode(file.encoding, errors='replace').decode(file.encoding)
> file.write(data)
>
> (I'd probably use backslashreplace rather than replace, but that's a minor point).
>
> Is that the best way? The multiple re-encoding dance seems a bit clumsy, but it was the best I could think of.
There are flaws in this approach.
1) file.encoding can be None (StringIO) or absent (general file-like object, that implements only write()).
2) When the encoding is UTF-16, UTF-32, UTF-8-SIG, the output will contain superfluous byte order marks.
This is not an easy problem and there is no simple solution. In particular cases you can create TextIOWrapper(file.buffer, encoding=file.encoding, errors='replace', newline=file.newlines, write_through=True) and write to it, but be aware of limitations.
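A sketch that guards against the first flaw by treating a missing or None encoding as "pass the text through unchanged". That fallback is an assumption, not something from the thread, and `safe_text` is a hypothetical name. The last line shows where a byte order mark comes from: every chunk encoded on its own with a BOM-writing codec carries one.

```python
import io


def safe_text(stream, text, errors="backslashreplace"):
    """Return text reduced to what stream can encode (hypothetical helper)."""
    encoding = getattr(stream, "encoding", None)
    if not encoding:
        # StringIO (encoding is None) or a bare write()-only object:
        # nothing to encode against, so leave the text alone (assumption).
        return text
    return text.encode(encoding, errors).decode(encoding)


class WriteOnly(object):
    """A minimal file-like object with no encoding attribute."""

    def write(self, data):
        pass


ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
from_ascii = safe_text(ascii_stream, u"\u20ac")
from_stringio = safe_text(io.StringIO(), u"\u20ac")
from_bare = safe_text(WriteOnly(), u"\u20ac")

# Each independently encoded chunk starts with the UTF-8 signature bytes.
bom_chunk = u"a".encode("utf-8-sig")
```

Whether passing the text through is the right fallback depends on what the write()-only object does with it, so this is a starting point at best.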