| Started by | Adam Funk <a24061@ducksburg.com> |
|---|---|
| First post | 2015-12-03 15:12 +0000 |
| Last post | 2015-12-03 21:40 +0100 |
| Articles | 14 — 7 participants |
getting fileinput to do errors='ignore' or 'replace'? Adam Funk <a24061@ducksburg.com> - 2015-12-03 15:12 +0000
Re: getting fileinput to do errors='ignore' or 'replace'? Adam Funk <a24061@ducksburg.com> - 2015-12-03 15:18 +0000
Re: getting fileinput to do errors='ignore' or 'replace'? Peter Otten <__peter__@web.de> - 2015-12-03 17:11 +0100
Re: getting fileinput to do errors='ignore' or 'replace'? Adam Funk <a24061@ducksburg.com> - 2015-12-03 19:17 +0000
Re: getting fileinput to do errors='ignore' or 'replace'? Terry Reedy <tjreedy@udel.edu> - 2015-12-03 11:48 -0500
Re: getting fileinput to do errors='ignore' or 'replace'? Adam Funk <a24061@ducksburg.com> - 2015-12-03 19:21 +0000
Re: getting fileinput to do errors='ignore' or 'replace'? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2015-12-03 22:26 +0000
Re: getting fileinput to do errors='ignore' or 'replace'? Serhiy Storchaka <storchaka@gmail.com> - 2015-12-04 10:34 +0200
Re: getting fileinput to do errors='ignore' or 'replace'? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2015-12-04 09:00 +0000
Re: getting fileinput to do errors='ignore' or 'replace'? Adam Funk <a24061@ducksburg.com> - 2015-12-07 14:46 +0000
Re: getting fileinput to do errors='ignore' or 'replace'? MRAB <python@mrabarnett.plus.com> - 2015-12-03 16:12 +0000
Re: getting fileinput to do errors='ignore' or 'replace'? Laura Creighton <lac@openend.se> - 2015-12-03 17:46 +0100
Re: getting fileinput to do errors='ignore' or 'replace'? Adam Funk <a24061@ducksburg.com> - 2015-12-03 19:17 +0000
Re: getting fileinput to do errors='ignore' or 'replace'? Laura Creighton <lac@openend.se> - 2015-12-03 21:40 +0100
| From | Adam Funk <a24061@ducksburg.com> |
|---|---|
| Date | 2015-12-03 15:12 +0000 |
| Subject | getting fileinput to do errors='ignore' or 'replace'? |
| Message-ID | <fn26jcxltl.ln2@news.ducksburg.com> |
I'm having trouble with some input files that are almost all proper
UTF-8 but with a couple of troublesome characters mixed in, which I'd
like to ignore instead of throwing ValueError. I've found the
openhook for the encoding
for line in fileinput.input(options.files, openhook=fileinput.hook_encoded("utf-8")):
    do_stuff(line)
which the documentation describes as "a hook which opens each file
with codecs.open(), using the given encoding to read the file", but
I'd like codecs.open() to also have the errors='ignore' or
errors='replace' effect. Is it possible to do this?
Thanks.
--
Why is it drug addicts and computer aficionados are both
called users? --- Clifford Stoll
| From | Adam Funk <a24061@ducksburg.com> |
|---|---|
| Date | 2015-12-03 15:18 +0000 |
| Message-ID | <8336jcxi2m.ln2@news.ducksburg.com> |
| In reply to | #99961 |
On 2015-12-03, Adam Funk wrote:
> I'm having trouble with some input files that are almost all proper
> UTF-8 but with a couple of troublesome characters mixed in, which I'd
> like to ignore instead of throwing ValueError. I've found the
> openhook for the encoding
>
> for line in fileinput.input(options.files, openhook=fileinput.hook_encoded("utf-8")):
>     do_stuff(line)
>
> which the documentation describes as "a hook which opens each file
> with codecs.open(), using the given encoding to read the file", but
> I'd like codecs.open() to also have the errors='ignore' or
> errors='replace' effect. Is it possible to do this?
I forgot to mention: this is for Python 2.7.3 & 2.7.10 (on different
machines).
--
...the reason why so many professional artists drink a lot is not
necessarily very much to do with the artistic temperament, etc. It is
simply that they can afford to, because they can normally take a large
part of a day off to deal with the ravages. --- Amis _On Drink_
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2015-12-03 17:11 +0100 |
| Message-ID | <mailman.174.1449159118.14615.python-list@python.org> |
| In reply to | #99962 |
Adam Funk wrote:
> On 2015-12-03, Adam Funk wrote:
>
>> I'm having trouble with some input files that are almost all proper
>> UTF-8 but with a couple of troublesome characters mixed in, which I'd
>> like to ignore instead of throwing ValueError. I've found the
>> openhook for the encoding
>>
>> for line in fileinput.input(options.files,
>>         openhook=fileinput.hook_encoded("utf-8")):
>>     do_stuff(line)
>>
>> which the documentation describes as "a hook which opens each file
>> with codecs.open(), using the given encoding to read the file", but
>> I'd like codecs.open() to also have the errors='ignore' or
>> errors='replace' effect. Is it possible to do this?
>
> I forgot to mention: this is for Python 2.7.3 & 2.7.10 (on different
> machines).
Have a look at the source of fileinput.hook_encoded:
def hook_encoded(encoding):
    import io
    def openhook(filename, mode):
        mode = mode.replace('U', '').replace('b', '') or 'r'
        return io.open(filename, mode, encoding=encoding, newline='')
    return openhook
You can use it as a template to write your own factory function:
def my_hook_encoded(encoding, errors=None):
    import io
    def openhook(filename, mode):
        mode = mode.replace('U', '').replace('b', '') or 'r'
        return io.open(
            filename, mode,
            encoding=encoding, newline='',
            errors=errors)
    return openhook

for line in fileinput.input(
        options.files,
        openhook=my_hook_encoded("utf-8", errors="ignore")):
    do_stuff(line)
Another option is to create the function on the fly:
for line in fileinput.input(
        options.files,
        openhook=functools.partial(
            io.open, encoding="utf-8", errors="replace")):
    do_stuff(line)
(codecs.open() instead of io.open() should also work)
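A minimal, self-contained sketch of the factory-function approach above, assuming Python 3 and a throwaway temporary file. The names `hook_encoded_with_errors` and `read_lines`, and the sample bytes, are illustrative and not from the thread.

```python
import fileinput
import io
import os
import tempfile


def hook_encoded_with_errors(encoding, errors=None):
    """Return an openhook that decodes with the given error handler."""
    def openhook(filename, mode):
        # fileinput may pass 'rU' or 'rb'; io.open needs a text mode here.
        mode = mode.replace('U', '').replace('b', '') or 'r'
        return io.open(filename, mode, encoding=encoding, errors=errors)
    return openhook


def read_lines(paths, errors):
    """Collect all lines from paths, decoding as UTF-8 with `errors`."""
    reader = fileinput.input(
        paths, openhook=hook_encoded_with_errors("utf-8", errors=errors))
    try:
        return list(reader)
    finally:
        reader.close()  # release fileinput's module-level state


# Sample file: one valid UTF-8 line and one line with a stray 0xFF byte.
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"caf\xc3\xa9\nbad\xffbyte\n")

ignored = read_lines([sample_path], "ignore")    # invalid byte dropped
replaced = read_lines([sample_path], "replace")  # invalid byte -> U+FFFD
os.remove(sample_path)
```

With `errors="ignore"` the bad byte silently disappears, while `errors="replace"` leaves a visible U+FFFD marker, which is usually easier to audit afterwards.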
| From | Adam Funk <a24061@ducksburg.com> |
|---|---|
| Date | 2015-12-03 19:17 +0000 |
| Message-ID | <c3h6jcxtas.ln2@news.ducksburg.com> |
| In reply to | #99966 |
On 2015-12-03, Peter Otten wrote:
> def my_hook_encoded(encoding, errors=None):
>     import io
>     def openhook(filename, mode):
>         mode = mode.replace('U', '').replace('b', '') or 'r'
>         return io.open(
>             filename, mode,
>             encoding=encoding, newline='',
>             errors=errors)
>     return openhook
>
> for line in fileinput.input(
>         options.files,
>         openhook=my_hook_encoded("utf-8", errors="ignore")):
>     do_stuff(line)
Perfect, thanks!
> (codecs.open() instead of io.open() should also work)
OK.
--
The internet is quite simply a glorious place. Where else can you find
bootlegged music and films, questionable women, deep seated xenophobia
and amusing cats all together in the same place? --- Tom Belshaw
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2015-12-03 11:48 -0500 |
| Message-ID | <mailman.180.1449161351.14615.python-list@python.org> |
| In reply to | #99962 |
On 12/3/2015 10:18 AM, Adam Funk wrote:
> On 2015-12-03, Adam Funk wrote:
>
>> I'm having trouble with some input files that are almost all proper
>> UTF-8 but with a couple of troublesome characters mixed in, which I'd
>> like to ignore instead of throwing ValueError. I've found the
>> openhook for the encoding
>>
>> for line in fileinput.input(options.files, openhook=fileinput.hook_encoded("utf-8")):
>>     do_stuff(line)
>>
>> which the documentation describes as "a hook which opens each file
>> with codecs.open(), using the given encoding to read the file", but
>> I'd like codecs.open() to also have the errors='ignore' or
>> errors='replace' effect. Is it possible to do this?
>
> I forgot to mention: this is for Python 2.7.3 & 2.7.10 (on different
> machines).
fileinput is an ancient module that predates iterators (and generators)
and context managers. Since by 2.7 open files are both context managers
and line iterators, you can easily write your own multi-file line
iteration that does exactly what you want. At minimum:
for file in files:
    with codecs.open(file, encoding='utf-8', errors='ignore') as f:
        # did not look up signature,
        for line in f:
            do_stuff(line)
To make this reusable, wrap in 'def filelines(files):' and replace
'do_stuff(line)' with 'yield line'.
--
Terry Jan Reedy
| From | Adam Funk <a24061@ducksburg.com> |
|---|---|
| Date | 2015-12-03 19:21 +0000 |
| Message-ID | <5bh6jcxtas.ln2@news.ducksburg.com> |
| In reply to | #99973 |
On 2015-12-03, Terry Reedy wrote:
> fileinput is an ancient module that predates iterators (and generators)
> and context managers. Since by 2.7 open files are both context managers
> and line iterators, you can easily write your own multi-file line
> iteration that does exactly what you want. At minimum:
>
> for file in files:
>     with codecs.open(file, errors='ignore') as f
>         # did not look up signature,
>         for line in f:
>             do_stuff(line)
>
> To make this reusable, wrap in 'def filelines(files):' and replace
> 'do_stuff(line)' with 'yield line'.
I like fileinput because if the file list is empty, it reads from
stdin instead (so I can pipe something else's output into it).
Unfortunately, the fix I got elsewhere in this thread doesn't seem to
work for that!
--
Science is what we understand well enough to explain to a computer.
Art is everything else we do. --- Donald Knuth
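The stdin problem arises because, as far as I can tell, fileinput calls the openhook only for named files and never for standard input. One possible workaround, sketched here under stated assumptions, is a small generator that decodes stdin bytes itself when the file list is empty. The name `decoded_lines` and its `binary_stdin` parameter are hypothetical, added so the stdin path can be exercised without a real pipe.

```python
import io
import sys


def decoded_lines(paths, binary_stdin=None, encoding="utf-8", errors="replace"):
    """Yield decoded lines from `paths`, or from stdin when `paths` is empty."""
    if not paths:
        if binary_stdin is None:
            # Python 3 exposes raw bytes as sys.stdin.buffer;
            # on Python 2, sys.stdin already yields byte strings.
            binary_stdin = getattr(sys.stdin, "buffer", sys.stdin)
        for raw in binary_stdin:
            yield raw.decode(encoding, errors)
        return
    for path in paths:
        with io.open(path, encoding=encoding, errors=errors) as f:
            for line in f:
                yield line


# Simulate piped input containing one invalid byte.
fake_stdin = io.BytesIO(b"ok\nbad\xffbyte\n")
piped = list(decoded_lines([], binary_stdin=fake_stdin))
```

Splitting the byte stream on newlines before decoding is safe for UTF-8, because the newline byte never occurs inside a multi-byte sequence; other encodings may need a different approach.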
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2015-12-03 22:26 +0000 |
| Message-ID | <mailman.187.1449181591.14615.python-list@python.org> |
| In reply to | #99962 |
On 3 Dec 2015 16:50, "Terry Reedy" <tjreedy@udel.edu> wrote:
>
> On 12/3/2015 10:18 AM, Adam Funk wrote:
>>
>> On 2015-12-03, Adam Funk wrote:
>>
>>> I'm having trouble with some input files that are almost all proper
>>> UTF-8 but with a couple of troublesome characters mixed in, which I'd
>>> like to ignore instead of throwing ValueError. I've found the
>>> openhook for the encoding
>>>
>>> for line in fileinput.input(options.files,
>>>         openhook=fileinput.hook_encoded("utf-8")):
>>>     do_stuff(line)
>>>
>>> which the documentation describes as "a hook which opens each file
>>> with codecs.open(), using the given encoding to read the file", but
>>> I'd like codecs.open() to also have the errors='ignore' or
>>> errors='replace' effect. Is it possible to do this?
>>
>>
>> I forgot to mention: this is for Python 2.7.3 & 2.7.10 (on different
>> machines).
>
>
> fileinput is an ancient module that predates iterators (and generators)
> and context managers. Since by 2.7 open files are both context managers and
> line iterators, you can easily write your own multi-file line iteration
> that does exactly what you want. At minimum:
>
> for file in files:
>     with codecs.open(file, errors='ignore') as f
>         # did not look up signature,
>         for line in f:
>             do_stuff(line)
The above is fine but...
> To make this reusable, wrap in 'def filelines(files):' and replace
> 'do_stuff(line)' with 'yield line'.
That doesn't work entirely correctly as you end up yielding from inside a
with statement. If the user of your generator function doesn't fully
consume the generator then whichever file is currently open is not
guaranteed to be closed.
--
Oscar
| From | Serhiy Storchaka <storchaka@gmail.com> |
|---|---|
| Date | 2015-12-04 10:34 +0200 |
| Message-ID | <mailman.191.1449218096.14615.python-list@python.org> |
| In reply to | #99962 |
On 04.12.15 00:26, Oscar Benjamin wrote:
> On 3 Dec 2015 16:50, "Terry Reedy" <tjreedy@udel.edu> wrote:
>> fileinput is an ancient module that predates iterators (and generators)
>> and context managers. Since by 2.7 open files are both context managers and
>> line iterators, you can easily write your own multi-file line iteration
>> that does exactly what you want. At minimum:
>>
>> for file in files:
>>     with codecs.open(file, errors='ignore') as f
>>         # did not look up signature,
>>         for line in f:
>>             do_stuff(line)
>
> The above is fine but...
>
>> To make this reusable, wrap in 'def filelines(files):' and replace
>> 'do_stuff(line)' with 'yield line'.
>
> That doesn't work entirely correctly as you end up yielding from inside a
> with statement. If the user of your generator function doesn't fully
> consume the generator then whichever file is currently open is not
> guaranteed to be closed.
You can convert the generator to context manager and use it in the with
statement to guarantee closing.
with contextlib.closing(filelines(files)) as f:
    for line in f:
        ...
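A small sketch of why `contextlib.closing` helps here, assuming a `filelines` generator along the lines Terry described. The `opened` list is hypothetical bookkeeping added only so the file's state can be inspected from outside the generator.

```python
import contextlib
import io
import os
import tempfile

opened = []  # hypothetical bookkeeping: file objects the generator has opened


def filelines(paths, encoding="utf-8", errors="ignore"):
    """Yield decoded lines from each path in turn."""
    for path in paths:
        with io.open(path, encoding=encoding, errors=errors) as f:
            opened.append(f)
            for line in f:
                yield line  # the generator suspends here with the file open


fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"one\ntwo\nthree\n")

with contextlib.closing(filelines([sample_path])) as lines:
    first = next(lines)                         # consume only one line
    open_while_suspended = not opened[0].closed

# Leaving the block calls lines.close(), which raises GeneratorExit at the
# yield and lets the inner `with` statement close the file.
closed_after_block = opened[0].closed
os.remove(sample_path)
```

Without the outer `with`, the file would stay open until the generator is exhausted, closed explicitly, or garbage-collected, and the timing of the last case is an implementation detail.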
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2015-12-04 09:00 +0000 |
| Message-ID | <mailman.193.1449226525.14615.python-list@python.org> |
| In reply to | #99962 |
On 4 Dec 2015 08:36, "Serhiy Storchaka" <storchaka@gmail.com> wrote:
>
> On 04.12.15 00:26, Oscar Benjamin wrote:
>>
>> On 3 Dec 2015 16:50, "Terry Reedy" <tjreedy@udel.edu> wrote:
>>>
>>> fileinput is an ancient module that predates iterators (and generators)
>>> and context managers. Since by 2.7 open files are both context managers and
>>> line iterators, you can easily write your own multi-file line iteration
>>> that does exactly what you want. At minimum:
>>>
>>> for file in files:
>>>     with codecs.open(file, errors='ignore') as f
>>>         # did not look up signature,
>>>         for line in f:
>>>             do_stuff(line)
>>
>> The above is fine but...
>>
>>> To make this reusable, wrap in 'def filelines(files):' and replace
>>> 'do_stuff(line)' with 'yield line'.
>>
>> That doesn't work entirely correctly as you end up yielding from inside a
>> with statement. If the user of your generator function doesn't fully
>> consume the generator then whichever file is currently open is not
>> guaranteed to be closed.
>
> You can convert the generator to context manager and use it in the with
> statement to guarantee closing.
>
> with contextlib.closing(filelines(files)) as f:
>     for line in f:
>         ...
Or you can use fileinput which is designed to be exactly this kind of
context manager and to be used in this way. Although fileinput is slightly
awkward in defaulting to reading stdin.
--
Oscar
| From | Adam Funk <a24061@ducksburg.com> |
|---|---|
| Date | 2015-12-07 14:46 +0000 |
| Message-ID | <2oigjcx5nh.ln2@news.ducksburg.com> |
| In reply to | #99999 |
On 2015-12-04, Oscar Benjamin wrote:
> Or you can use fileinput which is designed to be exactly this kind of
> context manager and to be used in this way. Although fileinput is slightly
> awkward in defaulting to reading stdin.
That default is what I specifically like about fileinput --- it's a
normal way for command-line tools to work:
$ sort file0 file1 file2 >sorted.txt
$ generate_junk | sort >sorted_junk.txt
--
$2.95!
PLATE O' SHRIMP
Luncheon Special
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2015-12-03 16:12 +0000 |
| Message-ID | <mailman.175.1449159185.14615.python-list@python.org> |
| In reply to | #99961 |
On 2015-12-03 15:12, Adam Funk wrote:
> I'm having trouble with some input files that are almost all proper
> UTF-8 but with a couple of troublesome characters mixed in, which I'd
> like to ignore instead of throwing ValueError. I've found the
> openhook for the encoding
>
> for line in fileinput.input(options.files, openhook=fileinput.hook_encoded("utf-8")):
>     do_stuff(line)
>
> which the documentation describes as "a hook which opens each file
> with codecs.open(), using the given encoding to read the file", but
> I'd like codecs.open() to also have the errors='ignore' or
> errors='replace' effect. Is it possible to do this?
>
It looks like it's not possible with the standard "hook_encoded", but
you could write your own alternative:
import codecs
import fileinput

def my_hook_encoded(encoding, errors):
    def opener(path, mode):
        return codecs.open(path, mode, encoding=encoding, errors=errors)
    return opener

for line in fileinput.input(options.files,
        openhook=my_hook_encoded("utf-8", "ignore")):
    do_stuff(line)
| From | Laura Creighton <lac@openend.se> |
|---|---|
| Date | 2015-12-03 17:46 +0100 |
| Message-ID | <mailman.179.1449161213.14615.python-list@python.org> |
| In reply to | #99961 |
In a message of Thu, 03 Dec 2015 15:12:15 +0000, Adam Funk writes:
>I'm having trouble with some input files that are almost all proper
>UTF-8 but with a couple of troublesome characters mixed in, which I'd
>like to ignore instead of throwing ValueError. I've found the
>openhook for the encoding
>
>for line in fileinput.input(options.files, openhook=fileinput.hook_encoded("utf-8")):
>    do_stuff(line)
>
>which the documentation describes as "a hook which opens each file
>with codecs.open(), using the given encoding to read the file", but
>I'd like codecs.open() to also have the errors='ignore' or
>errors='replace' effect. Is it possible to do this?
>
>Thanks.
This should be both easy to add, and useful, and I happen to know that
fileinput is being hacked on by Serhiy Storchaka right now, who agrees
that this would be easy. So, with his approval, I stuck this into the
tracker. http://bugs.python.org/issue25788
Future Pythons may not have the problem.
Laura
| From | Adam Funk <a24061@ducksburg.com> |
|---|---|
| Date | 2015-12-03 19:17 +0000 |
| Message-ID | <v3h6jcxtas.ln2@news.ducksburg.com> |
| In reply to | #99972 |
On 2015-12-03, Laura Creighton wrote:
> In a message of Thu, 03 Dec 2015 15:12:15 +0000, Adam Funk writes:
>>I'm having trouble with some input files that are almost all proper
>>UTF-8 but with a couple of troublesome characters mixed in, which I'd
>>like to ignore instead of throwing ValueError. I've found the
>>openhook for the encoding
>>
>>for line in fileinput.input(options.files, openhook=fileinput.hook_encoded("utf-8")):
>>    do_stuff(line)
>>
>>which the documentation describes as "a hook which opens each file
>>with codecs.open(), using the given encoding to read the file", but
>>I'd like codecs.open() to also have the errors='ignore' or
>>errors='replace' effect. Is it possible to do this?
>>
>>Thanks.
>
> This should be both easy to add, and useful, and I happen to know that
> fileinput is being hacked on by Serhiy Storchaka right now, who agrees
> that this would be easy. So, with his approval, I stuck this into the
> tracker. http://bugs.python.org/issue25788
>
> Future Pythons may not have the problem.
Good to know, thanks.
--
You cannot really appreciate Dilbert unless you've read it in the
original Klingon. --- Klingon Programmer's Guide
| From | Laura Creighton <lac@openend.se> |
|---|---|
| Date | 2015-12-03 21:40 +0100 |
| Message-ID | <mailman.185.1449175233.14615.python-list@python.org> |
| In reply to | #99982 |
In a message of Thu, 03 Dec 2015 19:17:51 +0000, Adam Funk writes:
>On 2015-12-03, Laura Creighton wrote:
>
>> In a message of Thu, 03 Dec 2015 15:12:15 +0000, Adam Funk writes:
>>>I'm having trouble with some input files that are almost all proper
>>>UTF-8 but with a couple of troublesome characters mixed in, which I'd
>>>like to ignore instead of throwing ValueError. I've found the
>>>openhook for the encoding
>>>
>>>for line in fileinput.input(options.files, openhook=fileinput.hook_encoded("utf-8")):
>>>    do_stuff(line)
>>>
>>>which the documentation describes as "a hook which opens each file
>>>with codecs.open(), using the given encoding to read the file", but
>>>I'd like codecs.open() to also have the errors='ignore' or
>>>errors='replace' effect. Is it possible to do this?
>>>
>>>Thanks.
>>
>> This should be both easy to add, and useful, and I happen to know that
>> fileinput is being hacked on by Serhiy Storchaka right now, who agrees
>> that this would be easy. So, with his approval, I stuck this into the
>> tracker. http://bugs.python.org/issue25788
>>
>> Future Pythons may not have the problem.
>
>Good to know, thanks.
Well, we have moved right along to 'You write the patch, Laura' so I
can pretty much guarantee that future Pythons won't have the problem. :)
Laura
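For readers on later Python 3 releases, a hedged sketch of what the built-in support looks like. To the best of my knowledge, `fileinput.hook_encoded()` accepts an optional `errors` argument from Python 3.6, and `fileinput.input()` accepts `encoding` and `errors` keywords from Python 3.10. None of this applies to the Python 2.7 installations discussed in the thread, and the sample file is illustrative.

```python
import fileinput
import os
import sys
import tempfile

# Sample file with one invalid UTF-8 byte (0xFF).
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"bad\xffbyte\n")

# Python 3.6+: hook_encoded() takes an errors argument.
hook = fileinput.hook_encoded("utf-8", errors="replace")
with fileinput.input([sample_path], openhook=hook) as reader:
    via_hook = list(reader)

# Python 3.10+: encoding/errors can be passed to input() directly.
if sys.version_info >= (3, 10):
    with fileinput.input([sample_path], encoding="utf-8",
                         errors="replace") as reader:
        via_kwargs = list(reader)
else:
    via_kwargs = via_hook  # fall back so the comparison below still holds

os.remove(sample_path)
```

Both routes should yield the same lines, with the invalid byte replaced by U+FFFD.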