
getting fileinput to do errors='ignore' or 'replace'?

Started by	Adam Funk <a24061@ducksburg.com>
First post	2015-12-03 15:12 +0000
Last post	2015-12-07 14:46 +0000
Articles	14 — 7 participants


  getting fileinput to do errors='ignore' or 'replace'? Adam Funk <a24061@ducksburg.com> - 2015-12-03 15:12 +0000
    Re: getting fileinput to do errors='ignore' or 'replace'? Adam Funk <a24061@ducksburg.com> - 2015-12-03 15:18 +0000
      Re: getting fileinput to do errors='ignore' or 'replace'? Peter Otten <__peter__@web.de> - 2015-12-03 17:11 +0100
        Re: getting fileinput to do errors='ignore' or 'replace'? Adam Funk <a24061@ducksburg.com> - 2015-12-03 19:17 +0000
      Re: getting fileinput to do errors='ignore' or 'replace'? Terry Reedy <tjreedy@udel.edu> - 2015-12-03 11:48 -0500
        Re: getting fileinput to do errors='ignore' or 'replace'? Adam Funk <a24061@ducksburg.com> - 2015-12-03 19:21 +0000
      Re: getting fileinput to do errors='ignore' or 'replace'? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2015-12-03 22:26 +0000
      Re: getting fileinput to do errors='ignore' or 'replace'? Serhiy Storchaka <storchaka@gmail.com> - 2015-12-04 10:34 +0200
      Re: getting fileinput to do errors='ignore' or 'replace'? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2015-12-04 09:00 +0000
        Re: getting fileinput to do errors='ignore' or 'replace'? Adam Funk <a24061@ducksburg.com> - 2015-12-07 14:46 +0000
    Re: getting fileinput to do errors='ignore' or 'replace'? MRAB <python@mrabarnett.plus.com> - 2015-12-03 16:12 +0000
    Re: getting fileinput to do errors='ignore' or 'replace'? Laura Creighton <lac@openend.se> - 2015-12-03 17:46 +0100
      Re: getting fileinput to do errors='ignore' or 'replace'? Adam Funk <a24061@ducksburg.com> - 2015-12-03 19:17 +0000
        Re: getting fileinput to do errors='ignore' or 'replace'? Laura Creighton <lac@openend.se> - 2015-12-03 21:40 +0100

#99961 — getting fileinput to do errors='ignore' or 'replace'?

From	Adam Funk <a24061@ducksburg.com>
Date	2015-12-03 15:12 +0000
Subject	getting fileinput to do errors='ignore' or 'replace'?
Message-ID	<fn26jcxltl.ln2@news.ducksburg.com>

I'm having trouble with some input files that are almost all proper
UTF-8 but with a couple of troublesome characters mixed in, which I'd
like to ignore instead of throwing ValueError.  I've found the
openhook for the encoding

for line in fileinput.input(options.files, openhook=fileinput.hook_encoded("utf-8")):
    do_stuff(line)

which the documentation describes as "a hook which opens each file
with codecs.open(), using the given encoding to read the file", but
I'd like codecs.open() to also have the errors='ignore' or
errors='replace' effect.  Is it possible to do this?

Thanks.


-- 
Why is it drug addicts and computer aficionados are both
called users?                          --- Clifford Stoll


#99962

From	Adam Funk <a24061@ducksburg.com>
Date	2015-12-03 15:18 +0000
Message-ID	<8336jcxi2m.ln2@news.ducksburg.com>
In reply to	#99961

On 2015-12-03, Adam Funk wrote:

> I'm having trouble with some input files that are almost all proper
> UTF-8 but with a couple of troublesome characters mixed in, which I'd
> like to ignore instead of throwing ValueError.  I've found the
> openhook for the encoding
>
> for line in fileinput.input(options.files, openhook=fileinput.hook_encoded("utf-8")):
>     do_stuff(line)
>
> which the documentation describes as "a hook which opens each file
> with codecs.open(), using the given encoding to read the file", but
> I'd like codecs.open() to also have the errors='ignore' or
> errors='replace' effect.  Is it possible to do this?

I forgot to mention: this is for Python 2.7.3 & 2.7.10 (on different
machines).
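
For reference, the three error handlers behave differently on a byte string with one stray byte. This is a minimal sketch with made-up sample bytes (the variable names are illustrative only); it behaves the same way on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative bytes: valid UTF-8 for "café", then one stray 0xFF byte.
raw = b"caf\xc3\xa9 \xff ok"

try:
    raw.decode("utf-8")  # the default handler is errors="strict"
    strict_failed = False
except UnicodeDecodeError as exc:
    # UnicodeDecodeError is a subclass of ValueError, which is why the
    # failure surfaces as a ValueError.
    strict_failed = isinstance(exc, ValueError)

ignored = raw.decode("utf-8", "ignore")    # the bad byte is dropped
replaced = raw.decode("utf-8", "replace")  # the bad byte becomes U+FFFD
```

With 'ignore' the undecodable byte disappears silently; with 'replace' it leaves a visible U+FFFD marker, which is usually easier to audit afterwards.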


-- 
...the reason why so many professional artists drink a lot is not
necessarily very much to do with the artistic temperament, etc.  It is
simply that they can afford to, because they can normally take a large
part of a day off to deal with the ravages.        --- Amis _On Drink_


#99966

From	Peter Otten <__peter__@web.de>
Date	2015-12-03 17:11 +0100
Message-ID	<mailman.174.1449159118.14615.python-list@python.org>
In reply to	#99962

Adam Funk wrote:

> On 2015-12-03, Adam Funk wrote:
> 
>> I'm having trouble with some input files that are almost all proper
>> UTF-8 but with a couple of troublesome characters mixed in, which I'd
>> like to ignore instead of throwing ValueError.  I've found the
>> openhook for the encoding
>>
>> for line in fileinput.input(options.files,
>> openhook=fileinput.hook_encoded("utf-8")):
>>     do_stuff(line)
>>
>> which the documentation describes as "a hook which opens each file
>> with codecs.open(), using the given encoding to read the file", but
>> I'd like codecs.open() to also have the errors='ignore' or
>> errors='replace' effect.  Is it possible to do this?
> 
> I forgot to mention: this is for Python 2.7.3 & 2.7.10 (on different
> machines).

Have a look at the source of fileinput.hook_encoded:

def hook_encoded(encoding):
    import io
    def openhook(filename, mode):
        mode = mode.replace('U', '').replace('b', '') or 'r'
        return io.open(filename, mode, encoding=encoding, newline='')
    return openhook

You can use it as a template to write your own factory function:

def my_hook_encoded(encoding, errors=None):
    import io
    def openhook(filename, mode):
        mode = mode.replace('U', '').replace('b', '') or 'r'
        return io.open(
            filename, mode, 
            encoding=encoding, newline='', 
            errors=errors)
    return openhook

for line in fileinput.input(
        options.files,
        openhook=my_hook_encoded("utf-8", errors="ignore")):
    do_stuff(line)

Another option is to create the function on the fly:

import functools
import io

for line in fileinput.input(
        options.files,
        openhook=functools.partial(
            io.open, encoding="utf-8", errors="replace")):
    do_stuff(line)

(codecs.open() instead of io.open() should also work)
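
A quick way to check the custom hook is to run it against a throwaway file containing one bad byte. The following is a sketch only: the temporary file and its contents are invented for the test, and newline='' is left out so that the default newline handling applies.

```python
import fileinput
import io
import os
import tempfile

def my_hook_encoded(encoding, errors=None):
    # Same factory as above: returns an opener that fileinput calls
    # with (filename, mode) for every named file.
    def openhook(filename, mode):
        mode = mode.replace('U', '').replace('b', '') or 'r'
        return io.open(filename, mode, encoding=encoding, errors=errors)
    return openhook

# Hypothetical sample file: the second line contains a stray 0xFF byte.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"good line\nbad \xff byte\n")

try:
    lines = list(fileinput.input(
        [path], openhook=my_hook_encoded("utf-8", errors="replace")))
finally:
    fileinput.close()  # reset the module-level fileinput state
    os.remove(path)
```

Here lines holds both lines as text, with U+FFFD standing in for the undecodable byte.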


#99981

From	Adam Funk <a24061@ducksburg.com>
Date	2015-12-03 19:17 +0000
Message-ID	<c3h6jcxtas.ln2@news.ducksburg.com>
In reply to	#99966

On 2015-12-03, Peter Otten wrote:

> def my_hook_encoded(encoding, errors=None):
>     import io
>     def openhook(filename, mode):
>         mode = mode.replace('U', '').replace('b', '') or 'r'
>         return io.open(
>             filename, mode, 
>             encoding=encoding, newline='', 
>             errors=errors)
>     return openhook
>
> for line in fileinput.input(
>         options.files,
>         openhook=my_hook_encoded("utf-8", errors="ignore")):
>     do_stuff(line)

Perfect, thanks!


> (codecs.open() instead of io.open() should also work)

OK.


-- 
The internet is quite simply a glorious place. Where else can you find
bootlegged music and films, questionable women, deep seated xenophobia
and amusing cats all together in the same place?       --- Tom Belshaw


#99973

From	Terry Reedy <tjreedy@udel.edu>
Date	2015-12-03 11:48 -0500
Message-ID	<mailman.180.1449161351.14615.python-list@python.org>
In reply to	#99962

On 12/3/2015 10:18 AM, Adam Funk wrote:
> On 2015-12-03, Adam Funk wrote:
>
>> I'm having trouble with some input files that are almost all proper
>> UTF-8 but with a couple of troublesome characters mixed in, which I'd
>> like to ignore instead of throwing ValueError.  I've found the
>> openhook for the encoding
>>
>> for line in fileinput.input(options.files, openhook=fileinput.hook_encoded("utf-8")):
>>      do_stuff(line)
>>
>> which the documentation describes as "a hook which opens each file
>> with codecs.open(), using the given encoding to read the file", but
>> I'd like codecs.open() to also have the errors='ignore' or
>> errors='replace' effect.  Is it possible to do this?
>
> I forgot to mention: this is for Python 2.7.3 & 2.7.10 (on different
> machines).

fileinput is an ancient module that predates iterators (and generators) 
and context managers. Since by 2.7 open files are both context managers 
and line iterators, you can easily write your own multi-file line 
iteration that does exactly what you want.  At minimum:

for file in files:
     # encoding is required for errors= to take effect
     with codecs.open(file, encoding='utf-8', errors='ignore') as f:
         for line in f:
             do_stuff(line)

To make this reusable, wrap in 'def filelines(files):' and replace 
'do_stuff(line)' with 'yield line'.

-- 
Terry Jan Reedy
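
The reusable form described above might look like the sketch below. It swaps in io.open() for codecs.open() (an assumption made here, not part of the suggestion), and the encoding and errors defaults, the temporary file and its contents are all illustrative.

```python
import io
import os
import tempfile

def filelines(files, encoding="utf-8", errors="ignore"):
    """Yield decoded lines from each named file in turn."""
    for name in files:
        # The with-block closes each file once its lines are exhausted.
        with io.open(name, encoding=encoding, errors=errors) as f:
            for line in f:
                yield line

# Hypothetical sample data: one line containing a stray 0xFF byte.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"a\xffb\n")

try:
    result = list(filelines([path]))
finally:
    os.remove(path)
```

Unlike fileinput, this version does not fall back to stdin when the file list is empty.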


#99983

From	Adam Funk <a24061@ducksburg.com>
Date	2015-12-03 19:21 +0000
Message-ID	<5bh6jcxtas.ln2@news.ducksburg.com>
In reply to	#99973

On 2015-12-03, Terry Reedy wrote:

> fileinput is an ancient module that predates iterators (and generators) 
> and context managers. Since by 2.7 open files are both context managers 
> and line iterators, you can easily write your own multi-file line 
> iteration that does exactly what you want.  At minimum:
>
> for file in files:
>      # encoding is required for errors= to take effect
>      with codecs.open(file, encoding='utf-8', errors='ignore') as f:
>          for line in f:
>              do_stuff(line)
>
> To make this reusable, wrap in 'def filelines(files):' and replace 
> 'do_stuff(line)' with 'yield line'.

I like fileinput because if the file list is empty, it reads from
stdin instead (so I can pipe something else's output into it).
Unfortunately, the fix I got elsewhere in this thread doesn't seem to
work for that!
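
One way to keep the stdin fallback is to handle it by hand, presumably necessary because fileinput reads sys.stdin directly rather than passing it through the openhook. The sketch below is not a drop-in replacement for fileinput: decoded_lines and its stdin parameter are hypothetical names, and stdin lines are decoded one at a time without universal-newline translation.

```python
import io
import sys

def decoded_lines(files, encoding="utf-8", errors="replace", stdin=None):
    """Yield text lines from files; read stdin if files is empty or '-'."""
    for name in files or ["-"]:
        if name == "-":
            # Python 3 exposes the raw bytes as sys.stdin.buffer; on
            # Python 2 sys.stdin already yields byte strings.
            stream = stdin
            if stream is None:
                stream = getattr(sys.stdin, "buffer", sys.stdin)
            for raw in stream:
                yield raw.decode(encoding, errors)
        else:
            with io.open(name, encoding=encoding, errors=errors) as f:
                for line in f:
                    yield line

# The stdin argument exists only so the fallback can be exercised
# without a real pipe.
fake_stdin = io.BytesIO(b"ok\nbad \xff\n")
piped = list(decoded_lines([], stdin=fake_stdin))
```

Splitting the byte stream on newlines before decoding is safe for UTF-8, because the byte 0x0A never occurs inside a multi-byte sequence.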


-- 
Science is what we understand well enough to explain to a computer.  
Art is everything else we do.                      --- Donald Knuth


#99987

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2015-12-03 22:26 +0000
Message-ID	<mailman.187.1449181591.14615.python-list@python.org>
In reply to	#99962

On 3 Dec 2015 16:50, "Terry Reedy" <tjreedy@udel.edu> wrote:
>
> On 12/3/2015 10:18 AM, Adam Funk wrote:
>>
>> On 2015-12-03, Adam Funk wrote:
>>
>>> I'm having trouble with some input files that are almost all proper
>>> UTF-8 but with a couple of troublesome characters mixed in, which I'd
>>> like to ignore instead of throwing ValueError.  I've found the
>>> openhook for the encoding
>>>
>>> for line in fileinput.input(options.files,
>>> openhook=fileinput.hook_encoded("utf-8")):
>>>      do_stuff(line)
>>>
>>> which the documentation describes as "a hook which opens each file
>>> with codecs.open(), using the given encoding to read the file", but
>>> I'd like codecs.open() to also have the errors='ignore' or
>>> errors='replace' effect.  Is it possible to do this?
>>
>>
>> I forgot to mention: this is for Python 2.7.3 & 2.7.10 (on different
>> machines).
>
>
> fileinput is an ancient module that predates iterators (and generators)
> and context managers. Since by 2.7 open files are both context managers and
> line iterators, you can easily write your own multi-file line iteration
> that does exactly what you want.  At minimum:
>
> for file in files:
>     # encoding is required for errors= to take effect
>     with codecs.open(file, encoding='utf-8', errors='ignore') as f:
>         for line in f:
>             do_stuff(line)

The above is fine but...

> To make this reusable, wrap in 'def filelines(files):' and replace
> 'do_stuff(line)' with 'yield line'.

That doesn't work entirely correctly as you end up yielding from inside a
with statement. If the user of your generator function doesn't fully
consume the generator then whichever file is currently open is not
guaranteed to be closed.

--
Oscar


#99997

From	Serhiy Storchaka <storchaka@gmail.com>
Date	2015-12-04 10:34 +0200
Message-ID	<mailman.191.1449218096.14615.python-list@python.org>
In reply to	#99962

On 04.12.15 00:26, Oscar Benjamin wrote:
> On 3 Dec 2015 16:50, "Terry Reedy" <tjreedy@udel.edu> wrote:
>> fileinput is an ancient module that predates iterators (and generators)
>> and context managers. Since by 2.7 open files are both context managers and
>> line iterators, you can easily write your own multi-file line iteration
>> that does exactly what you want.  At minimum:
>>
>> for file in files:
>>      # encoding is required for errors= to take effect
>>      with codecs.open(file, encoding='utf-8', errors='ignore') as f:
>>          for line in f:
>>              do_stuff(line)
>
> The above is fine but...
>
>> To make this reusable, wrap in 'def filelines(files):' and replace
>> 'do_stuff(line)' with 'yield line'.
>
> That doesn't work entirely correctly as you end up yielding from inside a
> with statement. If the user of your generator function doesn't fully
> consume the generator then whichever file is currently open is not
> guaranteed to be closed.

You can convert the generator to context manager and use it in the with 
statement to guarantee closing.

with contextlib.closing(filelines(files)) as f:
     for line in f:
         ...
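
To see why this works: closing() calls the generator's close() method on exit, which raises GeneratorExit at the suspended yield and lets the with-block inside the generator clean up. The sketch below uses a stand-in Resource class (hypothetical, in place of a real file) so the effect can be observed directly.

```python
import contextlib

class Resource(object):
    """Stand-in for an open file; records whether it was closed."""
    closed = False

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.closed = True

def numbers(resource):
    # Yields from inside a with-block, like the filelines generator.
    with resource:
        for i in range(3):
            yield i

res = Resource()
with contextlib.closing(numbers(res)) as gen:
    first = next(gen)  # consume one item only, then leave early

# Leaving the outer with-block called gen.close(), so the inner
# with-block has already run Resource.__exit__ by this point.
closed_after = res.closed
```

Without closing(), the same cleanup only happens when the abandoned generator is garbage-collected, which is not guaranteed to be prompt on every implementation.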


#99999

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2015-12-04 09:00 +0000
Message-ID	<mailman.193.1449226525.14615.python-list@python.org>
In reply to	#99962

On 4 Dec 2015 08:36, "Serhiy Storchaka" <storchaka@gmail.com> wrote:
>
> On 04.12.15 00:26, Oscar Benjamin wrote:
>>
>> On 3 Dec 2015 16:50, "Terry Reedy" <tjreedy@udel.edu> wrote:
>>>
>>> fileinput is an ancient module that predates iterators (and generators)
>>> and context managers. Since by 2.7 open files are both context managers
>>> and line iterators, you can easily write your own multi-file line iteration
>>> that does exactly what you want.  At minimum:
>>>
>>> for file in files:
>>>      # encoding is required for errors= to take effect
>>>      with codecs.open(file, encoding='utf-8', errors='ignore') as f:
>>>          for line in f:
>>>              do_stuff(line)
>>
>>
>> The above is fine but...
>>
>>> To make this reusable, wrap in 'def filelines(files):' and replace
>>> 'do_stuff(line)' with 'yield line'.
>>
>> That doesn't work entirely correctly as you end up yielding from inside a
>> with statement. If the user of your generator function doesn't fully
>> consume the generator then whichever file is currently open is not
>> guaranteed to be closed.
>
>
> You can convert the generator to context manager and use it in the with
> statement to guarantee closing.
>
> with contextlib.closing(filelines(files)) as f:
>     for line in f:
>         ...

Or you can use fileinput which is designed to be exactly this kind of
context manager and to be used in this way. Although fileinput is slightly
awkward in defaulting to reading stdin.

--
Oscar


#100086

From	Adam Funk <a24061@ducksburg.com>
Date	2015-12-07 14:46 +0000
Message-ID	<2oigjcx5nh.ln2@news.ducksburg.com>
In reply to	#99999

On 2015-12-04, Oscar Benjamin wrote:

> Or you can use fileinput which is designed to be exactly this kind of
> context manager and to be used in this way. Although fileinput is slightly
> awkward in defaulting to reading stdin.

That default is what I specifically like about fileinput --- it's a
normal way for command-line tools to work:

$ sort file0 file1 file2 >sorted.txt
$ generate_junk | sort >sorted_junk.txt

-- 
      $2.95!
 PLATE O' SHRIMP
Luncheon Special


#99967

From	MRAB <python@mrabarnett.plus.com>
Date	2015-12-03 16:12 +0000
Message-ID	<mailman.175.1449159185.14615.python-list@python.org>
In reply to	#99961

On 2015-12-03 15:12, Adam Funk wrote:
> I'm having trouble with some input files that are almost all proper
> UTF-8 but with a couple of troublesome characters mixed in, which I'd
> like to ignore instead of throwing ValueError.  I've found the
> openhook for the encoding
>
> for line in fileinput.input(options.files, openhook=fileinput.hook_encoded("utf-8")):
>      do_stuff(line)
>
> which the documentation describes as "a hook which opens each file
> with codecs.open(), using the given encoding to read the file", but
> I'd like codecs.open() to also have the errors='ignore' or
> errors='replace' effect.  Is it possible to do this?
>
It looks like it's not possible with the standard "hook_encoded", but
you could write your own alternative:

import codecs
import fileinput

def my_hook_encoded(encoding, errors):

     def opener(path, mode):
         return codecs.open(path, mode, encoding=encoding, errors=errors)

     return opener

# my_hook_encoded is defined above, not in the fileinput module
for line in fileinput.input(options.files,
         openhook=my_hook_encoded("utf-8", "ignore")):
     do_stuff(line)


#99972

From	Laura Creighton <lac@openend.se>
Date	2015-12-03 17:46 +0100
Message-ID	<mailman.179.1449161213.14615.python-list@python.org>
In reply to	#99961

In a message of Thu, 03 Dec 2015 15:12:15 +0000, Adam Funk writes:
>I'm having trouble with some input files that are almost all proper
>UTF-8 but with a couple of troublesome characters mixed in, which I'd
>like to ignore instead of throwing ValueError.  I've found the
>openhook for the encoding
>
>for line in fileinput.input(options.files, openhook=fileinput.hook_encoded("utf-8")):
>    do_stuff(line)
>
>which the documentation describes as "a hook which opens each file
>with codecs.open(), using the given encoding to read the file", but
>I'd like codecs.open() to also have the errors='ignore' or
>errors='replace' effect.  Is it possible to do this?
>
>Thanks.

This should be both easy to add, and useful, and I happen to know that
fileinput is being hacked on by Serhiy Storchaka right now, who agrees
that this would be easy.  So, with his approval, I stuck this into the
tracker.  http://bugs.python.org/issue25788  

Future Pythons may not have the problem.

Laura


#99982

From	Adam Funk <a24061@ducksburg.com>
Date	2015-12-03 19:17 +0000
Message-ID	<v3h6jcxtas.ln2@news.ducksburg.com>
In reply to	#99972

On 2015-12-03, Laura Creighton wrote:

> In a message of Thu, 03 Dec 2015 15:12:15 +0000, Adam Funk writes:
>>I'm having trouble with some input files that are almost all proper
>>UTF-8 but with a couple of troublesome characters mixed in, which I'd
>>like to ignore instead of throwing ValueError.  I've found the
>>openhook for the encoding
>>
>>for line in fileinput.input(options.files, openhook=fileinput.hook_encoded("utf-8")):
>>    do_stuff(line)
>>
>>which the documentation describes as "a hook which opens each file
>>with codecs.open(), using the given encoding to read the file", but
>>I'd like codecs.open() to also have the errors='ignore' or
>>errors='replace' effect.  Is it possible to do this?
>>
>>Thanks.
>
> This should be both easy to add, and useful, and I happen to know that
> fileinput is being hacked on by Serhiy Storchaka right now, who agrees
> that this would be easy.  So, with his approval, I stuck this into the
> tracker.  http://bugs.python.org/issue25788  
>
> Future Pythons may not have the problem.

Good to know, thanks.


-- 
You cannot really appreciate Dilbert unless you've read it in the
original Klingon.                  --- Klingon Programmer's Guide


#99985

From	Laura Creighton <lac@openend.se>
Date	2015-12-03 21:40 +0100
Message-ID	<mailman.185.1449175233.14615.python-list@python.org>
In reply to	#99982

In a message of Thu, 03 Dec 2015 19:17:51 +0000, Adam Funk writes:
>On 2015-12-03, Laura Creighton wrote:
>
>> In a message of Thu, 03 Dec 2015 15:12:15 +0000, Adam Funk writes:
>>>I'm having trouble with some input files that are almost all proper
>>>UTF-8 but with a couple of troublesome characters mixed in, which I'd
>>>like to ignore instead of throwing ValueError.  I've found the
>>>openhook for the encoding
>>>
>>>for line in fileinput.input(options.files, openhook=fileinput.hook_encoded("utf-8")):
>>>    do_stuff(line)
>>>
>>>which the documentation describes as "a hook which opens each file
>>>with codecs.open(), using the given encoding to read the file", but
>>>I'd like codecs.open() to also have the errors='ignore' or
>>>errors='replace' effect.  Is it possible to do this?
>>>
>>>Thanks.
>>
>> This should be both easy to add, and useful, and I happen to know that
>> fileinput is being hacked on by Serhiy Storchaka right now, who agrees
>> that this would be easy.  So, with his approval, I stuck this into the
>> tracker.  http://bugs.python.org/issue25788  
>>
>> Future Pythons may not have the problem.
>
>Good to know, thanks.

Well, we have moved right along to 'You write the patch, Laura' so I
can pretty much guarantee that future Pythons won't have the problem. :)

Laura
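
For readers on newer interpreters: the fileinput documentation lists an errors parameter on hook_encoded() as of Python 3.6, which appears to be the outcome of the tracker issue above. The sketch below assumes Python 3.6 or later (it does not apply to the 2.7 versions discussed in this thread); the temporary file and its contents are invented for the demonstration.

```python
import fileinput
import os
import tempfile

# Hypothetical sample file with one stray 0xFF byte.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"bad \xff byte\n")

try:
    # Python 3.6+: hook_encoded accepts errors alongside the encoding.
    hook = fileinput.hook_encoded("utf-8", errors="ignore")
    with fileinput.input([path], openhook=hook) as f:
        lines = list(f)
finally:
    os.remove(path)
```

With errors="ignore" the undecodable byte is dropped, so the line comes back with the two surrounding spaces adjacent.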

