From: Terry Reedy
Newsgroups: comp.lang.python
Subject: Re: getting fileinput to do errors='ignore' or 'replace'?
Date: Thu, 3 Dec 2015 11:48:13 -0500
References: <8336jcxi2m.ln2@news.ducksburg.com>

On 12/3/2015 10:18 AM, Adam Funk wrote:
> On 2015-12-03, Adam Funk wrote:
>
>> I'm having trouble with some input files that are almost all proper
>> UTF-8 but with a couple of troublesome characters mixed in, which I'd
>> like to ignore instead of throwing ValueError. I've found the
>> openhook for the encoding
>>
>> for line in fileinput.input(options.files, openhook=fileinput.hook_encoded("utf-8")):
>>     do_stuff(line)
>>
>> which the documentation describes as "a hook which opens each file
>> with codecs.open(), using the given encoding to read the file", but
>> I'd like codecs.open() to also have the errors='ignore' or
>> errors='replace' effect. Is it possible to do this?
>
> I forgot to mention: this is for Python 2.7.3 & 2.7.10 (on different
> machines).

fileinput is an ancient module that predates iterators (and generators)
and context managers. Since by 2.7 open files are both context managers
and line iterators, you can easily write your own multi-file line
iteration that does exactly what you want. At minimum:

import codecs

for file in files:
    # codecs.open(filename, mode, encoding, errors): errors= only takes
    # effect when an encoding is given, so pass both.
    with codecs.open(file, encoding='utf-8', errors='ignore') as f:
        for line in f:
            do_stuff(line)

To make this reusable, wrap in 'def filelines(files):' and replace
'do_stuff(line)' with 'yield line'.

-- 
Terry Jan Reedy
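
The reusable form suggested above could look roughly like the sketch below. The name filelines comes from the post; the encoding and errors parameters and their defaults are assumptions added for illustration. It uses io.open rather than codecs.open: io.open takes the same encoding= and errors= arguments on Python 2.7, and on Python 3 it is the same function as the built-in open.

```python
import io


def filelines(files, encoding="utf-8", errors="replace"):
    """Yield decoded lines from each file in turn, tolerating bad bytes.

    The encoding/errors parameters are illustrative additions, not part
    of the original suggestion.
    """
    for name in files:
        # errors="ignore" drops undecodable bytes; errors="replace"
        # substitutes U+FFFD for them instead of raising an exception.
        with io.open(name, encoding=encoding, errors=errors) as f:
            for line in f:
                yield line
```

It would then be used in place of the fileinput loop, for example:

    for line in filelines(options.files, errors="ignore"):
        do_stuff(line)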