Path: csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail
From: Peter Otten <__peter__@web.de>
Newsgroups: comp.lang.python
Subject: Re: non printable (moving away from Perl)
Date: Fri, 11 Mar 2016 16:22:16 +0100
Organization: None
Lines: 62
Message-ID: <mailman.24.1457709748.26429.python-list@python.org>
References: <nbt27u$fe7$1@gioia.aioe.org> <mailman.17.1457698399.26429.python-list@python.org> <nbukcd$gs2$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 7Bit
User-Agent: KNode/4.13.3
Precedence: list
Xref: csiph.com comp.lang.python:104621

Fillmore wrote:

> On 03/11/2016 07:13 AM, Wolfgang Maier wrote:
>> One lesson for Perl regex users is that in Python many things can be
>> solved without regexes. How about defining:
>>
>> printable = {chr(n) for n in range(32, 127)}
>>
>> then using:
>>
>> if (set(my_string) - set(printable)):
>>      break
> 
> seems computationally heavy. I have a file with about 70k lines, of which
> only 20 contain "funny" chars.
> 
> Any idea on how I can create a script that compares Perl speed vs. Python
> speed in performing the cleaning operation?

Try 

for line in ...:
    if has_nonprint(line):
        continue
    ...

with the has_nonprint() function as defined below:

$ cat isprint.py
import unicodedata


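class Lookup(dict):
    # Translation table for str.translate(), filled in lazily: the first
    # lookup of a code point calls __missing__() and caches the result.
    def __missing__(self, n):
        c = chr(n)
        cat = unicodedata.category(c)
        if cat in {'Cs', 'Cn', 'Zl', 'Cc', 'Zp'}:
            # non-printable category: keep the character
            self[n] = c
            return c
        else:
            # printable: map to None so that translate() deletes it
            self[n] = None
            return None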


lookup = Lookup()
lookup[10] = None # allow newline

def has_nonprint(s):
    # translate() removes every printable character; whatever is left
    # over must be non-printable
    return bool(s.translate(lookup))

$ python3 -i isprint.py
>>> has_nonprint("foo")
False
>>> has_nonprint("foo\n")
False
>>> has_nonprint("foo\t")
True
>>> has_nonprint("\0foo")
True