
non printable (moving away from Perl)

Started by	Fillmore <fillmore_remove@hotmail.com>
First post	2016-03-10 19:07 -0500
Last post	2016-03-12 06:52 +1100
Articles	14 — 8 participants


  non printable (moving away from Perl) Fillmore <fillmore_remove@hotmail.com> - 2016-03-10 19:07 -0500
    Re: non printable (moving away from Perl) Ian Kelly <ian.g.kelly@gmail.com> - 2016-03-10 17:25 -0700
    Re: non printable (moving away from Perl) Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-03-11 01:30 +0000
    Re: non printable (moving away from Perl) Ian Kelly <ian.g.kelly@gmail.com> - 2016-03-10 20:52 -0700
    Re: non printable (moving away from Perl) Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> - 2016-03-11 13:13 +0100
      Re: non printable (moving away from Perl) Fillmore <fillmore_remove@hotmail.com> - 2016-03-11 09:23 -0500
        Re: non printable (moving away from Perl) Peter Otten <__peter__@web.de> - 2016-03-11 16:22 +0100
        Re: non printable (moving away from Perl) Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> - 2016-03-11 17:34 +0100
        Re: non printable (moving away from Perl) Ian Kelly <ian.g.kelly@gmail.com> - 2016-03-11 10:08 -0700
    Re: non printable (moving away from Perl) Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> - 2016-03-11 13:17 +0100
      Re: non printable (moving away from Perl) Marko Rauhamaa <marko@pacujo.net> - 2016-03-11 14:47 +0200
    Re: non printable (moving away from Perl) MRAB <python@mrabarnett.plus.com> - 2016-03-11 19:23 +0000
      Re: non printable (moving away from Perl) Fillmore <fillmore_remove@hotmail.com> - 2016-03-11 14:36 -0500
        Re: non printable (moving away from Perl) Ben Finney <ben+python@benfinney.id.au> - 2016-03-12 06:52 +1100

#104564 — non printable (moving away from Perl)

From	Fillmore <fillmore_remove@hotmail.com>
Date	2016-03-10 19:07 -0500
Subject	non printable (moving away from Perl)
Message-ID	<nbt27u$fe7$1@gioia.aioe.org>

Here's another handy Perl regex which I am not sure how to translate to 
Python.

I use it to avoid processing lines that contain funny chars...

if ($string =~ /[^[:print:]]/) {next OUTER;}

:)


#104566

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2016-03-10 17:25 -0700
Message-ID	<mailman.162.1457655958.15725.python-list@python.org>
In reply to	#104564

On Mar 10, 2016 5:15 PM, "Fillmore" <fillmore_remove@hotmail.com> wrote:
>
>
> Here's another handy Perl regex which I am not sure how to translate to
> Python.
>
> I use it to avoid processing lines that contain funny chars...
>
> if ($string =~ /[^[:print:]]/) {next OUTER;}

Python's re module doesn't support POSIX character classes, but the regex
module on PyPI does.

https://pypi.python.org/pypi/regex
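
If a third-party module is not an option, an ASCII-only approximation with the standard re module might look like the sketch below. It assumes that "printable" means the range from space to tilde, which is narrower than what Perl's [:print:] matches on Unicode strings. The names NON_PRINTABLE and has_funny_chars are made up for illustration.

```python
import re

# ASCII-only stand-in for Perl's /[^[:print:]]/: anything outside
# space (0x20) through tilde (0x7E) counts as non-printable.
NON_PRINTABLE = re.compile(r"[^\x20-\x7e]")


def has_funny_chars(line):
    """Return True if the line contains a character outside printable ASCII."""
    return NON_PRINTABLE.search(line) is not None


lines = ["plain text", "tab\there", "bell\x07", "caf\u00e9"]
kept = [line for line in lines if not has_funny_chars(line)]
print(kept)  # ['plain text']
```

Note that this rejects every non-ASCII character, accented letters included, so it only fits data that is expected to be plain ASCII.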


#104571

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2016-03-11 01:30 +0000
Message-ID	<mailman.164.1457659889.15725.python-list@python.org>
In reply to	#104564

On 11/03/2016 00:25, Ian Kelly wrote:
> On Mar 10, 2016 5:15 PM, "Fillmore" <fillmore_remove@hotmail.com> wrote:
>>
>>
>> Here's another handy Perl regex which I am not sure how to translate to
>> Python.
>>
>> I use it to avoid processing lines that contain funny chars...
>>
>> if ($string =~ /[^[:print:]]/) {next OUTER;}
>
> Python's re module doesn't support POSIX character classes, but the regex
> module on PyPI does.
>
> https://pypi.python.org/pypi/regex
>

There are plenty of testers for the re module, but do you know if there 
are any available for the above, as it's not the easiest thing to search 
for?

-- 
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence


#104579

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2016-03-10 20:52 -0700
Message-ID	<mailman.170.1457668375.15725.python-list@python.org>
In reply to	#104564

On Mar 10, 2016 6:33 PM, "Mark Lawrence" <breamoreboy@yahoo.co.uk> wrote:
>
> On 11/03/2016 00:25, Ian Kelly wrote:
>>
>> On Mar 10, 2016 5:15 PM, "Fillmore" <fillmore_remove@hotmail.com> wrote:
>>>
>>>
>>>
>>> Here's another handy Perl regex which I am not sure how to translate to
>>> Python.
>>>
>>>
>>> I use it to avoid processing lines that contain funny chars...
>>>
>>> if ($string =~ /[^[:print:]]/) {next OUTER;}
>>
>>
>> Python's re module doesn't support POSIX character classes, but the regex
>> module on PyPI does.
>>
>> https://pypi.python.org/pypi/regex
>>
>
> There are plenty of testers for the re module, but do you know if there
> are any available for the above, as it's not the easiest thing to search
> for?

No idea.


#104611

From	Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de>
Date	2016-03-11 13:13 +0100
Message-ID	<mailman.17.1457698399.26429.python-list@python.org>
In reply to	#104564

One lesson for Perl regex users is that in Python many things can be 
solved without regexes. How about defining:

printable = {chr(n) for n in range(32, 127)}

then using:

if (set(my_string) - set(printable)):
     break



On 11.03.2016 01:07, Fillmore wrote:
>
> Here's another handy Perl regex which I am not sure how to translate to
> Python.
>
> I use it to avoid processing lines that contain funny chars...
>
> if ($string =~ /[^[:print:]]/) {next OUTER;}
>
> :)
>


#104618

From	Fillmore <fillmore_remove@hotmail.com>
Date	2016-03-11 09:23 -0500
Message-ID	<nbukcd$gs2$1@gioia.aioe.org>
In reply to	#104611

On 03/11/2016 07:13 AM, Wolfgang Maier wrote:
> One lesson for Perl regex users is that in Python many things can be solved without regexes.
> How about defining:
>
> printable = {chr(n) for n in range(32, 127)}
>
> then using:
>
> if (set(my_string) - set(printable)):
>      break

Seems computationally heavy. I have a file with about 70k lines, of which only 20 contain "funny" chars.

Any idea how I can write a script that compares Perl speed vs. Python speed
in performing the cleaning operation?


#104621

From	Peter Otten <__peter__@web.de>
Date	2016-03-11 16:22 +0100
Message-ID	<mailman.24.1457709748.26429.python-list@python.org>
In reply to	#104618

Fillmore wrote:

> On 03/11/2016 07:13 AM, Wolfgang Maier wrote:
>> One lesson for Perl regex users is that in Python many things can be
>> solved without regexes. How about defining:
>>
>> printable = {chr(n) for n in range(32, 127)}
>>
>> then using:
>>
>> if (set(my_string) - set(printable)):
>>      break
> 
> seems computationally heavy. I have a file with about 70k lines, of which
> only 20 contain "funny" chars.
> 
> Any idea on how I can create a script that compares Perl speed vs. Python
> speed in performing the cleaning operation?

Try 

for line in ...:
    if has_nonprint(line):
        continue
    ...

with the has_nonprint() function as defined below:

$ cat isprint.py
import sys
import unicodedata


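class Lookup(dict):
    """Translation table for str.translate(), filled in lazily.

    Non-printable code points map to themselves (kept); everything
    else maps to None (deleted).
    """
    def __missing__(self, n):
        c = chr(n)
        cat = unicodedata.category(c)
        # Surrogates, unassigned code points, line and paragraph
        # separators, and control characters count as non-printable.
        if cat in {'Cs', 'Cn', 'Zl', 'Cc', 'Zp'}:
            self[n] = c
            return c
        else:
            self[n] = None
            return None


lookup = Lookup()
lookup[10] = None # allow newline

def has_nonprint(s):
    # After translation only the non-printable characters remain.
    return bool(s.translate(lookup))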

$ python3 -i isprint.py
>>> has_nonprint("foo")
False
>>> has_nonprint("foo\n")
False
>>> has_nonprint("foo\t")
True
>>> has_nonprint("\0foo")
True


#104623

From	Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de>
Date	2016-03-11 17:34 +0100
Message-ID	<mailman.26.1457714063.26429.python-list@python.org>
In reply to	#104618

On 11.03.2016 15:23, Fillmore wrote:
> On 03/11/2016 07:13 AM, Wolfgang Maier wrote:
>> One lesson for Perl regex users is that in Python many things can be
>> solved without regexes.
>> How about defining:
>>
>> printable = {chr(n) for n in range(32, 127)}
>>
>> then using:
>>
>> if (set(my_string) - set(printable)):
>>      break
>
> seems computationally heavy. I have a file with about 70k lines, of
> which only 20 contain "funny" chars.
>

Not sure what you call computationally heavy. I just test-parsed a 30 MB 
file (28k lines) with:

with open(my_file) as i:
     for line in i:
         if set(line) - printable:
             continue

and it finished in less than a second.
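
For anyone who wants to measure rather than guess, here is a rough sketch using the standard timeit module. The data is synthetic (70k short lines, 20 of them containing a control character), so the numbers will differ from a real file; the function names count_by_set and count_by_scan are invented for this example. It only compares two Python variants and says nothing about Perl.

```python
import timeit

printable = frozenset(chr(n) for n in range(32, 127))

# Synthetic stand-in for the real file: 70k lines, 20 with a control char.
lines = ["some ordinary line of text"] * 70_000
for i in range(0, 70_000, 3_500):
    lines[i] = "line with a bell\x07"


def count_by_set():
    # Build a set per line and subtract the printable characters.
    return sum(1 for line in lines if set(line) - printable)


def count_by_scan():
    # Check character by character; stops at the first offender in a line.
    return sum(1 for line in lines if not all(c in printable for c in line))


for func in (count_by_set, count_by_scan):
    seconds = timeit.timeit(func, number=3)
    print(f"{func.__name__}: {seconds / 3:.3f} s per pass")
```

Both functions should report the same count of offending lines; which one is faster depends on line length and on how many distinct characters each line holds.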


#104625

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2016-03-11 10:08 -0700
Message-ID	<mailman.28.1457716135.26429.python-list@python.org>
In reply to	#104618

On Fri, Mar 11, 2016 at 9:34 AM, Wolfgang Maier
<wolfgang.maier@biologie.uni-freiburg.de> wrote:
> On 11.03.2016 15:23, Fillmore wrote:
>>
>> On 03/11/2016 07:13 AM, Wolfgang Maier wrote:
>>>
>>> One lesson for Perl regex users is that in Python many things can be
>>> solved without regexes.
>>> How about defining:
>>>
>>> printable = {chr(n) for n in range(32, 127)}
>>>
>>> then using:
>>>
>>> if (set(my_string) - set(printable)):
>>>      break
>>
>>
>> seems computationally heavy. I have a file with about 70k lines, of
>> which only 20 contain "funny" chars.
>>
>
> Not sure what you call computationally heavy. I just test-parsed a 30 MB
> file (28k lines) with:
>
> with open(my_file) as i:
>     for line in i:
>         if set(line) - printable:
>             continue
>
> and it finished in less than a second.

Did your test file contain on the order of 100 unique characters, or
on the order of 100,000?  Granted that most input data would likely
fall into the former category.


#104612

From	Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de>
Date	2016-03-11 13:17 +0100
Message-ID	<mailman.18.1457698808.26429.python-list@python.org>
In reply to	#104564

On 11.03.2016 13:13, Wolfgang Maier wrote:
> One lesson for Perl regex users is that in Python many things can be
> solved without regexes. How about defining:
>
> printable = {chr(n) for n in range(32, 127)}
>
> then using:
>
> if (set(my_string) - set(printable)):
>      break
>

Err, I meant:

if (set(my_string) - printable):
     break

of course. No need to attempt another set conversion.
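
Put together as a line filter, the idea might look like the sketch below. Two assumptions go beyond the snippet above: the trailing newline is stripped before the check (since "\n" is not in the set), and continue is used instead of break, because Perl's "next" skips to the following line rather than ending the loop. The helper name clean_lines is made up.

```python
# Printable ASCII only (space through tilde), as defined earlier.
printable = frozenset(chr(n) for n in range(32, 127))


def clean_lines(lines):
    """Yield only the lines made up entirely of printable ASCII."""
    for line in lines:
        # Strip the trailing newline first; "\n" is not in the set.
        if set(line.rstrip("\n")) - printable:
            continue  # Python's counterpart to Perl's "next"
        yield line


sample = ["good line\n", "bad\x00line\n", "also good\n"]
print(list(clean_lines(sample)))  # ['good line\n', 'also good\n']
```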


#104614

From	Marko Rauhamaa <marko@pacujo.net>
Date	2016-03-11 14:47 +0200
Message-ID	<8737rxgp0a.fsf@elektro.pacujo.net>
In reply to	#104612

Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de>:

> On 11.03.2016 13:13, Wolfgang Maier wrote:
>> One lesson for Perl regex users is that in Python many things can be
>> solved without regexes. How about defining:
>>
>> printable = {chr(n) for n in range(32, 127)}
>>
>> then using:
>>
>> if (set(my_string) - set(printable)):
>>      break
>>
>
> Err, I meant:
>
> if (set(my_string) - printable):
>     break
>
> of course. No need to attempt another set conversion.

Most non-ASCII characters are printable, or at least a good many.

Unfortunately, "printable" doesn't seem to be a Unicode category.

Marko


#104627

From	MRAB <python@mrabarnett.plus.com>
Date	2016-03-11 19:23 +0000
Message-ID	<mailman.0.1457724244.12893.python-list@python.org>
In reply to	#104564

On 2016-03-11 00:07, Fillmore wrote:
>
> Here's another handy Perl regex which I am not sure how to translate to
> Python.
>
> I use it to avoid processing lines that contain funny chars...
>
> if ($string =~ /[^[:print:]]/) {next OUTER;}
>
> :)
>
Python 3 (Unicode) strings have an .isprintable method:

mystring.isprintable()
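
One thing to watch: isprintable() treats "\n" and "\t" as non-printable, so a line read from a file needs its line ending removed before the test. A minimal sketch, assuming the lines are already str objects (the helper name printable_lines is invented here):

```python
def printable_lines(lines):
    """Yield lines whose content, minus the line ending, is printable."""
    for line in lines:
        # "\n" and "\r" are themselves non-printable, so drop them first.
        if not line.rstrip("\r\n").isprintable():
            continue
        yield line


sample = ["hello\n", "こんにちは\n", "tab\tseparated\n", "nul\x00byte\n"]
print(list(printable_lines(sample)))  # ['hello\n', 'こんにちは\n']
```

Unlike an ASCII-range check, this keeps non-ASCII text and rejects only what the Unicode database classifies as non-printable.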


#104629

From	Fillmore <fillmore_remove@hotmail.com>
Date	2016-03-11 14:36 -0500
Message-ID	<nbv6n7$1eu7$1@gioia.aioe.org>
In reply to	#104627

On 3/11/2016 2:23 PM, MRAB wrote:
> On 2016-03-11 00:07, Fillmore wrote:
>>
>> Here's another handy Perl regex which I am not sure how to translate to
>> Python.
>>
>> I use it to avoid processing lines that contain funny chars...
>>
>> if ($string =~ /[^[:print:]]/) {next OUTER;}
>>
>> :)
>>
> Python 3 (Unicode) strings have an .isprintable method:
>
> mystring.isprintable()
>

my strings are UTF-8. Will it work there too?


#104633

From	Ben Finney <ben+python@benfinney.id.au>
Date	2016-03-12 06:52 +1100
Message-ID	<mailman.2.1457725975.12893.python-list@python.org>
In reply to	#104629

Fillmore <fillmore_remove@hotmail.com> writes:

> On 3/11/2016 2:23 PM, MRAB wrote:
> > Python 3 (Unicode) strings have an .isprintable method:
> >
> > mystring.isprintable()
>
> my strings are UTF-8. Will it work there too?

You need to always be clear on the difference between text (the Python 3
‘str’ type) versus bytes.

It only makes sense to talk about an encoding, when talking about bytes.

Text itself is an abstract data type; the content of a Unicode string
does not have any encoding because it is not encoded.

The content of a byte stream (such as a file's content) is not text, it
is bytes.

    >>> foo = "こんにちは"
    >>> foo.isprintable()
    True

    >>> foo_encoded = foo.encode("utf-8")
    >>> foo_encoded.isprintable()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'bytes' object has no attribute 'isprintable'

You can only ask ‘isprintable’ about text. Bytes are not printable
because bytes are not text; you need to decode the bytes to text before
asking whether that text is printable.

    >>> infile = open('lorem.txt', 'rb')
    >>> infile_bytes = infile.read()
    >>> infile_bytes.isprintable()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'bytes' object has no attribute 'isprintable'

    >>> infile = open('lorem.txt', 'rt', encoding="utf-8")
    >>> infile_text = infile.read()
    >>> infile_text.isprintable()
    True
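
When the data arrives as bytes, the decode step can be wrapped up as in the sketch below. Treating undecodable input as "not printable" is an assumption of this example, not something the str API decides for you, and the function name bytes_are_printable is hypothetical. Keep in mind that text containing newlines is reported as not printable, because "\n" itself is a control character.

```python
def bytes_are_printable(data, encoding="utf-8"):
    """Decode first, then ask the resulting text whether it is printable."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return False  # not valid in this encoding, so treat it as "funny"
    return text.isprintable()


print(bytes_are_printable("こんにちは".encode("utf-8")))  # True
print(bytes_are_printable(b"\xff\xfe"))                    # False
print(bytes_are_printable(b"line one\nline two"))          # False
```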

-- 
 \        “Telling pious lies to trusting children is a form of abuse, |
  `\                    plain and simple.” —Daniel Dennett, 2010-01-12 |
_o__)                                                                  |
Ben Finney

