

Groups > comp.lang.python > #104564 > unrolled thread

non printable (moving away from Perl)

Started by Fillmore <fillmore_remove@hotmail.com>
First post 2016-03-10 19:07 -0500
Last post 2016-03-12 06:52 +1100
Articles 14 — 8 participants



Contents

  non printable (moving away from Perl) Fillmore <fillmore_remove@hotmail.com> - 2016-03-10 19:07 -0500
    Re: non printable (moving away from Perl) Ian Kelly <ian.g.kelly@gmail.com> - 2016-03-10 17:25 -0700
    Re: non printable (moving away from Perl) Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-03-11 01:30 +0000
    Re: non printable (moving away from Perl) Ian Kelly <ian.g.kelly@gmail.com> - 2016-03-10 20:52 -0700
    Re: non printable (moving away from Perl) Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> - 2016-03-11 13:13 +0100
      Re: non printable (moving away from Perl) Fillmore <fillmore_remove@hotmail.com> - 2016-03-11 09:23 -0500
        Re: non printable (moving away from Perl) Peter Otten <__peter__@web.de> - 2016-03-11 16:22 +0100
        Re: non printable (moving away from Perl) Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> - 2016-03-11 17:34 +0100
        Re: non printable (moving away from Perl) Ian Kelly <ian.g.kelly@gmail.com> - 2016-03-11 10:08 -0700
    Re: non printable (moving away from Perl) Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> - 2016-03-11 13:17 +0100
      Re: non printable (moving away from Perl) Marko Rauhamaa <marko@pacujo.net> - 2016-03-11 14:47 +0200
    Re: non printable (moving away from Perl) MRAB <python@mrabarnett.plus.com> - 2016-03-11 19:23 +0000
      Re: non printable (moving away from Perl) Fillmore <fillmore_remove@hotmail.com> - 2016-03-11 14:36 -0500
        Re: non printable (moving away from Perl) Ben Finney <ben+python@benfinney.id.au> - 2016-03-12 06:52 +1100

#104564 — non printable (moving away from Perl)

From Fillmore <fillmore_remove@hotmail.com>
Date 2016-03-10 19:07 -0500
Subject non printable (moving away from Perl)
Message-ID <nbt27u$fe7$1@gioia.aioe.org>
Here's another handy Perl regex which I am not sure how to translate to 
Python.

I use it to avoid processing lines that contain funny chars...

if ($string =~ /[^[:print:]]/) {next OUTER;}

:)



#104566

From Ian Kelly <ian.g.kelly@gmail.com>
Date 2016-03-10 17:25 -0700
Message-ID <mailman.162.1457655958.15725.python-list@python.org>
In reply to #104564
On Mar 10, 2016 5:15 PM, "Fillmore" <fillmore_remove@hotmail.com> wrote:
>
>
> Here's another handy Perl regex which I am not sure how to translate to
Python.
>
> I use it to avoid processing lines that contain funny chars...
>
> if ($string =~ /[^[:print:]]/) {next OUTER;}

Python's re module doesn't support POSIX character classes, but the regex
module on PyPI does.

https://pypi.python.org/pypi/regex
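
For ASCII-only input, the standard `re` module can approximate the Perl test with an explicit range instead of a POSIX class. This is only a sketch: the range `\x20-\x7e` assumes "printable" means printable ASCII (space through tilde), and the names `NON_PRINTABLE` and `has_funny_chars` are made up for illustration.

```python
import re

# Assumption: "printable" means ASCII 0x20 (space) through 0x7E (tilde).
NON_PRINTABLE = re.compile(r"[^\x20-\x7e]")


def has_funny_chars(line):
    """Return True if the line contains anything outside printable ASCII."""
    return NON_PRINTABLE.search(line) is not None


# Hypothetical sample input.
lines = ["plain text", "tab\there", "caf\u00e9"]
kept = [line for line in lines if not has_funny_chars(line)]
print(kept)  # ['plain text']
```

Note that this rejects every non-ASCII character, including accented letters, which may or may not be what is wanted.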



#104571

From Mark Lawrence <breamoreboy@yahoo.co.uk>
Date 2016-03-11 01:30 +0000
Message-ID <mailman.164.1457659889.15725.python-list@python.org>
In reply to #104564
On 11/03/2016 00:25, Ian Kelly wrote:
> On Mar 10, 2016 5:15 PM, "Fillmore" <fillmore_remove@hotmail.com> wrote:
>>
>>
>> Here's another handy Perl regex which I am not sure how to translate to
> Python.
>>
>> I use it to avoid processing lines that contain funny chars...
>>
>> if ($string =~ /[^[:print:]]/) {next OUTER;}
>
> Python's re module doesn't support POSIX character classes, but the regex
> module on PyPI does.
>
> https://pypi.python.org/pypi/regex
>

There are plenty of testers for the re module, but do you know if there 
are any available for the above, as it's not the easiest thing to search 
for?

-- 
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence



#104579

From Ian Kelly <ian.g.kelly@gmail.com>
Date 2016-03-10 20:52 -0700
Message-ID <mailman.170.1457668375.15725.python-list@python.org>
In reply to #104564
On Mar 10, 2016 6:33 PM, "Mark Lawrence" <breamoreboy@yahoo.co.uk> wrote:
>
> On 11/03/2016 00:25, Ian Kelly wrote:
>>
>> On Mar 10, 2016 5:15 PM, "Fillmore" <fillmore_remove@hotmail.com> wrote:
>>>
>>>
>>>
>>> Here's another handy Perl regex which I am not sure how to translate to
>>
>> Python.
>>>
>>>
>>> I use it to avoid processing lines that contain funny chars...
>>>
>>> if ($string =~ /[^[:print:]]/) {next OUTER;}
>>
>>
>> Python's re module doesn't support POSIX character classes, but the regex
>> module on PyPI does.
>>
>> https://pypi.python.org/pypi/regex
>>
>
> There are plenty of testers for the re module, but do you know if there
are any available for the above, as it's not the easiest thing to search
for?

No idea.



#104611

From Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de>
Date 2016-03-11 13:13 +0100
Message-ID <mailman.17.1457698399.26429.python-list@python.org>
In reply to #104564
One lesson for Perl regex users is that in Python many things can be 
solved without regexes. How about defining:

printable = {chr(n) for n in range(32, 127)}

then using:

if (set(my_string) - set(printable)):
     break



On 11.03.2016 01:07, Fillmore wrote:
>
> Here's another handy Perl regex which I am not sure how to translate to
> Python.
>
> I use it to avoid processing lines that contain funny chars...
>
> if ($string =~ /[^[:print:]]/) {next OUTER;}
>
> :)
>



#104618

From Fillmore <fillmore_remove@hotmail.com>
Date 2016-03-11 09:23 -0500
Message-ID <nbukcd$gs2$1@gioia.aioe.org>
In reply to #104611
On 03/11/2016 07:13 AM, Wolfgang Maier wrote:
> One lesson for Perl regex users is that in Python many things can be solved without regexes.
> How about defining:
>
> printable = {chr(n) for n in range(32, 127)}
>
> then using:
>
> if (set(my_string) - set(printable)):
>      break

Seems computationally heavy. I have a file with about 70k lines, of which only 20 contain "funny" chars.

Any idea how I can create a script that compares Perl speed vs. Python speed
in performing the cleaning operation?
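
One rough way to measure the Python side is the standard `timeit` module. The sketch below times two candidate checks on made-up data shaped like the description (70k lines, 20 of them "funny"); the sample lines and function names are assumptions, and the regex variant assumes "printable" means printable ASCII. The Perl script could be timed separately, for example with the shell's `time` command.

```python
import re
import timeit

printable = {chr(n) for n in range(32, 127)}
non_printable = re.compile(r"[^\x20-\x7e]")

# Hypothetical data: 70k lines, 20 of which contain a NUL character.
lines = ["an ordinary line of text"] * 70_000
for i in range(0, 70_000, 3_500):
    lines[i] = "a line with a \x00 byte"


def count_with_set():
    # Count lines that have at least one character outside the set.
    return sum(1 for line in lines if set(line) - printable)


def count_with_regex():
    # Count lines where the regex finds a non-printable character.
    return sum(1 for line in lines if non_printable.search(line))


for func in (count_with_set, count_with_regex):
    seconds = timeit.timeit(func, number=3)
    print(f"{func.__name__}: {seconds / 3:.3f} s per pass")
```

Timings depend on the machine and on the real data, so the numbers are only meaningful when both variants run on the actual file.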



#104621

From Peter Otten <__peter__@web.de>
Date 2016-03-11 16:22 +0100
Message-ID <mailman.24.1457709748.26429.python-list@python.org>
In reply to #104618
Fillmore wrote:

> On 03/11/2016 07:13 AM, Wolfgang Maier wrote:
>> One lesson for Perl regex users is that in Python many things can be
>> solved without regexes. How about defining:
>>
>> printable = {chr(n) for n in range(32, 127)}
>>
>> then using:
>>
>> if (set(my_string) - set(printable)):
>>      break
> 
> seems computationally heavy. I have a file with about 70k lines, of which
> only 20 contain "funny" chars.
> 
> Any idea on how I can create a script that compares Perl speed vs. Python
> speed in performing the cleaning operation?

Try 

for line in ...:
    if has_nonprint(line):
        continue
    ...

with the has_nonprint() function as defined below:

$ cat isprint.py
import sys
import unicodedata


class Lookup(dict):
    def __missing__(self, n):
        c = chr(n)
        cat = unicodedata.category(c)
        if cat in {'Cs', 'Cn', 'Zl', 'Cc', 'Zp'}:
            self[n] = c
            return c
        else:
            self[n] = None
            return None


lookup = Lookup()
lookup[10] = None # allow newline

def has_nonprint(s):
    return bool(s.translate(lookup))

$ python3 -i isprint.py
>>> has_nonprint("foo")
False
>>> has_nonprint("foo\n")
False
>>> has_nonprint("foo\t")
True
>>> has_nonprint("\0foo")
True



#104623

From Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de>
Date 2016-03-11 17:34 +0100
Message-ID <mailman.26.1457714063.26429.python-list@python.org>
In reply to #104618
On 11.03.2016 15:23, Fillmore wrote:
> On 03/11/2016 07:13 AM, Wolfgang Maier wrote:
>> One lesson for Perl regex users is that in Python many things can be
>> solved without regexes.
>> How about defining:
>>
>> printable = {chr(n) for n in range(32, 127)}
>>
>> then using:
>>
>> if (set(my_string) - set(printable)):
>>      break
>
> seems computationally heavy. I have a file with about 70k lines, of
> which only 20 contain "funny" chars.
>

Not sure what you call computationally heavy. I just test-parsed a 30 MB 
file (28k lines) with:

with open(my_file) as i:
     for line in i:
         if set(line) - printable:
             continue

and it finished in less than a second.



#104625

From Ian Kelly <ian.g.kelly@gmail.com>
Date 2016-03-11 10:08 -0700
Message-ID <mailman.28.1457716135.26429.python-list@python.org>
In reply to #104618
On Fri, Mar 11, 2016 at 9:34 AM, Wolfgang Maier
<wolfgang.maier@biologie.uni-freiburg.de> wrote:
> On 11.03.2016 15:23, Fillmore wrote:
>>
>> On 03/11/2016 07:13 AM, Wolfgang Maier wrote:
>>>
>>> One lesson for Perl regex users is that in Python many things can be
>>> solved without regexes.
>>> How about defining:
>>>
>>> printable = {chr(n) for n in range(32, 127)}
>>>
>>> then using:
>>>
>>> if (set(my_string) - set(printable)):
>>>      break
>>
>>
>> seems computationally heavy. I have a file with about 70k lines, of
>> which only 20 contain "funny" chars.
>>
>
> Not sure what you call computationally heavy. I just test-parsed a 30 MB
> file (28k lines) with:
>
> with open(my_file) as i:
>     for line in i:
>         if set(line) - printable:
>             continue
>
> and it finished in less than a second.

Did your test file contain on the order of 100 unique characters, or
on the order of 100,000?  Granted that most input data would likely
fall into the former category.
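
The cost of `set(line)` grows with the number of distinct characters in the line. One way to avoid building a set per line, offered as a sketch rather than a measured improvement, is to test each character against the fixed `printable` set and stop at the first miss. The name `is_clean` is hypothetical.

```python
printable = frozenset(chr(n) for n in range(32, 127))


def is_clean(line):
    # all() stops at the first character that is not in the set,
    # so no per-line set is built.
    return all(ch in printable for ch in line)


print(is_clean("hello"))      # True
print(is_clean("hel\x1blo"))  # False
```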



#104612

From Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de>
Date 2016-03-11 13:17 +0100
Message-ID <mailman.18.1457698808.26429.python-list@python.org>
In reply to #104564
On 11.03.2016 13:13, Wolfgang Maier wrote:
> One lesson for Perl regex users is that in Python many things can be
> solved without regexes. How about defining:
>
> printable = {chr(n) for n in range(32, 127)}
>
> then using:
>
> if (set(my_string) - set(printable)):
>      break
>

Err, I meant:

if (set(my_string) - printable):
     break

of course. No need to attempt another set conversion.
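
Put into a loop, the corrected check might look like the sketch below. It uses `continue` rather than `break`, on the assumption that the goal matches the Perl `next`: skip the offending line and keep going. The sample lines are made up.

```python
# Printable ASCII, as defined earlier in the thread.
printable = {chr(n) for n in range(32, 127)}

# Hypothetical sample input standing in for the lines of a file.
lines = ["first line", "bad\x07line", "last line"]

clean = []
for line in lines:
    # A non-empty difference means at least one character is not printable.
    if set(line) - printable:
        continue  # skip this line and move on to the next one
    clean.append(line)

print(clean)  # ['first line', 'last line']
```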



#104614

From Marko Rauhamaa <marko@pacujo.net>
Date 2016-03-11 14:47 +0200
Message-ID <8737rxgp0a.fsf@elektro.pacujo.net>
In reply to #104612
Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de>:

> On 11.03.2016 13:13, Wolfgang Maier wrote:
>> One lesson for Perl regex users is that in Python many things can be
>> solved without regexes. How about defining:
>>
>> printable = {chr(n) for n in range(32, 127)}
>>
>> then using:
>>
>> if (set(my_string) - set(printable)):
>>      break
>>
>
> Err, I meant:
>
> if (set(my_string) - printable):
>     break
>
> of course. No need to attempt another set conversion.

Most non-ASCII characters are printable, or at least a good many.

Unfortunately, "printable" doesn't seem to be a Unicode category.


Marko
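
Although there is no single "printable" category, every code point does carry a two-letter general category, which `unicodedata.category` reports. A small sketch with arbitrarily chosen sample characters:

```python
import unicodedata

# Arbitrary samples: two letters, a space, a tab, and LINE SEPARATOR.
samples = ["a", "\u00e9", " ", "\t", "\u2028"]
for ch in samples:
    # The first letter is the major class: L = letter, Z = separator, C = other.
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)}")
```

A printability test can then be built by choosing which categories to reject.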



#104627

From MRAB <python@mrabarnett.plus.com>
Date 2016-03-11 19:23 +0000
Message-ID <mailman.0.1457724244.12893.python-list@python.org>
In reply to #104564
On 2016-03-11 00:07, Fillmore wrote:
>
> Here's another handy Perl regex which I am not sure how to translate to
> Python.
>
> I use it to avoid processing lines that contain funny chars...
>
> if ($string =~ /[^[:print:]]/) {next OUTER;}
>
> :)
>
Python 3 (Unicode) strings have an .isprintable method:

mystring.isprintable()
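
One thing to watch when applying this to lines read from a file: the trailing newline is itself non-printable, so it has to be stripped before the test. A minimal sketch with made-up sample lines:

```python
# Hypothetical sample lines, each still carrying its trailing newline
# as it would when read from a file.
lines = ["plain\n", "with\ttab\n", "caf\u00e9\n"]

kept = []
for line in lines:
    text = line.rstrip("\n")  # "\n" itself counts as non-printable
    if not text.isprintable():
        continue
    kept.append(text)

print(kept)  # ['plain', 'café']
```

Unlike an ASCII-only check, this keeps accented letters and other printable non-ASCII characters.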



#104629

From Fillmore <fillmore_remove@hotmail.com>
Date 2016-03-11 14:36 -0500
Message-ID <nbv6n7$1eu7$1@gioia.aioe.org>
In reply to #104627
On 3/11/2016 2:23 PM, MRAB wrote:
> On 2016-03-11 00:07, Fillmore wrote:
>>
>> Here's another handy Perl regex which I am not sure how to translate to
>> Python.
>>
>> I use it to avoid processing lines that contain funny chars...
>>
>> if ($string =~ /[^[:print:]]/) {next OUTER;}
>>
>> :)
>>
> Python 3 (Unicode) strings have an .isprintable method:
>
> mystring.isprintable()
>

My strings are UTF-8. Will it work there too?



#104633

From Ben Finney <ben+python@benfinney.id.au>
Date 2016-03-12 06:52 +1100
Message-ID <mailman.2.1457725975.12893.python-list@python.org>
In reply to #104629
Fillmore <fillmore_remove@hotmail.com> writes:

> On 3/11/2016 2:23 PM, MRAB wrote:
> > Python 3 (Unicode) strings have an .isprintable method:
> >
> > mystring.isprintable()
>
> my strings are UTF-8. Will it work there too?

You need to always be clear on the difference between text (the Python 3
‘str’ type) versus bytes.

It only makes sense to talk about an encoding when talking about bytes.

Text itself is an abstract data type; the content of a Unicode string
does not have any encoding because it is not encoded.

The content of a byte stream (such as a file's content) is not text, it
is bytes.

    >>> foo = "こんにちは"
    >>> foo.isprintable()
    True

    >>> foo_encoded = foo.encode("utf-8")
    >>> foo_encoded.isprintable()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'bytes' object has no attribute 'isprintable'

You can only ask ‘isprintable’ about text. Bytes are not printable
because bytes are not text; you need to decode the bytes to text before
asking whether that text is printable.

    >>> infile = open('lorem.txt', 'rb')
    >>> infile_bytes = infile.read()
    >>> infile_bytes.isprintable()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'bytes' object has no attribute 'isprintable'

    >>> infile = open('lorem.txt', 'rt', encoding="utf-8")
    >>> infile_text = infile.read()
    >>> infile_text.isprintable()
    True

-- 
 \        “Telling pious lies to trusting children is a form of abuse, |
  `\                    plain and simple.” —Daniel Dennett, 2010-01-12 |
_o__)                                                                  |
Ben Finney
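
Combining the decode step with the line-by-line filter from earlier in the thread could look like the sketch below. The function name `printable_lines` and the choice to skip lines that fail to decode are assumptions made for illustration, not something stated in the thread.

```python
def printable_lines(raw_lines, encoding="utf-8"):
    """Yield decoded lines that consist only of printable characters."""
    for raw in raw_lines:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # assumption: drop bytes that are not valid in this encoding
        text = text.rstrip("\r\n")  # line endings are not printable
        if text.isprintable():
            yield text


# Hypothetical byte input standing in for a file opened in 'rb' mode.
raw = ["こんにちは\n".encode("utf-8"), b"bell\x07\n", b"\xff\xfe\n"]
print(list(printable_lines(raw)))  # ['こんにちは']
```

Opening the file in text mode with `encoding="utf-8"` does the decoding implicitly; the explicit version only makes the bytes-to-text step visible.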


