Groups > comp.lang.python > #104564 > unrolled thread
| Started by | Fillmore <fillmore_remove@hotmail.com> |
|---|---|
| First post | 2016-03-10 19:07 -0500 |
| Last post | 2016-03-12 06:52 +1100 |
| Articles | 14 — 8 participants |
non printable (moving away from Perl) Fillmore <fillmore_remove@hotmail.com> - 2016-03-10 19:07 -0500
Re: non printable (moving away from Perl) Ian Kelly <ian.g.kelly@gmail.com> - 2016-03-10 17:25 -0700
Re: non printable (moving away from Perl) Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-03-11 01:30 +0000
Re: non printable (moving away from Perl) Ian Kelly <ian.g.kelly@gmail.com> - 2016-03-10 20:52 -0700
Re: non printable (moving away from Perl) Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> - 2016-03-11 13:13 +0100
Re: non printable (moving away from Perl) Fillmore <fillmore_remove@hotmail.com> - 2016-03-11 09:23 -0500
Re: non printable (moving away from Perl) Peter Otten <__peter__@web.de> - 2016-03-11 16:22 +0100
Re: non printable (moving away from Perl) Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> - 2016-03-11 17:34 +0100
Re: non printable (moving away from Perl) Ian Kelly <ian.g.kelly@gmail.com> - 2016-03-11 10:08 -0700
Re: non printable (moving away from Perl) Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> - 2016-03-11 13:17 +0100
Re: non printable (moving away from Perl) Marko Rauhamaa <marko@pacujo.net> - 2016-03-11 14:47 +0200
Re: non printable (moving away from Perl) MRAB <python@mrabarnett.plus.com> - 2016-03-11 19:23 +0000
Re: non printable (moving away from Perl) Fillmore <fillmore_remove@hotmail.com> - 2016-03-11 14:36 -0500
Re: non printable (moving away from Perl) Ben Finney <ben+python@benfinney.id.au> - 2016-03-12 06:52 +1100
| From | Fillmore <fillmore_remove@hotmail.com> |
|---|---|
| Date | 2016-03-10 19:07 -0500 |
| Subject | non printable (moving away from Perl) |
| Message-ID | <nbt27u$fe7$1@gioia.aioe.org> |
Here's another handy Perl regex which I am not sure how to translate to
Python.
I use it to avoid processing lines that contain funny chars...
if ($string =~ /[^[:print:]]/) {next OUTER;}
:)
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2016-03-10 17:25 -0700 |
| Message-ID | <mailman.162.1457655958.15725.python-list@python.org> |
| In reply to | #104564 |
On Mar 10, 2016 5:15 PM, "Fillmore" <fillmore_remove@hotmail.com> wrote:
>
>
> Here's another handy Perl regex which I am not sure how to translate to
> Python.
>
> I use it to avoid processing lines that contain funny chars...
>
> if ($string =~ /[^[:print:]]/) {next OUTER;}
Python's re module doesn't support POSIX character classes, but the regex
module on PyPI does.
https://pypi.python.org/pypi/regex
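For input that is expected to be plain ASCII, the standard re module can get close to the Perl test with a negated character range. The following is only a sketch under that assumption: it treats "printable" as code points 0x20 through 0x7E, so any non-ASCII character is rejected along with control characters. The names NON_PRINTABLE_ASCII and has_funny_chars are made up for illustration.

```python
import re

# Assumption: "printable" means the ASCII range 0x20 (space) to 0x7E (~).
NON_PRINTABLE_ASCII = re.compile(r"[^\x20-\x7E]")

def has_funny_chars(line):
    """Return True if line contains anything outside printable ASCII."""
    return NON_PRINTABLE_ASCII.search(line) is not None

lines = ["plain text", "tab\there", "caf\u00e9", "nul\x00byte"]
kept = [line for line in lines if not has_funny_chars(line)]
print(kept)
```

Inside a loop, `if has_funny_chars(line): continue` plays the role of the Perl `next`.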
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2016-03-11 01:30 +0000 |
| Message-ID | <mailman.164.1457659889.15725.python-list@python.org> |
| In reply to | #104564 |
On 11/03/2016 00:25, Ian Kelly wrote:
> On Mar 10, 2016 5:15 PM, "Fillmore" <fillmore_remove@hotmail.com> wrote:
>>
>>
>> Here's another handy Perl regex which I am not sure how to translate to
>> Python.
>>
>> I use it to avoid processing lines that contain funny chars...
>>
>> if ($string =~ /[^[:print:]]/) {next OUTER;}
>
> Python's re module doesn't support POSIX character classes, but the regex
> module on PyPI does.
>
> https://pypi.python.org/pypi/regex
>
There are plenty of testers for the re module, but do you know if there
are any available for the above, as it's not the easiest thing to search
for?
--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.
Mark Lawrence
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2016-03-10 20:52 -0700 |
| Message-ID | <mailman.170.1457668375.15725.python-list@python.org> |
| In reply to | #104564 |
On Mar 10, 2016 6:33 PM, "Mark Lawrence" <breamoreboy@yahoo.co.uk> wrote:
>
> On 11/03/2016 00:25, Ian Kelly wrote:
>>
>> On Mar 10, 2016 5:15 PM, "Fillmore" <fillmore_remove@hotmail.com> wrote:
>>>
>>>
>>>
>>> Here's another handy Perl regex which I am not sure how to translate to
>>> Python.
>>>
>>>
>>> I use it to avoid processing lines that contain funny chars...
>>>
>>> if ($string =~ /[^[:print:]]/) {next OUTER;}
>>
>>
>> Python's re module doesn't support POSIX character classes, but the regex
>> module on PyPI does.
>>
>> https://pypi.python.org/pypi/regex
>>
>
> There are plenty of testers for the re module, but do you know if there
> are any available for the above, as it's not the easiest thing to search
> for?
No idea.
| From | Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> |
|---|---|
| Date | 2016-03-11 13:13 +0100 |
| Message-ID | <mailman.17.1457698399.26429.python-list@python.org> |
| In reply to | #104564 |
One lesson for Perl regex users is that in Python many things can be
solved without regexes. How about defining:
    printable = {chr(n) for n in range(32, 127)}
then using:
    if (set(my_string) - set(printable)):
        break
On 11.03.2016 01:07, Fillmore wrote:
>
> Here's another handy Perl regex which I am not sure how to translate to
> Python.
>
> I use it to avoid processing lines that contain funny chars...
>
> if ($string =~ /[^[:print:]]/) {next OUTER;}
>
> :)
>
| From | Fillmore <fillmore_remove@hotmail.com> |
|---|---|
| Date | 2016-03-11 09:23 -0500 |
| Message-ID | <nbukcd$gs2$1@gioia.aioe.org> |
| In reply to | #104611 |
On 03/11/2016 07:13 AM, Wolfgang Maier wrote:
> One lesson for Perl regex users is that in Python many things can be solved without regexes.
> How about defining:
>
> printable = {chr(n) for n in range(32, 127)}
>
> then using:
>
> if (set(my_string) - set(printable)):
>     break
Seems computationally heavy. I have a file with about 70k lines, of which only 20 contain "funny" chars.
Any idea on how I can create a script that compares Perl speed vs. Python speed
in performing the cleaning operation?
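One way to measure the Python side is the standard timeit module; the Perl script would have to be timed separately, for example with the shell's `time` command. The sketch below runs on synthetic data shaped like the file described above (70,000 lines, 20 of them containing a control character). The data and the three check functions are illustrative assumptions, not anyone's actual script, and the timings will vary by machine.

```python
import re
import timeit

# Hypothetical test data: 70,000 lines, 20 of them with a BEL character.
lines = ["an ordinary line of text %d" % i for i in range(70000)]
for i in range(0, 70000, 3500):
    lines[i] += "\x07"

printable = {chr(n) for n in range(32, 127)}
non_printable = re.compile(r"[^\x20-\x7E]")

def by_set(line):
    return bool(set(line) - printable)

def by_regex(line):
    return non_printable.search(line) is not None

def by_isprintable(line):
    # Note: str.isprintable() also accepts printable non-ASCII text.
    return not line.isprintable()

def count_funny(check):
    return sum(1 for line in lines if check(line))

for check in (by_set, by_regex, by_isprintable):
    seconds = timeit.timeit(lambda: count_funny(check), number=3)
    print("%-15s %d funny lines, %.3f s for 3 passes"
          % (check.__name__, count_funny(check), seconds))
```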
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2016-03-11 16:22 +0100 |
| Message-ID | <mailman.24.1457709748.26429.python-list@python.org> |
| In reply to | #104618 |
Fillmore wrote:
> On 03/11/2016 07:13 AM, Wolfgang Maier wrote:
>> One lesson for Perl regex users is that in Python many things can be
>> solved without regexes. How about defining:
>>
>> printable = {chr(n) for n in range(32, 127)}
>>
>> then using:
>>
>> if (set(my_string) - set(printable)):
>>     break
>
> seems computationally heavy. I have a file with about 70k lines, of which
> only 20 contain "funny" chars.
>
> Any idea on how I can create a script that compares Perl speed vs. Python
> speed in performing the cleaning operation?
Try
    for line in ...:
        if has_nonprint(line):
            continue
        ...
with the has_nonprint() function as defined below:
    $ cat isprint.py
    import sys
    import unicodedata
    class Lookup(dict):
        def __missing__(self, n):
            c = chr(n)
            cat = unicodedata.category(c)
            if cat in {'Cs', 'Cn', 'Zl', 'Cc', 'Zp'}:
                self[n] = c
                return c
            else:
                self[n] = None
                return None
    lookup = Lookup()
    lookup[10] = None # allow newline
    def has_nonprint(s):
        return bool(s.translate(lookup))
    $ python3 -i isprint.py
    >>> has_nonprint("foo")
    False
    >>> has_nonprint("foo\n")
    False
    >>> has_nonprint("foo\t")
    True
    >>> has_nonprint("\0foo")
    True
| From | Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> |
|---|---|
| Date | 2016-03-11 17:34 +0100 |
| Message-ID | <mailman.26.1457714063.26429.python-list@python.org> |
| In reply to | #104618 |
On 11.03.2016 15:23, Fillmore wrote:
> On 03/11/2016 07:13 AM, Wolfgang Maier wrote:
>> One lesson for Perl regex users is that in Python many things can be
>> solved without regexes.
>> How about defining:
>>
>> printable = {chr(n) for n in range(32, 127)}
>>
>> then using:
>>
>> if (set(my_string) - set(printable)):
>>     break
>
> seems computationally heavy. I have a file with about 70k lines, of
> which only 20 contain "funny" chars.
>
Not sure what you call computationally heavy. I just test-parsed a 30 MB
file (28k lines) with:
    with open(my_file) as i:
        for line in i:
            if set(line) - printable:
                continue
and it finished in less than a second.
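One detail to watch when adapting this loop: lines read from a text file keep their trailing "\n", and code point 10 lies outside range(32, 127), so the set difference is non-empty for every newline-terminated line. A minimal sketch that strips the terminator first, with io.StringIO standing in for the open file and made-up sample text:

```python
import io

printable = {chr(n) for n in range(32, 127)}

# io.StringIO stands in for an open text file in this sketch.
sample = io.StringIO("clean line\nbell\x07 here\nanother clean line\n")

kept = []
for line in sample:
    # Strip the line terminator first; "\n" is not in the printable set.
    text = line.rstrip("\n")
    if set(text) - printable:
        continue
    kept.append(text)

print(kept)
```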
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2016-03-11 10:08 -0700 |
| Message-ID | <mailman.28.1457716135.26429.python-list@python.org> |
| In reply to | #104618 |
On Fri, Mar 11, 2016 at 9:34 AM, Wolfgang Maier
<wolfgang.maier@biologie.uni-freiburg.de> wrote:
> On 11.03.2016 15:23, Fillmore wrote:
>>
>> On 03/11/2016 07:13 AM, Wolfgang Maier wrote:
>>>
>>> One lesson for Perl regex users is that in Python many things can be
>>> solved without regexes.
>>> How about defining:
>>>
>>> printable = {chr(n) for n in range(32, 127)}
>>>
>>> then using:
>>>
>>> if (set(my_string) - set(printable)):
>>>     break
>>
>>
>> seems computationally heavy. I have a file with about 70k lines, of
>> which only 20 contain "funny" chars.
>>
>
> Not sure what you call computationally heavy. I just test-parsed a 30 MB
> file (28k lines) with:
>
> with open(my_file) as i:
>     for line in i:
>         if set(line) - printable:
>             continue
>
> and it finished in less than a second.
Did your test file contain on the order of 100 unique characters, or
on the order of 100,000? Granted that most input data would likely
fall into the former category.
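If the cost of building a set per line is a concern, one alternative is to test characters one at a time and stop at the first offender. This is only a sketch; whether it beats the set difference depends on the data, since set(line) runs in C while the generator below runs in the interpreter. The function name has_nonprintable is made up for illustration.

```python
printable = frozenset(chr(n) for n in range(32, 127))

def has_nonprintable(text):
    # Stops at the first character that is not in the printable set.
    return any(ch not in printable for ch in text)

print(has_nonprintable("hello world"))
print(has_nonprintable("hello\x00world"))
```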
| From | Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> |
|---|---|
| Date | 2016-03-11 13:17 +0100 |
| Message-ID | <mailman.18.1457698808.26429.python-list@python.org> |
| In reply to | #104564 |
On 11.03.2016 13:13, Wolfgang Maier wrote:
> One lesson for Perl regex users is that in Python many things can be
> solved without regexes. How about defining:
>
> printable = {chr(n) for n in range(32, 127)}
>
> then using:
>
> if (set(my_string) - set(printable)):
>     break
>
Err, I meant:
    if (set(my_string) - printable):
        break
of course. No need to attempt another set conversion.
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2016-03-11 14:47 +0200 |
| Message-ID | <8737rxgp0a.fsf@elektro.pacujo.net> |
| In reply to | #104612 |
Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de>:
> On 11.03.2016 13:13, Wolfgang Maier wrote:
>> One lesson for Perl regex users is that in Python many things can be
>> solved without regexes. How about defining:
>>
>> printable = {chr(n) for n in range(32, 127)}
>>
>> then using:
>>
>> if (set(my_string) - set(printable)):
>>     break
>>
>
> Err, I meant:
>
> if (set(my_string) - printable):
>     break
>
> of course. No need to attempt another set conversion.
Most non-ASCII characters are printable, or at least a good many.
Unfortunately, "printable" doesn't seem to be a Unicode category.
Marko
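Although there is no single "printable" category, a test can be built from the general categories that the unicodedata module exposes. The sketch below assumes the rule documented for str.isprintable(): characters in the "Other" (C*) and "Separator" (Z*) groups count as nonprintable, except the ASCII space. The helper name is_printable_char is made up for illustration.

```python
import unicodedata

def is_printable_char(ch):
    # Assumption: mirror the rule documented for str.isprintable().
    if ch == " ":
        return True
    return unicodedata.category(ch)[0] not in ("C", "Z")

# ascii() keeps the output plain ASCII whatever the terminal encoding is.
for ch in ["a", " ", "\t", "\u00e9", "\u3053", "\u00a0", "\u200b"]:
    print(ascii(ch), unicodedata.category(ch),
          is_printable_char(ch), ch.isprintable())
```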
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2016-03-11 19:23 +0000 |
| Message-ID | <mailman.0.1457724244.12893.python-list@python.org> |
| In reply to | #104564 |
On 2016-03-11 00:07, Fillmore wrote:
>
> Here's another handy Perl regex which I am not sure how to translate to
> Python.
>
> I use it to avoid processing lines that contain funny chars...
>
> if ($string =~ /[^[:print:]]/) {next OUTER;}
>
> :)
>
Python 3 (Unicode) strings have an .isprintable method:
mystring.isprintable()
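As a rough sketch of how this could replace the Perl test in a line loop: unlike an ASCII-only check, isprintable() accepts printable non-ASCII text, and it treats tab and newline as nonprintable, so the terminator has to come off before testing. The sample lines below are made up.

```python
lines = ["plain ascii\n", "caf\u00e9 au lait\n", "tab\tseparated\n", "bell\x07\n"]

kept = []
for line in lines:
    # "\n" itself counts as nonprintable, so drop it before testing.
    text = line.rstrip("\n")
    if not text.isprintable():
        continue
    kept.append(text)

# ascii() keeps the output plain ASCII whatever the terminal encoding is.
print([ascii(text) for text in kept])
```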
| From | Fillmore <fillmore_remove@hotmail.com> |
|---|---|
| Date | 2016-03-11 14:36 -0500 |
| Message-ID | <nbv6n7$1eu7$1@gioia.aioe.org> |
| In reply to | #104627 |
On 3/11/2016 2:23 PM, MRAB wrote:
> On 2016-03-11 00:07, Fillmore wrote:
>>
>> Here's another handy Perl regex which I am not sure how to translate to
>> Python.
>>
>> I use it to avoid processing lines that contain funny chars...
>>
>> if ($string =~ /[^[:print:]]/) {next OUTER;}
>>
>> :)
>>
> Python 3 (Unicode) strings have an .isprintable method:
>
> mystring.isprintable()
>
My strings are UTF-8. Will it work there too?
| From | Ben Finney <ben+python@benfinney.id.au> |
|---|---|
| Date | 2016-03-12 06:52 +1100 |
| Message-ID | <mailman.2.1457725975.12893.python-list@python.org> |
| In reply to | #104629 |
Fillmore <fillmore_remove@hotmail.com> writes:
> On 3/11/2016 2:23 PM, MRAB wrote:
> > Python 3 (Unicode) strings have an .isprintable method:
> >
> > mystring.isprintable()
>
> my strings are UTF-8. Will it work there too?
You need to always be clear on the difference between text (the Python 3
‘str’ type) versus bytes.
It only makes sense to talk about an encoding when talking about bytes.
Text itself is an abstract data type; the content of a Unicode string
does not have any encoding because it is not encoded.
The content of a byte stream (such as a file's content) is not text, it
is bytes.
    >>> foo = "こんにちは"
    >>> foo.isprintable()
    True
    >>> foo_encoded = foo.encode("utf-8")
    >>> foo_encoded.isprintable()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'bytes' object has no attribute 'isprintable'
You can only ask ‘isprintable’ about text. Bytes are not printable
because bytes are not text; you need to decode the bytes to text before
asking whether that text is printable.
    >>> infile = open('lorem.txt', 'rb')
    >>> infile_bytes = infile.read()
    >>> infile_bytes.isprintable()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'bytes' object has no attribute 'isprintable'
    >>> infile = open('lorem.txt', 'rt', encoding="utf-8")
    >>> infile_text = infile.read()
    >>> infile_text.isprintable()
    True
--
\ “Telling pious lies to trusting children is a form of abuse, |
`\ plain and simple.” —Daniel Dennett, 2010-01-12 |
_o__) |
Ben Finney
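Applied to the original task of skipping lines with funny characters, the decode-then-test idea might look like the sketch below. Two assumptions are worth stating: the input is handled as raw bytes and decoded one line at a time, and lines that are not valid UTF-8 are skipped as well. The function name printable_lines and the sample data are made up. Note also that isprintable() reports "\n" as nonprintable, so calling it on the result of read() for a whole file only returns True if the file holds no line terminators.

```python
def printable_lines(raw, encoding="utf-8"):
    """Yield decoded lines of raw bytes that are valid and printable."""
    # bytes.splitlines() drops the line terminators.
    for raw_line in raw.splitlines():
        try:
            text = raw_line.decode(encoding)
        except UnicodeDecodeError:
            # Bytes that are not valid in the encoding are skipped too.
            continue
        if text.isprintable():
            yield text

# "\u3053\u3093\u306b\u3061\u306f" is the greeting used in the session above.
greeting = "\u3053\u3093\u306b\u3061\u306f"
data = (greeting + "\nbell\x07\n").encode("utf-8") + b"bad \xff byte\nplain\n"

# ascii() keeps the output plain ASCII whatever the terminal encoding is.
print([ascii(line) for line in printable_lines(data)])
```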