Groups > comp.lang.python > #106778 > unrolled thread
| Started by | geshdus@gmail.com |
|---|---|
| First post | 2016-04-10 04:37 -0700 |
| Last post | 2016-04-10 17:52 +0200 |
| Articles | 5 — 4 participants |
function to remove and punctuation geshdus@gmail.com - 2016-04-10 04:37 -0700
Re: function to remove and punctuation Steven D'Aprano <steve@pearwood.info> - 2016-04-10 22:08 +1000
Re: function to remove and punctuation Peter Otten <__peter__@web.de> - 2016-04-10 14:35 +0200
Re: function to remove and punctuation Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2016-04-10 16:23 +0200
Re: function to remove and punctuation Peter Otten <__peter__@web.de> - 2016-04-10 17:52 +0200
| From | geshdus@gmail.com |
|---|---|
| Date | 2016-04-10 04:37 -0700 |
| Subject | function to remove and punctuation |
| Message-ID | <3af95726-6f5c-4a2d-bf42-061efedd13b1@googlegroups.com> |
how to write a function taking a string parameter, which returns it after you delete the spaces, punctuation marks, accented characters in python ?
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2016-04-10 22:08 +1000 |
| Message-ID | <570a425d$0$1612$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #106778 |
On Sun, 10 Apr 2016 09:37 pm, geshdus@gmail.com wrote:
> how to write a function taking a string parameter, which returns it after
> you delete the spaces, punctuation marks, accented characters in python ?
In your text editor, open a new file.
Now bash your fingers onto the keyboard so that letters appear in the file.
When you have the function, click Save.
(Sorry, I couldn't resist.)
Here is one to get you started:
def remove_punctuation(the_string):
    the_string = the_string.replace(".", "")
    return the_string
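One way the starter above could be extended, as a minimal sketch: loop over `string.punctuation` from the standard library instead of handling only ".". This assumes that "punctuation" means the ASCII punctuation characters in that constant. Spaces and accented characters are not handled here, and the sample sentence is made up for illustration.

```python
import string


def remove_punctuation(the_string):
    # Remove each ASCII punctuation character listed in string.punctuation.
    # Assumption: ASCII punctuation is all that needs to go; spaces and
    # accented characters are deliberately left untouched.
    for mark in string.punctuation:
        the_string = the_string.replace(mark, "")
    return the_string


print(remove_punctuation("Hello, world! (test)"))
```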
--
Steven
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2016-04-10 14:35 +0200 |
| Message-ID | <mailman.2.1460291770.6211.python-list@python.org> |
| In reply to | #106778 |
geshdus@gmail.com wrote:
> how to write a function taking a string parameter, which returns it after
> you delete the spaces, punctuation marks, accented characters in python ?
Looks like you want to remove more characters than you want to keep. In this
case I'd decide what characters to keep first, e.g. (assuming Python 3):
>>> import string
>>> keep = string.ascii_letters + string.digits
>>> keep
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
Now you can iterate over the characters and check for each one whether you
want to keep it:
>>> def clean(s, keep):
...     return "".join(c for c in s if c in keep)
...
>>> clean("<alpha> äöü ::42", keep)
'alpha42'
>>> clean("<alpha> äöü ::42", string.ascii_letters)
'alpha'
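The same idea as a standalone script, as a rough sketch: the only change is that the characters to keep are stored in a `frozenset`, which gives constant-time membership tests. The constant name `KEEP` and the default argument are choices made for this illustration, not part of the session above.

```python
import string

# Assumption for this sketch: keep only ASCII letters and digits.
KEEP = frozenset(string.ascii_letters + string.digits)


def clean(s, keep=KEEP):
    """Return s with every character that is not in keep removed."""
    return "".join(c for c in s if c in keep)


print(clean("<alpha> äöü ::42"))
print(clean("<alpha> äöü ::42", frozenset(string.ascii_letters)))
```

Because accented letters such as "ä" are not in `string.ascii_letters`, they are dropped along with spaces and punctuation.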
If you are dealing with a lot of text you can make this a bit more efficient
with the str.translate() method. Create a mapping that maps all characters
that you want to keep to themselves
>>> m = str.maketrans(keep, keep)
>>> m[ord("a")]
97
>>> m[ord(">")]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 62
and all characters that you want to discard to None
>>> from collections import defaultdict
>>> trans = defaultdict(lambda: None, m)
>>> trans[ord("s")]
115
>>> trans[ord("ß")] # returns None, so nothing is printed
>>>
Now pass it to the translate() method:
>>> "<alpha> äöü ::42".translate(trans)
'alpha42'
You changed your mind and want to translate " " to "_"? Here's how:
>>> trans[ord(" ")] = "_"
>>> "<alpha> äöü ::42".translate(trans)
'alpha__42'
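The interactive steps above could be collected into a small script along these lines. This is only a sketch: `make_table` and its `replacements` parameter are hypothetical helper names introduced here, while `str.maketrans`, `str.translate` and `collections.defaultdict` are the standard-library pieces used in the session.

```python
import string
from collections import defaultdict

KEEP = string.ascii_letters + string.digits


def make_table(keep=KEEP, replacements=None):
    # Kept characters map to themselves; anything else falls through to the
    # default factory and maps to None, which str.translate treats as "delete".
    table = defaultdict(lambda: None, str.maketrans(keep, keep))
    if replacements:
        # Optional overrides, e.g. {" ": "_"} (hypothetical parameter).
        table.update({ord(char): new for char, new in replacements.items()})
    return table


text = "<alpha> äöü ::42"
print(text.translate(make_table()))
print(text.translate(make_table(replacements={" ": "_"})))
```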
| From | Thomas 'PointedEars' Lahn <PointedEars@web.de> |
|---|---|
| Date | 2016-04-10 16:23 +0200 |
| Message-ID | <4030166.z2W5J3t6Yo@PointedEars.de> |
| In reply to | #106780 |
Peter Otten wrote:
> geshdus@gmail.com wrote:
>> how to write a function taking a string parameter, which returns it after
>> you delete the spaces, punctuation marks, accented characters in python ?
>
> Looks like you want to remove more characters than you want to keep. In
> this case I'd decide what characters to keep first, e. g. (assuming
> Python 3)
However, with *that* approach (which is different from the OP’s request),
regular expression matching might turn out to be more efficient:
-----------------------------------------------------------
import re
print("".join(re.findall(r'[a-z]+', "...", re.IGNORECASE)))
-----------------------------------------------------------
With the OP’s original request, regular expressions may still be the better
approach.
For example:
----------------------------------------------------------------------
import re
print("".join(re.sub(r'[\s,;.?!ÀÁÈÉÌÍÒÓÙÚÝ]+', "", "...",
flags=re.IGNORECASE)))
----------------------------------------------------------------------
or
----------------------------------------------------------------------
import re
print("".join(re.findall(r'[^\s,;.?!ÀÁÈÉÌÍÒÓÙÚÝ]+', "...",
                         flags=re.IGNORECASE)))
----------------------------------------------------------------------
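For illustration, the two regular-expression variants can be written as functions and run on a concrete string. This is a sketch under assumptions: the character class is copied from the snippets above and is not a complete list of punctuation or accented letters, the function names are invented here, and the sample text stands in for the "..." placeholder.

```python
import re

# Character class copied from the snippets above: whitespace, a few
# punctuation marks and some accented capitals. With re.IGNORECASE the
# lowercase forms of those letters match as well.
UNWANTED = r"\s,;.?!ÀÁÈÉÌÍÒÓÙÚÝ"

remove_unwanted = re.compile("[" + UNWANTED + "]+", re.IGNORECASE)
wanted_runs = re.compile("[^" + UNWANTED + "]+", re.IGNORECASE)


def strip_by_sub(text):
    # Delete every run of unwanted characters.
    return remove_unwanted.sub("", text)


def strip_by_findall(text):
    # Collect every run of wanted characters and join the runs.
    return "".join(wanted_runs.findall(text))


sample = "Hello, world! É"  # made-up sample text
print(strip_by_sub(sample))
print(strip_by_findall(sample))
```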
> >>> import string
> >>> keep = string.ascii_letters + string.digits
> >>> keep
> 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
>
> Now you can iterate over the characters and check if you want to preserve
> it for each of them:
The good thing about this part of the approach you suggested is that you can
build regular expressions from strings, too:
keep = '[' + 'a-z' + r'\d' + ']'
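A possible way to derive such a pattern from the existing `keep` string, as a sketch: wrap it in a negated character class and delete everything that matches. `re.escape` is used so that characters with a special meaning inside a class would be escaped; the function name `clean` mirrors the earlier example. Note that in Python 3 `\d` in a str pattern matches any Unicode decimal digit unless `re.ASCII` is set, so building the class from `string.digits` is the narrower choice.

```python
import re
import string

keep = string.ascii_letters + string.digits

# Negated class: one or more characters that are NOT in keep.
not_kept = re.compile("[^" + re.escape(keep) + "]+")


def clean(s):
    # Delete every run of characters outside keep.
    return not_kept.sub("", s)


print(clean("<alpha> äöü ::42"))
```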
> >>> def clean(s, keep):
> ...     return "".join(c for c in s if c in keep)
> ...
Why would one prefer this over "".join(filter(lambda c: c in keep, s))?
> >>> clean("<alpha> äöü ::42", keep)
> 'alpha42'
> >>> clean("<alpha> äöü ::42", string.ascii_letters)
> 'alpha'
>
> If you are dealing with a lot of text you can make this a bit more
> efficient with the str.translate() method. Create a mapping that maps all
> characters that you want to keep to themselves
>
> >>> m = str.maketrans(keep, keep)
> >>> m[ord("a")]
> 97
> >>> m[ord(">")]
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> KeyError: 62
>
> and all characters that you want to discard to None
Why would creating a *larger* list for *more* operations be *more*
efficient?
--
PointedEars
Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2016-04-10 17:52 +0200 |
| Message-ID | <mailman.7.1460303563.6211.python-list@python.org> |
| In reply to | #106788 |
Thomas 'PointedEars' Lahn wrote:
> Peter Otten wrote:
>
>> geshdus@gmail.com wrote:
>>> how to write a function taking a string parameter, which returns it
>>> after you delete the spaces, punctuation marks, accented characters in
>>> python ?
>>
>> Looks like you want to remove more characters than you want to keep. In
>> this case I'd decide what characters to keep first, e. g. (assuming
>> Python 3)
>
> However, with *that* approach (which is different from the OP’s request),
> regular expression matching might turn out to be more efficient:
>
> -----------------------------------------------------------
> import re
> print("".join(re.findall(r'[a-z]+', "...", re.IGNORECASE)))
> -----------------------------------------------------------
>
> With the OP’s original request, they may still be the better approach.
> For example:
>
> ----------------------------------------------------------------------
> import re
> print("".join(re.sub(r'[\s,;.?!ÀÁÈÉÌÍÒÓÙÚÝ]+', "", "...",
> flags=re.IGNORECASE)))
> ----------------------------------------------------------------------
>
> or
>
> ----------------------------------------------------------------------
> import re
> print("".join(re.findall(r'[^\s,;.?!ÀÁÈÉÌÍÒÓÙÚÝ]+', "...",
>                          flags=re.IGNORECASE)))
> ----------------------------------------------------------------------
>
>> >>> import string
>> >>> keep = string.ascii_letters + string.digits
>> >>> keep
>> 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
>>
>> Now you can iterate over the characters and check if you want to preserve
>> it for each of them:
>
> The good thing about this part of the approach you suggested is that you
> can build regular expressions from strings, too:
>
> keep = '[' + 'a-z' + r'\d' + ']'
>
>> >>> def clean(s, keep):
>> ...     return "".join(c for c in s if c in keep)
>> ...
>
> Why would one prefer this over "".join(filter(lambda c: c in keep, s))?
Because it's idiomatic Python and easy to understand if you are coming from
the imperative
buf = []
for c in s:
    if c in keep:
        buf.append(c)
"".join(buf)
Because it uses Python syntax instead of the filter/map/reduce trio.
Because it avoids the extra function call (the lambda) though the speed
difference is not as big as I expected:
$ python3 -m timeit -s 'import string; keep = string.ascii_letters + string.digits; s = "alphabet soup ä" * 1000' '"".join(filter(lambda c: c in keep, s))'
100 loops, best of 3: 4.66 msec per loop
$ python3 -m timeit -s 'import string; keep = string.ascii_letters + string.digits; s = "alphabet soup ä" * 1000' '"".join(c for c in s if c in keep)'
100 loops, best of 3: 3.11 msec per loop
For reference here is a variant using regular expressions (picked at random,
feel free to find a faster one):
$ python3 -m timeit -s 'import string, re; keep = string.ascii_letters + string.digits; s = "alphabet soup ä" * 1000; sub=re.compile(r"[^a-zA-Z0-9]+").sub' 'sub("", s)'
1000 loops, best of 3: 1.65 msec per loop
And finally str.translate():
$ python3 -m timeit -s 'import string, collections as c; keep = string.ascii_letters + string.digits; s = "alphabet soup ä" * 1000; trans = c.defaultdict(lambda: None, str.maketrans(keep, keep))' 's.translate(trans)'
1000 loops, best of 3: 997 usec per loop
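To repeat the comparison on another machine, the four command lines can be folded into one script, roughly as follows. This is a sketch: the labels in `candidates` are invented for the printout, the repeat counts are arbitrary, and each statement is wrapped in a lambda, which adds a small constant overhead that the command-line runs did not have. Absolute numbers will differ from those quoted above.

```python
import re
import string
import timeit
from collections import defaultdict

keep = string.ascii_letters + string.digits
s = "alphabet soup ä" * 1000

sub = re.compile(r"[^a-zA-Z0-9]+").sub
trans = defaultdict(lambda: None, str.maketrans(keep, keep))

# Labels are made up for this sketch; the expressions come from the post.
candidates = {
    "filter + lambda": lambda: "".join(filter(lambda c: c in keep, s)),
    "generator": lambda: "".join(c for c in s if c in keep),
    "re.sub": lambda: sub("", s),
    "str.translate": lambda: s.translate(trans),
}

# Sanity check: all four variants must produce the same cleaned string.
results = {name: func() for name, func in candidates.items()}
assert len(set(results.values())) == 1

for name, func in candidates.items():
    # Best of 3 runs of 20 calls each, reported per call.
    best = min(timeit.repeat(func, number=20, repeat=3)) / 20
    print(f"{name:16s} {best * 1e3:.3f} msec per call")
```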
>> >>> clean("<alpha> äöü ::42", keep)
>> 'alpha42'
>> >>> clean("<alpha> äöü ::42", string.ascii_letters)
>> 'alpha'
>>
>> If you are dealing with a lot of text you can make this a bit more
>> efficient with the str.translate() method. Create a mapping that maps all
>> characters that you want to keep to themselves
>>
>> >>> m = str.maketrans(keep, keep)
>> >>> m[ord("a")]
>> 97
>> >>> m[ord(">")]
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> KeyError: 62
>>
>> and all characters that you want to discard to None
>
> Why would creating a *larger* list for *more* operations be *more*
> efficient?
>
I don't understand the question. If you mean that the trans dict may become
large -- that depends on the input data. The characters to be deleted are
lazily added to the defaultdict. For text in European languages the total
size should stay well below 256 entries. But you are probably aiming at
something else...
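The lazy growth can be observed directly, as a small sketch: a lookup of a missing key in a `defaultdict` stores the default value, so every distinct discarded character that `translate()` looks up ends up as an entry. The variable names `size_before` and `size_after` are chosen for this illustration only.

```python
import string
from collections import defaultdict

keep = string.ascii_letters + string.digits
trans = defaultdict(lambda: None, str.maketrans(keep, keep))

size_before = len(trans)  # one entry per character in keep
cleaned = "<alpha> äöü ::42".translate(trans)
size_after = len(trans)   # plus one entry per distinct discarded character

print(cleaned)
print(size_before, size_after)
print(ord("ä") in trans)  # membership tests do not add entries themselves
```

The table grows only with the number of distinct characters seen, not with the length of the text.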