Groups > comp.lang.python > #106778 > unrolled thread

function to remove and punctuation

Started by	geshdus@gmail.com
First post	2016-04-10 04:37 -0700
Last post	2016-04-10 17:52 +0200
Articles	5 — 4 participants


  function to remove and punctuation geshdus@gmail.com - 2016-04-10 04:37 -0700
    Re: function to remove and punctuation Steven D'Aprano <steve@pearwood.info> - 2016-04-10 22:08 +1000
    Re: function to remove and punctuation Peter Otten <__peter__@web.de> - 2016-04-10 14:35 +0200
      Re: function to remove and punctuation Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2016-04-10 16:23 +0200
        Re: function to remove and punctuation Peter Otten <__peter__@web.de> - 2016-04-10 17:52 +0200

#106778 — function to remove and punctuation

From	geshdus@gmail.com
Date	2016-04-10 04:37 -0700
Subject	function to remove and punctuation
Message-ID	<3af95726-6f5c-4a2d-bf42-061efedd13b1@googlegroups.com>

How do I write a function in Python that takes a string parameter and returns it with the spaces, punctuation marks, and accented characters deleted?


#106779

From	Steven D'Aprano <steve@pearwood.info>
Date	2016-04-10 22:08 +1000
Message-ID	<570a425d$0$1612$c3e8da3$5496439d@news.astraweb.com>
In reply to	#106778

On Sun, 10 Apr 2016 09:37 pm, geshdus@gmail.com wrote:

> how to write a function taking a string parameter, which returns it after
> you delete the spaces, punctuation marks, accented characters in python ?

In your text editor, open a new file.

Now bash your fingers onto the keyboard so that letters appear in the file. 

When you have the function, click Save.

(Sorry, I couldn't resist.)

Here is one to get you started:

def remove_punctuation(the_string):
    the_string = the_string.replace(".", "")
    return the_string
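
A minimal sketch of how that starter might be extended, assuming "punctuation" means the ASCII marks in string.punctuation (the name remove_ascii_punctuation is made up for illustration and does not come from the thread):

```python
import string

def remove_ascii_punctuation(the_string):
    # Hypothetical helper: apply the same replace() idea once for
    # every character in string.punctuation (ASCII punctuation only).
    for mark in string.punctuation:
        the_string = the_string.replace(mark, "")
    return the_string

print(remove_ascii_punctuation("Hello, world!"))
```

Spaces and accented characters are left alone here; they would need their own handling.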

-- 
Steven


#106780

From	Peter Otten <__peter__@web.de>
Date	2016-04-10 14:35 +0200
Message-ID	<mailman.2.1460291770.6211.python-list@python.org>
In reply to	#106778

geshdus@gmail.com wrote:

> how to write a function taking a string parameter, which returns it after
> you delete the spaces, punctuation marks, accented characters in python ?

Looks like you want to remove more characters than you want to keep. In this 
case I'd decide what characters to keep first, e. g. (assuming Python 3)

>>> import string
>>> keep = string.ascii_letters + string.digits
>>> keep
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'

Now you can iterate over the characters and check for each of them whether 
you want to preserve it:

>>> def clean(s, keep):
...     return "".join(c for c in s if c in keep)
... 
>>> clean("<alpha> äöü ::42", keep)
'alpha42'
>>> clean("<alpha> äöü ::42", string.ascii_letters)
'alpha'
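
Note that clean() drops "äöü" entirely. If the intent were instead to keep the base letters and lose only the accents (an assumption; the original question can be read either way), a sketch using the standard unicodedata module could look like this. The name strip_accents is hypothetical:

```python
import unicodedata

def strip_accents(s):
    # Decompose each character (NFD), then drop the combining marks,
    # so "ä" becomes "a" instead of disappearing.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents("äöü"))
```

Characters without a decomposition, such as "ß", pass through unchanged, so this would be combined with clean() rather than replace it.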

If you are dealing with a lot of text you can make this a bit more efficient 
with the str.translate() method. Create a mapping that maps all characters 
that you want to keep to themselves

>>> m = str.maketrans(keep, keep)
>>> m[ord("a")]
97
>>> m[ord(">")]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 62

and all characters that you want to discard to None

>>> from collections import defaultdict
>>> trans = defaultdict(lambda: None, m)
>>> trans[ord("s")]
115
>>> trans[ord("ß")] # returns None, so nothing is printed
>>> 

Now pass it to the translate() method:

>>> "<alpha> äöü ::42".translate(trans)
'alpha42'

You changed your mind and want to translate " " to "_"? Here's how:
>>> trans[ord(" ")] = "_"
>>> "<alpha> äöü ::42".translate(trans)
'alpha__42'
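
Put together, the translate() steps might be wrapped up as follows. This is only a sketch; make_cleaner and clean_alnum are illustrative names, not part of the standard library:

```python
import string
from collections import defaultdict

def make_cleaner(keep):
    # Map every kept character to itself; any other character falls
    # through to the defaultdict factory, which returns None (= delete).
    table = defaultdict(lambda: None, str.maketrans(keep, keep))

    def cleaner(s):
        return s.translate(table)

    return cleaner

clean_alnum = make_cleaner(string.ascii_letters + string.digits)
print(clean_alnum("<alpha> äöü ::42"))
```

Building the table once and reusing it is the point: the setup cost is paid a single time, however many strings are cleaned afterwards.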


#106788

From	Thomas 'PointedEars' Lahn <PointedEars@web.de>
Date	2016-04-10 16:23 +0200
Message-ID	<4030166.z2W5J3t6Yo@PointedEars.de>
In reply to	#106780

Peter Otten wrote:

> geshdus@gmail.com wrote:
>> how to write a function taking a string parameter, which returns it after
>> you delete the spaces, punctuation marks, accented characters in python ?
> 
> Looks like you want to remove more characters than you want to keep. In
> this case I'd decide what characters to keep first, e. g. (assuming
> Python 3)

However, with *that* approach (which is different from the OP’s request), 
regular expression matching might turn out to be more efficient:

-----------------------------------------------------------
import re
print("".join(re.findall(r'[a-z]+', "...", re.IGNORECASE)))
-----------------------------------------------------------

With the OP’s original request, regular expressions may still be the better approach.
For example:

----------------------------------------------------------------------
import re
print(re.sub(r'[\s,;.?!ÀÁÈÉÌÍÒÓÙÚÝ]+', "", "...",
             flags=re.IGNORECASE))
----------------------------------------------------------------------

or

----------------------------------------------------------------------
import re
print("".join(re.findall(r'[^\s,;.?!ÀÁÈÉÌÍÒÓÙÚÝ]+', "...",
                         flags=re.IGNORECASE)))
----------------------------------------------------------------------
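
As a runnable sketch of the deletion variant with a concrete input in place of "...": the character class is the one from the post, the sample string and the name strip_listed are assumptions made for illustration.

```python
import re

# Characters to delete: whitespace, a few punctuation marks, and the
# accented capitals listed above (IGNORECASE also covers lowercase).
DELETE = re.compile(r'[\s,;.?!ÀÁÈÉÌÍÒÓÙÚÝ]+', re.IGNORECASE)

def strip_listed(text):
    # re.sub() already returns a str, so no "".join() is needed.
    return DELETE.sub("", text)

print(strip_listed("Hello, World! É é"))
```

Only the characters actually listed in the class are removed; other accented letters such as "ä" would survive.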

>>>> import string
>>>> keep = string.ascii_letters + string.digits
>>>> keep
> 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
> 
> Now you can iterate over the characters and check if you want to preserve
> it for each of them:

The good thing about this part of the approach you suggested is that you can 
build regular expressions from strings, too:

  keep = '[' + 'a-z' + r'\d' + ']'
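
A sketch of that idea with the string from the earlier post, assuming the goal is to delete everything outside the set; re.escape() is not needed for letters and digits but guards against characters that are special inside [...]:

```python
import re
import string

keep = string.ascii_letters + string.digits

# Negated character class built from the string of characters to keep.
discard = re.compile('[^' + re.escape(keep) + ']+')

print(discard.sub("", "<alpha> äöü ::42"))
```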
 
>>>> def clean(s, keep):
> ...     return "".join(c for c in s if c in keep)
> ...

Why would one prefer this over "".join(filter(lambda c: c in keep, s))?

>>>> clean("<alpha> äöü ::42", keep)
> 'alpha42'
>>>> clean("<alpha> äöü ::42", string.ascii_letters)
> 'alpha'
> 
> If you are dealing with a lot of text you can make this a bit more
> efficient with the str.translate() method. Create a mapping that maps all
> characters that you want to keep to themselves
> 
>>>> m = str.maketrans(keep, keep)
>>>> m[ord("a")]
> 97
>>>> m[ord(">")]
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> KeyError: 62
> 
> and all characters that you want to discard to None

Why would creating a *larger* list for *more* operations be *more* 
efficient? 

-- 
PointedEars

Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.


#106789

From	Peter Otten <__peter__@web.de>
Date	2016-04-10 17:52 +0200
Message-ID	<mailman.7.1460303563.6211.python-list@python.org>
In reply to	#106788

Thomas 'PointedEars' Lahn wrote:

> Peter Otten wrote:
> 
>> geshdus@gmail.com wrote:
>>> how to write a function taking a string parameter, which returns it
>>> after you delete the spaces, punctuation marks, accented characters in
>>> python ?
>> 
>> Looks like you want to remove more characters than you want to keep. In
>> this case I'd decide what characters to keep first, e. g. (assuming
>> Python 3)
> 
> However, with *that* approach (which is different from the OP’s request),
> regular expression matching might turn out to be more efficient:
> 
> -----------------------------------------------------------
> import re
> print("".join(re.findall(r'[a-z]+', "...", re.IGNORECASE)))
> -----------------------------------------------------------
> 
> With the OP’s original request, they may still be the better approach.
> For example:
> 
> ----------------------------------------------------------------------
> import re
> print(re.sub(r'[\s,;.?!ÀÁÈÉÌÍÒÓÙÚÝ]+', "", "...",
>              flags=re.IGNORECASE))
> ----------------------------------------------------------------------
> 
> or
> 
> ----------------------------------------------------------------------
> import re
> print("".join(re.findall(r'[^\s,;.?!ÀÁÈÉÌÍÒÓÙÚÝ]+', "...",
>                          flags=re.IGNORECASE)))
> ----------------------------------------------------------------------
> 
>>>>> import string
>>>>> keep = string.ascii_letters + string.digits
>>>>> keep
>> 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
>> 
>> Now you can iterate over the characters and check if you want to preserve
>> it for each of them:
> 
> The good thing about this part of the approach you suggested is that you
> can build regular expressions from strings, too:
> 
>   keep = '[' + 'a-z' + r'\d' + ']'
>  
>>>>> def clean(s, keep):
>> ...     return "".join(c for c in s if c in keep)
>> ...
> 
> Why would one prefer this over "".join(filter(lambda c: c in keep, s))?

Because it's idiomatic Python and easy to understand if you are coming from 
the imperative version:

buf = []
for c in s:
    if c in keep:
        buf.append(c)
"".join(buf)

Because it uses Python syntax instead of the filter/map/reduce trio.

Because it avoids the extra function call (the lambda) though the speed 
difference is not as big as I expected:

$ python3 -m timeit \
    -s 'import string; keep = string.ascii_letters + string.digits; s = "alphabet soup ä" * 1000' \
    '"".join(filter(lambda c: c in keep, s))'
100 loops, best of 3: 4.66 msec per loop

$ python3 -m timeit \
    -s 'import string; keep = string.ascii_letters + string.digits; s = "alphabet soup ä" * 1000' \
    '"".join(c for c in s if c in keep)'
100 loops, best of 3: 3.11 msec per loop

For reference here is a variant using regular expressions (picked at random, 
feel free to find a faster one):

$ python3 -m timeit \
    -s 'import string, re; keep = string.ascii_letters + string.digits; s = "alphabet soup ä" * 1000; sub=re.compile(r"[^a-zA-Z0-9]+").sub' \
    'sub("", s)'
1000 loops, best of 3: 1.65 msec per loop

And finally str.translate():

$ python3 -m timeit \
    -s 'import string, collections as c; keep = string.ascii_letters + string.digits; s = "alphabet soup ä" * 1000; trans = c.defaultdict(lambda: None, str.maketrans(keep, keep))' \
    's.translate(trans)'
1000 loops, best of 3: 997 usec per loop
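
For anyone who wants to rerun the comparison from a script rather than the shell, here is a sketch with the timeit module. The dictionary layout and labels are illustrative; wrapping each variant in a lambda adds a little call overhead, so the absolute numbers will not match the command-line figures above.

```python
import re
import string
import timeit
from collections import defaultdict

keep = string.ascii_letters + string.digits
s = "alphabet soup ä" * 1000

sub = re.compile(r"[^a-zA-Z0-9]+").sub
trans = defaultdict(lambda: None, str.maketrans(keep, keep))

candidates = {
    "filter+lambda": lambda: "".join(filter(lambda c: c in keep, s)),
    "generator": lambda: "".join(c for c in s if c in keep),
    "re.sub": lambda: sub("", s),
    "translate": lambda: s.translate(trans),
}

# All four variants should agree before their timings are compared.
results = {name: fn() for name, fn in candidates.items()}
assert len(set(results.values())) == 1

for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=20)
    print(f"{name:15s} {seconds / 20 * 1e3:.3f} msec per loop")
```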

>>>>> clean("<alpha> äöü ::42", keep)
>> 'alpha42'
>>>>> clean("<alpha> äöü ::42", string.ascii_letters)
>> 'alpha'
>> 
>> If you are dealing with a lot of text you can make this a bit more
>> efficient with the str.translate() method. Create a mapping that maps all
>> characters that you want to keep to themselves
>> 
>>>>> m = str.maketrans(keep, keep)
>>>>> m[ord("a")]
>> 97
>>>>> m[ord(">")]
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> KeyError: 62
>> 
>> and all characters that you want to discard to None
> 
> Why would creating a *larger* list for *more* operations be *more*
> efficient?
> 

I don't understand the question. If you mean that the trans dict may become 
large -- that depends on the input data. The characters to be deleted are 
lazily added to the defaultdict. For text in European languages the total 
size should stay well below 256 entries. But you are probably aiming at 
something else...
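
The lazy growth can be observed directly. A small sketch, reusing the sample string from earlier in the thread (the variable names before and after are illustrative):

```python
import string
from collections import defaultdict

keep = string.ascii_letters + string.digits
trans = defaultdict(lambda: None, str.maketrans(keep, keep))
before = len(trans)            # one entry per kept character

text = "<alpha> äöü ::42"
text.translate(trans)          # each missing key is stored on first lookup
after = len(trans)

# The table grew only by the distinct discarded characters seen so far.
print(before, after, after - before)
```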
