
replace only full words

Started by	cerr <ron.eggler@gmail.com>
First post	2013-09-28 09:11 -0700
Last post	2013-09-28 20:37 +0300
Articles	9 — 4 participants


  replace only full words cerr <ron.eggler@gmail.com> - 2013-09-28 09:11 -0700
    Re: replace only full words Tim Chase <python.list@tim.thechases.com> - 2013-09-28 11:54 -0500
      Re: replace only full words cerr <ron.eggler@gmail.com> - 2013-09-28 10:43 -0700
        Re: replace only full words MRAB <python@mrabarnett.plus.com> - 2013-09-28 19:07 +0100
          Re: replace only full words cerr <ron.eggler@gmail.com> - 2013-09-28 11:25 -0700
        Re: replace only full words Tim Chase <python.list@tim.thechases.com> - 2013-09-28 13:17 -0500
          Re: replace only full words cerr <ron.eggler@gmail.com> - 2013-09-28 11:25 -0700
    Re: replace only full words MRAB <python@mrabarnett.plus.com> - 2013-09-28 18:00 +0100
      Re: replace only full words Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2013-09-28 20:37 +0300

#54948 — replace only full words

From	cerr <ron.eggler@gmail.com>
Date	2013-09-28 09:11 -0700
Subject	replace only full words
Message-ID	<bd024ecf-2428-4d6a-bc0c-163112b31842@googlegroups.com>

Hi,

I have a list of sentences and a list of words. Every full word that appears within a sentence shall be wrapped as <WORD>, i.e. "I drink in the house." would become "I <drink> in the <house>." (and not "I <d<rink> in the <house>."). I have attempted it like this:
  for sentence in sentences:
    for noun in nouns:
      if " "+noun+" " in sentence or " "+noun+"?" in sentence or " "+noun+"!" in sentence or " "+noun+"." in sentence:
        sentence = sentence.replace(noun, '<' + noun + '>')

    print(sentence)

But what if the word is at the beginning of a sentence? I also don't like the approach of hard-coding word terminations. Also, is there a way to make it faster?

Thanks


#54956

From	Tim Chase <python.list@tim.thechases.com>
Date	2013-09-28 11:54 -0500
Message-ID	<mailman.422.1380387171.18130.python-list@python.org>
In reply to	#54948

On 2013-09-28 09:11, cerr wrote:
> I have a list of sentences and a list of words. Every full word
> that appears within sentence shall be extended by <WORD> i.e. "I
> drink in the house." Would become "I <drink> in the <house>." (and
> not "I <d<rink> in the <house>.")

This is a good place to reach for regular expressions.  They come with
an "ensure there is a word-boundary here" token, so you can do
something like the code at the (way) bottom of this email.  I've
pushed it off the bottom in case you want to try using regexps on
your own first.  Or, if this is homework, to at least make you work a
*little* :-)

> Also, is there a way to make it faster?

The code below should do the processing in roughly O(n) time as it
only makes one pass through the data and does O(1) lookups into your
set of nouns.  I included code in the regexp to roughly find
contractions and hyphenated words.  Your original code grows slower
as your list of nouns grows bigger and also suffers from
multiple-replacement issues (if you have the noun-list of ["drink",
"rink"], you'll get results that you likely don't want).

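As a rough illustration of the multiple-replacement issue, here is a
minimal sketch.  The sample sentence is made up for the example; only
the noun list ["drink", "rink"] comes from the paragraph above.

```python
# Illustration only: chained str.replace() calls with overlapping nouns.
nouns = ["drink", "rink"]
sentence = "I drink at the rink."  # hypothetical sample sentence

for noun in nouns:
    # replace() matches substrings, so on the second pass "rink" is
    # also found inside the already-tagged "<drink>".
    sentence = sentence.replace(noun, "<" + noun + ">")

print(sentence)  # I <d<rink>> at the <rink>.
```
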
My code doesn't handle case differences, but you should be able to
normalize both the list of nouns and the word you're testing in the
"modify()" function so that it finds "Drink" as well as "drink".

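One possible way to do that normalization is sketched below.  It
assumes the same word pattern as the code at the bottom of this
email; the names word_re and result and the sample sentence are made
up for the example.

```python
import re

# Same word pattern as the code at the bottom of this email, written
# on one line.
word_re = re.compile(r"\b\w+(?:[-']\w+)?\b")

# Assumption: the noun set is stored lower-cased.
nouns = {n.lower() for n in ["drink", "house"]}

def modify(matchobj):
    word = matchobj.group(0)
    # Compare in lower case, but return the word with its original casing.
    if word.lower() in nouns:
        return "<%s>" % word
    return word

result = word_re.sub(modify, "Drink in the house, or drink outside.")
print(result)  # <Drink> in the <house>, or <drink> outside.
```
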
Also, note that some words serve both as nouns and other parts of
speech, e.g. "It's kind of you to house me for the weekend and drink
tea with me."

-tkc

import re

r = re.compile(r"""
  \b    # assert a word boundary
  \w+   # 1+ word characters
  (?:   # a group
   [-']  # a dash or apostrophe
   \w+   # followed by 1+ word characters
   )?    # make the group optional (0 or 1 instances)
  \b    # assert a word boundary here
  """, re.VERBOSE)

nouns = set([
  "drink",
  "house",
  ])

def modify(matchobj):
  word = matchobj.group(0)
  if word in nouns:
    return "<%s>" % word
  else:
    return word

print(r.sub(modify, "I drink in the house"))  # I <drink> in the <house>


#54962

From	cerr <ron.eggler@gmail.com>
Date	2013-09-28 10:43 -0700
Message-ID	<adc7c4b1-b769-465e-9b16-f5c29371c1b0@googlegroups.com>
In reply to	#54956

On Saturday, September 28, 2013 4:54:35 PM UTC, Tim Chase wrote:
> [snip]

Great, only I don't have the re module on my system.... :(


#54965

From	MRAB <python@mrabarnett.plus.com>
Date	2013-09-28 19:07 +0100
Message-ID	<mailman.426.1380391624.18130.python-list@python.org>
In reply to	#54962

On 28/09/2013 18:43, cerr wrote:
[snip]
> Great, only I don't have the re module on my system.... :(
>
Really? It's part of Python's standard distribution.


#54968

From	cerr <ron.eggler@gmail.com>
Date	2013-09-28 11:25 -0700
Message-ID	<59a2f7d2-0966-4406-83c7-9cf8af563893@googlegroups.com>
In reply to	#54965

On Saturday, September 28, 2013 11:07:11 AM UTC-7, MRAB wrote:
> On 28/09/2013 18:43, cerr wrote:
> [snip]
> > Great, only I don't have the re module on my system.... :(
> >
> Really? It's part of Python's standard distribution.

Oh no, sorry, misinformation, I DO have the re module available!!! All good!


#54966

From	Tim Chase <python.list@tim.thechases.com>
Date	2013-09-28 13:17 -0500
Message-ID	<mailman.427.1380392135.18130.python-list@python.org>
In reply to	#54962

[mercy, you could have trimmed down that reply]

On 2013-09-28 10:43, cerr wrote:
> On Saturday, September 28, 2013 4:54:35 PM UTC, Tim Chase wrote:
>> import re
> 
> Great, only I don't have the re module on my system.... :(

Um, it's a standard Python library.  You sure about that?

  http://docs.python.org/2/library/re.html

-tkc


#54969

From	cerr <ron.eggler@gmail.com>
Date	2013-09-28 11:25 -0700
Message-ID	<eed858c9-98fd-4f38-807f-22a6059cc8ec@googlegroups.com>
In reply to	#54966

On Saturday, September 28, 2013 11:17:19 AM UTC-7, Tim Chase wrote:
> [mercy, you could have trimmed down that reply]
>
> On 2013-09-28 10:43, cerr wrote:
> > On Saturday, September 28, 2013 4:54:35 PM UTC, Tim Chase wrote:
> >> import re
> >
> > Great, only I don't have the re module on my system.... :(
>
> Um, it's a standard Python library.  You sure about that?
>
>   http://docs.python.org/2/library/re.html

Oh no, sorry, misinformation, I DO have the re module available!!! All good!


#54957

From	MRAB <python@mrabarnett.plus.com>
Date	2013-09-28 18:00 +0100
Message-ID	<mailman.423.1380387623.18130.python-list@python.org>
In reply to	#54948

On 28/09/2013 17:11, cerr wrote:
> Hi,
>
> I have a list of sentences and a list of words. Every full word that appears within sentence shall be extended by <WORD> i.e. "I drink in the house." Would become "I <drink> in the <house>." (and not "I <d<rink> in the <house>.")I have attempted it like this:
>    for sentence in sentences:
>      for noun in nouns:
>        if " "+noun+" " in sentence or " "+noun+"?" in sentence or " "+noun+"!" in sentence or " "+noun+"." in sentence:
> 	sentence = sentence.replace(noun, '<' + noun + '>')
>
>      print(sentence)
>
> but what if The word is in the beginning of a sentence and I also don't like the approach using defined word terminations. Also, is there a way to make it faster?
>
It sounds like a regex problem to me:

import re

nouns = ["drink", "house"]

pattern = re.compile(r"\b(" + "|".join(nouns) + r")\b")

for sentence in sentences:
     sentence = pattern.sub(r"<\g<0>>", sentence)
     print(sentence)
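
A slightly more defensive variant of the same idea, as a sketch only:
re.escape guards against nouns that contain regex metacharacters, and
re.IGNORECASE also catches a capitalized word at the start of a
sentence.  The sample sentences and the name tagged are made up for
the example.

```python
import re

nouns = ["drink", "house"]

# Hypothetical sample data; the thread never shows the real list.
sentences = [
    "House rules: I drink in the house.",
    "Drinks are in the greenhouse.",
]

# Escape each noun before joining, and match case-insensitively.
pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, nouns)) + r")\b",
    re.IGNORECASE,
)

# \g<0> refers to the whole match, so the original casing is kept.
tagged = [pattern.sub(r"<\g<0>>", s) for s in sentences]

for sentence in tagged:
    print(sentence)
# <House> rules: I <drink> in the <house>.
# Drinks are in the greenhouse.
```

The second sentence comes back unchanged because \b requires a word
boundary on both sides, so neither "Drinks" nor "greenhouse" counts
as a full-word match.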


#54961

From	Jussi Piitulainen <jpiitula@ling.helsinki.fi>
Date	2013-09-28 20:37 +0300
Message-ID	<qotpprsoniq.fsf@ruuvi.it.helsinki.fi>
In reply to	#54957

MRAB writes:

> On 28/09/2013 17:11, cerr wrote:
> > Hi,
> >
> > I have a list of sentences and a list of words. Every full word
> > that appears within sentence shall be extended by <WORD> i.e. "I
> > drink in the house." Would become "I <drink> in the <house>." (and
> > not "I <d<rink> in the <house>.")I have attempted it like this:
>
> >    for sentence in sentences:
> >      for noun in nouns:
> >        if " "+noun+" " in sentence or " "+noun+"?" in sentence or " "+noun+"!" in sentence or " "+noun+"." in sentence:
> > 	sentence = sentence.replace(noun, '<' + noun + '>')
> >
> >      print(sentence)
> >
> > but what if The word is in the beginning of a sentence and I also
> > don't like the approach using defined word terminations. Also, is
> > there a way to make it faster?
> >
> It sounds like a regex problem to me:
> 
> import re
> 
> nouns = ["drink", "house"]
> 
> pattern = re.compile(r"\b(" + "|".join(nouns) + r")\b")
> 
> for sentence in sentences:
>      sentence = pattern.sub(r"<\g<0>>", sentence)
>      print(sentence)

Maybe tokenize by a regex and then join the replacements of all
tokens:

import re

def substitute(token):
   if isfullword(token.lower()):
      return '<{}>'.format(token)
   else:
      return token

def tokenize(sentence):
   return re.split(r'(\W)', sentence) 

sentence = 'This is, like, a test.'

tokens = map(substitute, tokenize(sentence))
sentence = ''.join(tokens)

For better results, both tokenization and substitution need to depend
on context. Doing some of that should be an interesting exercise.
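
The snippet above leaves isfullword undefined.  A runnable sketch,
assuming it is nothing more than a lookup in a set of nouns (the set
NOUNS, the sample sentence and the name result are made up for the
example):

```python
import re

# Hypothetical word list standing in for the real noun list.
NOUNS = {"drink", "house"}

def isfullword(token):
    # Assumption: a "full word" is any token found in the noun set.
    return token in NOUNS

def substitute(token):
    if isfullword(token.lower()):
        return '<{}>'.format(token)
    return token

def tokenize(sentence):
    # The capturing group makes re.split keep the separators, so
    # ''.join() can rebuild the sentence exactly.
    return re.split(r'(\W)', sentence)

sentence = 'Drink in the house, not the greenhouse.'
result = ''.join(substitute(t) for t in tokenize(sentence))
print(result)  # <Drink> in the <house>, not the greenhouse.
```
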
