I can't understand re.sub

Started by	Mr Zaug <matthew.herzog@gmail.com>
First post	2015-11-29 13:36 -0800
Last post	2015-12-01 21:31 +0000
Articles	10 — 5 participants

  I can't understand re.sub Mr Zaug <matthew.herzog@gmail.com> - 2015-11-29 13:36 -0800
    Re: I can't understand re.sub Denis McMahon <denismfmcmahon@gmail.com> - 2015-11-29 22:01 +0000
      Re: I can't understand re.sub Mr Zaug <matthew.herzog@gmail.com> - 2015-11-29 17:20 -0800
    Re: I can't understand re.sub Rick Johnson <rantingrickjohnson@gmail.com> - 2015-11-29 17:12 -0800
      Re: I can't understand re.sub Mr Zaug <matthew.herzog@gmail.com> - 2015-11-29 17:24 -0800
    Re: I can't understand re.sub Erik <python@lucidity.plus.com> - 2015-11-29 21:53 +0000
      Re: I can't understand re.sub Jussi Piitulainen <harvesting@is.invalid> - 2015-11-30 10:51 +0200
        Re: I can't understand re.sub Erik <python@lucidity.plus.com> - 2015-12-01 01:26 +0000
          Re: I can't understand re.sub Jussi Piitulainen <harvesting@is.invalid> - 2015-12-01 07:28 +0200
            Re: I can't understand re.sub Erik <python@lucidity.plus.com> - 2015-12-01 21:31 +0000

#99700 — I can't understand re.sub

From	Mr Zaug <matthew.herzog@gmail.com>
Date	2015-11-29 13:36 -0800
Subject	I can't understand re.sub
Message-ID	<af27abe4-f81e-4d44-a504-c58d9e71986a@googlegroups.com>

I need to use re.sub to replace strings in a text file. I can't seem to understand how to use the re module to this end.

result = re.sub(pattern, repl, string, count=0, flags=0);

I think I understand that pattern is the regex I'm searching for and repl is the thing I want to substitute for whatever pattern finds but what is string?

The items I'm searching for are few and they do not change. They are "CONTENT_PATH", "ENV" and "NNN". These appear on a few lines in a template file. They do not appear together on any line and they only appear once on each line.

This should be simple, right?

#99702

From	Denis McMahon <denismfmcmahon@gmail.com>
Date	2015-11-29 22:01 +0000
Message-ID	<n3fsju$348$2@dont-email.me>
In reply to	#99700

On Sun, 29 Nov 2015 13:36:57 -0800, Mr Zaug wrote:

> result = re.sub(pattern, repl, string, count=0, flags=0);

re.sub works on a string, not on a file.

Read the file to a string, pass it in as the string.
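A minimal sketch of that first option, assuming the template fits comfortably in memory. The function name, the file names and the replacement value are placeholders for illustration, not anything from the original question:

```python
import re
from pathlib import Path


def substitute_file(src, dst, pattern, repl):
    """Read src into one string, run re.sub over it, write the result to dst."""
    text = Path(src).read_text()            # the whole file as a single string
    result = re.sub(pattern, repl, text)    # 'text' is the third argument, "string"
    Path(dst).write_text(result)
    return result


# Hypothetical usage: replace every ENV in template.txt with "pre-prod"
# substitute_file("template.txt", "output.txt", r"ENV", "pre-prod")
```

The third positional argument of re.sub is simply the text to be searched; the function returns a new string and leaves the input untouched.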

Or pre-compile the search pattern(s) and process the file line by line:

import re

# Pre-compiled (pattern, replacement) pairs
patts = [
 (re.compile("axe"), "hammer"),
 (re.compile("cat"), "dog"),
 (re.compile("tree"), "fence")
 ]

with open("input.txt","r") as inf, open("output.txt","w") as ouf:
    # Iterate over every line; a single readline() would only process the first
    for line in inf:
        for patt, repl in patts:
            line = patt.sub(repl, line)
        ouf.write(line)

Not tested, but I think it should do the trick.

Or use a single patt and a replacement func:

import re

patt = re.compile("(axe)|(cat)|(tree)")

def replfunc(match):
    # sub() passes a match object, not a string; group(0) is the matched text
    word = match.group(0)
    if word == 'axe':
        return 'hammer'
    if word == 'cat':
        return 'dog'
    if word == 'tree':
        return 'fence'
    return word

with open("input.txt","r") as inf, open("output.txt","w") as ouf:
    # Process every line, not just the first
    for line in inf:
        line = patt.sub(replfunc, line)
        ouf.write(line)

(also not tested)
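The replacement-function idea can also be driven by a dictionary, which avoids the chain of if statements. This is only a sketch; the mapping reuses the example words above and the names are arbitrary:

```python
import re

# Hypothetical mapping from search word to replacement
replacements = {"axe": "hammer", "cat": "dog", "tree": "fence"}

# Build "axe|cat|tree"; re.escape guards against regex metacharacters in keys
patt = re.compile("|".join(re.escape(word) for word in replacements))


def replfunc(match):
    # match.group(0) is the text that matched; look up its replacement
    return replacements[match.group(0)]


print(patt.sub(replfunc, "the cat sat in the tree"))
# the dog sat in the fence
```

If one key is a prefix of another, the order of the alternatives starts to matter, so this simple form assumes the keys are distinct words.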

-- 
Denis McMahon, denismfmcmahon@gmail.com

#99707

From	Mr Zaug <matthew.herzog@gmail.com>
Date	2015-11-29 17:20 -0800
Message-ID	<58af2723-cd82-4ce5-a6fd-fbe31d4bf692@googlegroups.com>
In reply to	#99702

Thanks. That does help quite a lot.

#99706

From	Rick Johnson <rantingrickjohnson@gmail.com>
Date	2015-11-29 17:12 -0800
Message-ID	<feee81b6-2549-4bfa-b741-35da861a0317@googlegroups.com>
In reply to	#99700

On Sunday, November 29, 2015 at 3:37:34 PM UTC-6, Mr Zaug wrote:

> The items I'm searching for are few and they do not change. They are "CONTENT_PATH", "ENV" and "NNN". These appear on a few lines in a template file. They do not appear together on any line and they only appear once on each line. This should be simple, right?

Yes. In fact so simple that string methods and a "for loop" will suffice. Using regexps for this task would be like using a dump truck to haul a teaspoon of salt.
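A rough sketch of the string-method approach. Only the three placeholder names come from the question; the replacement values and the function name are made up for illustration:

```python
# Hypothetical values; only the three placeholder names come from the question
values = {
    "CONTENT_PATH": "/content/snakeoil",
    "ENV": "pre-prod",
    "NNN": "087",
}


def fill_line(line, values):
    """Replace each placeholder in one line using plain str.replace."""
    for placeholder, value in values.items():
        line = line.replace(placeholder, value)
    return line


print(fill_line("root = CONTENT_PATH\n", values), end="")
# root = /content/snakeoil
```

This assumes no replacement value itself contains one of the placeholder names, since a later pass of the loop would then replace it again.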

#99708

From	Mr Zaug <matthew.herzog@gmail.com>
Date	2015-11-29 17:24 -0800
Message-ID	<967ecfa3-b240-44d6-9a75-bbd9f3865da4@googlegroups.com>
In reply to	#99706

On Sunday, November 29, 2015 at 8:12:25 PM UTC-5, Rick Johnson wrote:
> On Sunday, November 29, 2015 at 3:37:34 PM UTC-6, Mr Zaug wrote:
> 
> > The items I'm searching for are few and they do not change. They are "CONTENT_PATH", "ENV" and "NNN". These appear on a few lines in a template file. They do not appear together on any line and they only appear once on each line. This should be simple, right?
> 
> Yes. In fact so simple that string methods and a "for loop" will suffice. Using regexps for this task would be like using a dump truck to haul a teaspoon of salt.

I rarely get a chance to do any scripting so yeah, I stink at it.

Ideally I would have a script that will spit out a config file such as 087_pre-prod_snakeoil_farm.any and not need to manually rename said output file.
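One way this might look, purely as a sketch. The thread gives only the single example name, so the layout NNN_ENV_site_farm.any, the "site" argument, the out_dir parameter and both function names are assumptions:

```python
import os


# The file-name layout is a guess based on the one example name in the
# thread, 087_pre-prod_snakeoil_farm.any; adjust it to the real convention.
def output_name(nnn, env, site):
    """Build a name like 087_pre-prod_snakeoil_farm.any from its parts."""
    return "{}_{}_{}_farm.any".format(nnn, env, site)


def render(template_path, values, site, out_dir="."):
    """Fill the template and write it straight to the generated file name."""
    with open(template_path) as inf:
        text = inf.read()
    for placeholder, value in values.items():
        text = text.replace(placeholder, value)
    out_path = os.path.join(out_dir, output_name(values["NNN"], values["ENV"], site))
    with open(out_path, "w") as ouf:
        ouf.write(text)
    return out_path
```

Because the output name is derived from the same values that fill the template, no manual rename would be needed afterwards.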

#99728

From	Erik <python@lucidity.plus.com>
Date	2015-11-29 21:53 +0000
Message-ID	<mailman.26.1448872519.14615.python-list@python.org>
In reply to	#99700

On 29/11/15 21:36, Mr Zaug wrote:
> I need to use re.sub to replace strings in a text file.

Do you? Is there any other way?

> result = re.sub(pattern, repl, string, count=0, flags=0);
>
> I think I understand that pattern is the regex I'm searching for and
> repl is the thing I want to substitute for whatever pattern finds but
> what is string?

Where do you think the function gets the string you want to transform from?

> This should be simple, right?

It is. And it could be even simpler if you don't bother with regexes at 
all (if your input is as fixed as you say it is):

 >>> foo = "foo bar baz spam CONTENT_PATH bar spam"
 >>> ' Substitute '.join(foo.split(' CONTENT_PATH ', 1))
'foo bar baz spam Substitute bar spam'
 >>>

E.

#99731

From	Jussi Piitulainen <harvesting@is.invalid>
Date	2015-11-30 10:51 +0200
Message-ID	<lf54mg3eupq.fsf@ling.helsinki.fi>
In reply to	#99728

Erik writes:

> On 29/11/15 21:36, Mr Zaug wrote:
>> This should be simple, right?
>
> It is. And it could be even simpler if you don't bother with regexes
> at all (if your input is as fixed as you say it is):
>
> >>> foo = "foo bar baz spam CONTENT_PATH bar spam"
> >>> ' Substitute '.join(foo.split(' CONTENT_PATH ', 1))
> 'foo bar baz spam Substitute bar spam'

Surely the straight thing to say is:

   >>> foo.replace(' CONTENT_PATH ', ' Substitute ')
   'foo bar baz spam Substitute bar spam'

But there was no guarantee of spaces around the target. If you wish to,
say, replace "spam" in your foo with "REDACTED" but leave it intact in
"May the spammer be prosecuted", a regex might be attractive after all.

#99762

From	Erik <python@lucidity.plus.com>
Date	2015-12-01 01:26 +0000
Message-ID	<mailman.49.1448933226.14615.python-list@python.org>
In reply to	#99731

On 30/11/15 08:51, Jussi Piitulainen wrote:
> Surely the straight thing to say is:
>
>     >>> foo.replace(' CONTENT_PATH ', ' Substitute ')
>     'foo bar baz spam Substitute bar spam'

Not quite the same thing (but yes, with a third argument of 1, it would be).
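The difference only shows up when the target occurs more than once. A small illustration, using a made-up input string with two occurrences:

```python
foo = "a CONTENT_PATH b CONTENT_PATH c"

# split with maxsplit=1 followed by join touches only the first occurrence
via_split = " Substitute ".join(foo.split(" CONTENT_PATH ", 1))
# str.replace without a count replaces every occurrence
replace_all = foo.replace(" CONTENT_PATH ", " Substitute ")
# with a count of 1 it matches the split/join result
replace_one = foo.replace(" CONTENT_PATH ", " Substitute ", 1)

print(via_split == replace_one)   # True
print(via_split == replace_all)   # False
```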

> But there was no guarantee of spaces around the target.

I know. It was just an example to show that there might be an option 
that's not a regex for the specific use indicated. It's up to the OP to 
decide whether they think the spaces (or any other, or no, delimiter) 
would actually be required or useful. Or whether they really prefer a 
regex after all.

> If you wish to,
> say, replace "spam" in your foo with "REDACTED" but leave it intact in
> "May the spammer be prosecuted", a regex might be attractive after all.

But that's not what the OP said they wanted to do. They said everything 
was very fixed - they did not want a general purpose human language text 
processing solution ... ;)

E.

#99768

From	Jussi Piitulainen <harvesting@is.invalid>
Date	2015-12-01 07:28 +0200
Message-ID	<lf5r3j6ka9q.fsf@ling.helsinki.fi>
In reply to	#99762

Erik writes:
> On 30/11/15 08:51, Jussi Piitulainen wrote:
[- -]
>> If you wish to,
>> say, replace "spam" in your foo with "REDACTED" but leave it intact in
>> "May the spammer be prosecuted", a regex might be attractive after all.
>
> But that's not what the OP said they wanted to do. They said
> everything was very fixed - they did not want a general purpose human
> language text processing solution ... ;)

Language processing is not what I had in mind here. Merely this, that
there is some sort of word boundary, be it punctuation, whitespace, or
an end of the string:

   >>> re.sub(r'\bspam\b', '****', 'spamalot spam')
   'spamalot ****'

That's not perfect either, but it's simple and might be somewhat
proportional to the problem.
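To make the contrast with plain string replacement concrete, here is a small sketch. The sample strings are invented, and the last line shows one way in which \b is "not perfect": an underscore counts as a word character, so there is no boundary inside MY_ENV.

```python
import re

text = "May the spammer be prosecuted for spam"

# str.replace has no notion of word boundaries, so it also hits "spammer"
print(text.replace("spam", "REDACTED"))
# May the REDACTEDmer be prosecuted for REDACTED

# \b matches only at a word boundary, so "spammer" is left intact
print(re.sub(r"\bspam\b", "REDACTED", text))
# May the spammer be prosecuted for REDACTED

# Underscore is a word character: the ENV inside MY_ENV is not matched
print(re.sub(r"\bENV\b", "pre-prod", "MY_ENV ENV"))
# MY_ENV pre-prod
```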

A real solution should be aware of the actual structure of those lines,
assuming they follow some defined syntax.

#99819

From	Erik <python@lucidity.plus.com>
Date	2015-12-01 21:31 +0000
Message-ID	<mailman.85.1449005519.14615.python-list@python.org>
In reply to	#99768

On 01/12/15 05:28, Jussi Piitulainen wrote:
> A real solution should be aware of the actual structure of those lines,
> assuming they follow some defined syntax.

I think that we are in violent agreement on this ;)

E.
