Reply-To: rhills@medimorphosis.com.au
Subject: Re: Reading \n unescaped from a file
References: <55E65909.2080507@medimorphosis.com.au> <55E8078F.7090502@bluewin.ch>
To: python-list@python.org
From: Rob Hills <rhills@medimorphosis.com.au>
Date: Fri, 4 Sep 2015 00:12:27 +0800
In-Reply-To: <55E8078F.7090502@bluewin.ch>
Newsgroups: comp.lang.python
Message-ID: <mailman.81.1441296755.8327.python-list@python.org>

Hi Friedrich,

On 03/09/15 16:40, Friedrich Rentsch wrote:
>
> On 09/02/2015 04:03 AM, Rob Hills wrote:
>> Hi,
>>
>> I am developing code (Python 3.4) that transforms text data from one
>> format to another.
>>
>> As part of the process, I had a set of hard-coded str.replace(...)
>> functions that I used to clean up the incoming text into the desired
>> output format, something like this:
>>
>>      dataIn = dataIn.replace('\r', '\\n') # Tidy up linefeeds
>>      dataIn = dataIn.replace('&lt;','<') # Tidy up < character
>>      dataIn = dataIn.replace('&gt;','>') # Tidy up > character
>>      dataIn = dataIn.replace('&#111;','o') # No idea why but lots of these: convert to 'o' character
>>      dataIn = dataIn.replace('&#102;','f') # .. and these: convert to 'f' character
>>      dataIn = dataIn.replace('&#101;','e') # ..  'e'
>>      dataIn = dataIn.replace('&#079;','O') # ..  'O'
>>
>> These statements transform my data correctly, but the list of statements
>> grows as I test the data so I thought it made sense to store the
>> replacement mappings in a file, read them into a dict and loop through
>> that to do the cleaning up, like this:
>>
>>          with open(fileName, 'r+t', encoding='utf-8') as mapFile:
>>              for line in mapFile:
>>                  line = line.strip()
>>                  try:
>>                      if (line) and not line.startswith('#'):
>>                          line = line.split('#')[:1][0].strip() # trim any trailing comments
>>                          name, value = line.split('=')
>>                          name = name.strip()
>>                          self.filterMap[name]=value.strip()
>>                  except:
>>                      self.logger.error('exception occurred parsing line [{0}] in file [{1}]'.format(line, fileName))
>>                      raise
>>
>> Elsewhere, I use the following code to do the actual cleaning up:
>>
>>      def filter(self, dataIn):
>>          if dataIn:
>>              for token, replacement in self.filterMap.items():
>>                  dataIn =3D dataIn.replace(token, replacement)
>>          return dataIn
>>
>>
>> My mapping file contents look like this:
>>
>> \r = \\n
>> â€œ = &quot;
>> &lt; = <
>> &gt; = >
>> &#039; = &apos;
>> &#070; = F
>> &#111; = o
>> &#102; = f
>> &#101; = e
>> &#079; = O
>>
>> This all works "as advertised" */except/* for the '\r' => '\\n'
>> replacement. Debugging the code, I see that my '\r' character is
>> "escaped" to '\\r' and the '\\n' to '\\\\n' when they are read in from
>> the file.
>>
>> I've been googling hard and reading the Python docs, trying to get my
>> head around character encoding, but I just can't figure out how to get
>> these bits of code to do what I want.
>>
>> It seems to me that I need to either:
>>
>>    * change the way I represent '\r' and '\\n' in my mapping file; or
>>    * transform them somehow when I read them in
>>
>> However, I haven't figured out how to do either of these.
>>
>> TIA,
>>
>>
>
> I have had this problem too and can propose a solution ready to run
> out of my toolbox:
>
>
> import re
> from functools import reduce  # also available as a builtin on Python 2
>
> class editor:
>
>     def compile (self, replacements):
>         targets, substitutes = zip (*replacements)
>         re_targets = [re.escape (item) for item in targets]
>         re_targets.sort (reverse = True)
>         self.targets_set = set (targets)
>         self.table = dict (replacements)
>         regex_string = '|'.join (re_targets)
>         self.regex = re.compile (regex_string, re.DOTALL)
>
>     def edit (self, text, eat = False):
>         hits = self.regex.findall (text)
>         nohits = self.regex.split (text)
>         valid_hits = set (hits) & self.targets_set  # Ignore targets with illegal re modifiers.
>         if valid_hits:
>             substitutes = [self.table [item] for item in hits if item in valid_hits] + []  # Make lengths equal for zip to work right
>             if eat:
>                 output = ''.join (substitutes)
>             else:
>                 zipped = zip (nohits, substitutes)
>                 output = ''.join (list (reduce (lambda a, b: a + b, [zipped][0]))) + nohits [-1]
>         else:
>             if eat:
>                 output = ''
>             else:
>                 output = text  # return the text unchanged when nothing matched
>         return output
>
> >>> substitutions = (
>     ('\r', '\n'),
>     ('&lt;', '<'),
>     ('&gt;', '>'),
>     ('&#111;', 'o'),
>     ('&#102;', 'f'),
>     ('&#101;', 'e'),
>     ('&#079;', 'O'),
>     )
>
> Order doesn't matter. Add new ones at the end.
>
> >>> e = editor ()
> >>> e.compile (substitutions)
>
> A simple way of testing is running the substitutions through the editor
>
> >>> print e.edit (repr (substitutions))
> (('\r', '\n'), ('<', '<'), ('>', '>'), ('o', 'o'), ('f', 'f'), ('e',
> 'e'), ('O', 'O'))
>
> The escapes need to be tested separately
>
> >>> print e.edit ('abc\rdef')
> abc
> def
>
> Note: This editor's compiler compiles the substitution list to a
> regular expression which the editor uses to find all matches in the
> text passed to edit. There has got to be a limit to the size of a text
> which a regular expression can handle. I don't know what this limit
> is. To be on the safe side, edit a large text line by line or at least
> in sensible chunks.
>
> Frederic
>

Thanks for the suggestion.  I had originally done a simple set of
hard-coded str.replace() calls, which worked fine and are fast enough
for me not to have to delve into the complexity and obscurity of regex.

I had also contemplated simply declaring my replacement dict in its own
.py file and then importing it.
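
A minimal sketch of that alternative, for illustration only. The module
name replacements.py and the names FILTER_MAP and clean are hypothetical;
the entries are taken from the hard-coded replace calls quoted above.
Because the dict is Python source, the compiler interprets the escapes,
so '\r' really is a carriage return and '\\n' is a backslash followed by
the letter n.

```python
# These lines would live in a hypothetical module, e.g. replacements.py.
FILTER_MAP = {
    '\r': '\\n',    # carriage return -> literal backslash + 'n'
    '&lt;': '<',
    '&gt;': '>',
    '&#111;': 'o',
    '&#102;': 'f',
    '&#101;': 'e',
    '&#079;': 'O',
}

# The calling code would then start with:
#     from replacements import FILTER_MAP


def clean(data_in, filter_map=FILTER_MAP):
    """Apply every token -> replacement pair in turn, as in filter() above."""
    if data_in:
        for token, replacement in filter_map.items():
            data_in = data_in.replace(token, replacement)
    return data_in


print(repr(clean('&lt;f&#111;&#111;&gt;\r')))  # prints '<foo>\\n'
```

The trade-off is that the mapping is no longer plain data: editing it
means editing code.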

I ended up stubbornly pursuing the idea of loading everything from a
text file just because I didn't understand why it wasn't working.
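
For the record, a hedged sketch of one way the text-file approach can be
made to work. A backslash followed by r in a text file is two ordinary
characters, not a carriage return, so the escapes have to be decoded
explicitly after reading. The helper names unescape and parse_mapping are
hypothetical, and using the unicode_escape codec is an assumption on my
part, not something settled in this thread. The sketch also assumes that
only whole-line comments are allowed, since tokens such as &#111; contain
a '#' themselves.

```python
import codecs


def unescape(text):
    """Decode backslash escapes that were typed literally in a text file."""
    # unicode_escape decodes bytes; the latin-1/backslashreplace round trip
    # keeps non-ASCII characters intact while the escapes are processed.
    raw = text.encode('latin-1', 'backslashreplace')
    return codecs.decode(raw, 'unicode_escape')


def parse_mapping(lines):
    """Build a token -> replacement dict from 'token = replacement' lines."""
    mapping = {}
    for raw_line in lines:
        line = raw_line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and whole-line comments
        name, sep, value = line.partition('=')
        if not sep:
            raise ValueError('no "=" in mapping line [{0}]'.format(line))
        mapping[unescape(name.strip())] = unescape(value.strip())
    return mapping


# Stand-in for the lines of the mapping file (raw string: two real characters).
sample_lines = [r'\r = \\n', '&lt; = <', '&#111; = o']
filter_map = parse_mapping(sample_lines)

text = 'f&#111;o\r&lt;bar'
for token, replacement in filter_map.items():
    text = text.replace(token, replacement)
print(repr(text))  # prints 'foo\\n<bar'
```

With a real file, parse_mapping could be handed the open file object
directly, since it only iterates over lines.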

Cheers,

-- 
Rob Hills
Waikiki, Western Australia