Path: csiph.com!eternal-september.org!feeder.eternal-september.org!border1.nntp.ams1.giganews.com!nntp.giganews.com!newsfeed.xs4all.nl!newsfeed8.news.xs4all.nl!nzpost1.xs4all.net!not-for-mail
To: python-list@python.org
From: Peter Otten <__peter__@web.de>
Subject: Re: Reading \n unescaped from a file
Date: Thu, 03 Sep 2015 00:54:12 +0200
Organization: None
References: <55E65909.2080507@medimorphosis.com.au> <55E778C7.7050802@mrabarnett.plus.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8Bit
User-Agent: KNode/4.13.3
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.43.1441234465.8327.python-list@python.org>
Lines: 122
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:95906

MRAB wrote:

> On 2015-09-02 03:03, Rob Hills wrote:
>> Hi,
>>
>> I am developing code (Python 3.4) that transforms text data from one
>> format to another.
>>
>> As part of the process, I had a set of hard-coded str.replace(...)
>> functions that I used to clean up the incoming text into the desired
>> output format, something like this:
>>
>>      dataIn = dataIn.replace('\r', '\\n') # Tidy up linefeeds
>>      dataIn = dataIn.replace('&lt;','<') # Tidy up < character
>>      dataIn = dataIn.replace('&gt;','>') # Tidy up > character
>>      dataIn = dataIn.replace('&#111;','o') # No idea why but lots of these: convert to 'o' character
>>      dataIn = dataIn.replace('&#102;','f') # .. and these: convert to 'f' character
>>      dataIn = dataIn.replace('&#101;','e') # ..  'e'
>>      dataIn = dataIn.replace('&#079;','O') # ..  'O'
>>
> The problem with this approach is that the order of the replacements
> matters. For example, changing '&lt;' to '<' and then '&amp;' to '&'
> can give a different result to changing '&amp;' to '&' and then '&lt;'
> to '<'. If you started with the string '&amp;lt;', then the first order
> would go '&amp;lt;' => '&amp;lt;' => '&lt;', whereas the second order
> would go '&amp;lt;' => '&lt;' => '<'.
> 
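To make the ordering problem concrete, here is a minimal sketch, assuming you are free to replace the chained str.replace() calls with a single regex pass. The name replace_all is made up for illustration. Because every token is matched against the original text in one left-to-right scan, the output of one replacement is never fed into another, so dict order stops mattering.

```python
import re

def replace_all(text, mapping):
    """Apply every replacement in one left-to-right pass (hypothetical helper)."""
    if not mapping:
        return text
    # Try longer tokens first so a token is never shadowed by a shorter prefix.
    tokens = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in tokens))
    return pattern.sub(lambda m: mapping[m.group(0)], text)

s = "&amp;lt;"

# Chained replace(): the result depends on the order of the calls.
print(s.replace("&lt;", "<").replace("&amp;", "&"))  # → &lt;
print(s.replace("&amp;", "&").replace("&lt;", "<"))  # → <

# Single pass: the result is the same whatever the dict order.
print(replace_all(s, {"&amp;": "&", "&lt;": "<"}))   # → &lt;
```

Whether '&lt;' or '<' is the "right" answer for doubly escaped input depends on your data; the sketch only removes the dependence on ordering.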
>> These statements transform my data correctly, but the list of statements
>> grows as I test the data so I thought it made sense to store the
>> replacement mappings in a file, read them into a dict and loop through
>> that to do the cleaning up, like this:
>>
>>          with open(fileName, 'r+t', encoding='utf-8') as mapFile:
>>              for line in mapFile:
>>                  line = line.strip()
>>                  try:
>>                      if (line) and not line.startswith('#'):
>>                          line = line.split('#')[:1][0].strip() # trim any trailing comments
>>                          name, value = line.split('=')
>>                          name = name.strip()
>>                          self.filterMap[name]=value.strip()
>>                  except:
>>                      self.logger.error('exception occurred parsing line [{0}] in file [{1}]'.format(line, fileName))
>>                      raise
>>
>> Elsewhere, I use the following code to do the actual cleaning up:
>>
>>      def filter(self, dataIn):
>>          if dataIn:
>>              for token, replacement in self.filterMap.items():
>>                  dataIn = dataIn.replace(token, replacement)
>>          return dataIn
>>
>>
>> My mapping file contents look like this:
>>
>> \r = \\n
>> â = &quot;
>> &lt; = <
>> &gt; = >
>> &#039; = &apos;
>> &#070; = F
>> &#111; = o
>> &#102; = f
>> &#101; = e
>> &#079; = O
>>
>> This all works "as advertised" */except/* for the '\r' => '\\n'
>> replacement. Debugging the code, I see that my '\r' character is
>> "escaped" to '\\r' and the '\\n' to '\\\\n' when they are read in from
>> the file.
>>
>> I've been googling hard and reading the Python docs, trying to get my
>> head around character encoding, but I just can't figure out how to get
>> these bits of code to do what I want.
>>
>> It seems to me that I need to either:
>>
>>   * change the way I represent '\r' and '\\n' in my mapping file; or
>>   * transform them somehow when I read them in
>>
>> However, I haven't figured out how to do either of these.
>>
> Try ast.literal_eval, although you'd need to make it look like a string
> literal first:
> 
>  >>> import ast
>  >>> line = r'\r = \\n'
>  >>> print(line)
> \r = \\n
>  >>> old, sep, new = line.partition(' = ')
>  >>> print(old)
> \r
>  >>> print(new)
> \\n
>  >>> ast.literal_eval('"%s"' % old)
> '\r'
>  >>> ast.literal_eval('"%s"' % new)
> '\\n'
>  >>>

There's also codecs.decode():

>>> import codecs
>>> codecs.decode(r"\r = \\n", "unicode-escape")
'\r = \\n'
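
A minimal sketch of how this might slot into the loader, assuming each mapping line has the form "old = new" and comments have already been stripped. The name parse_mapping_line is made up for illustration. One caveat: passing a str containing non-ASCII characters straight to codecs.decode(..., "unicode-escape") mangles those characters, so the sketch first encodes to Latin-1 with backslashreplace. That is a workaround I am assuming is acceptable here, not something the original code did.

```python
def decode_escapes(text):
    """Turn backslash escapes such as \\r or \\\\ into the characters they name."""
    # Latin-1 bytes decode to themselves under unicode-escape; anything
    # outside Latin-1 becomes a \uXXXX escape that is decoded again below.
    return text.encode("latin-1", "backslashreplace").decode("unicode-escape")

def parse_mapping_line(line):
    """Split 'old = new' and decode escapes on both sides (hypothetical helper)."""
    old, sep, new = line.partition("=")
    if not sep:
        raise ValueError("expected 'old = new', got %r" % line)
    return decode_escapes(old.strip()), decode_escapes(new.strip())

old, new = parse_mapping_line(r"\r = \\n")
print(repr(old), repr(new))  # → '\r' '\\n'
```

With that, '\r' in the file becomes a real carriage return and '\\n' becomes a backslash followed by 'n', which is what the hard-coded replace() call used.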


> I wouldn't put the &#...; forms into the mappings file (except for the
> &#039; one) because they can all be recognised and done in code
> ('&#070;' is chr(int('070')), for example).

Or 

>>> import html
>>> html.unescape("&lt; &ouml; &#070;")
'< ö F'

Even if you cannot use unescape() directly, you might steal the
implementation.
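
A minimal sketch, assuming it is acceptable to decode every entity in the input. Note that this also turns '&#039;' into a plain apostrophe rather than '&apos;', so it is no drop-in replacement if some entities have to survive. The names clean and extra are made up for illustration; only the replacements that are not HTML entities need to stay in the mapping.

```python
import html

def clean(text, extra=(("\r", "\\n"),)):
    """Decode HTML entities, then apply the remaining plain replacements."""
    # unescape() works in a single pass, so '&amp;lt;' becomes '&lt;', not '<'.
    text = html.unescape(text)
    for old, new in extra:
        text = text.replace(old, new)
    return text

print(repr(clean("&lt;b&gt; &#111;&#102; &#079;\r")))  # → '<b> of O\\n'
```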