From: Peter Otten <__peter__@web.de>
Subject: Re: Reading \n unescaped from a file
Date: Thu, 03 Sep 2015 00:54:12 +0200
Newsgroups: comp.lang.python

MRAB wrote:

> On 2015-09-02 03:03, Rob Hills wrote:
>> Hi,
>>
>> I am developing code (Python 3.4) that transforms text data from one
>> format to another.
>>
>> As part of the process, I had a set of hard-coded str.replace(...)
>> functions that I used to clean up the incoming text into the desired
>> output format, something like this:
>>
>>     dataIn = dataIn.replace('\r', '\\n') # Tidy up linefeeds
>>     dataIn = dataIn.replace('&lt;','<') # Tidy up < character
>>     dataIn = dataIn.replace('&gt;','>') # Tidy up > character
>>     dataIn = dataIn.replace('&#111;','o') # No idea why but lots of these: convert to 'o' character
>>     dataIn = dataIn.replace('&#102;','f') # .. and these: convert to 'f' character
>>     dataIn = dataIn.replace('&#101;','e') # .. 'e'
>>     dataIn = dataIn.replace('&#079;','O') # .. 'O'
>>
> The problem with this approach is that the order of the replacements
> matters. For example, changing '&lt;' to '<' and then '&amp;' to '&'
> can give a different result to changing '&amp;' to '&' and then '&lt;'
> to '<'. If you started with the string '&amp;lt;', then the first order
> would go '&amp;lt;' => '&amp;lt;' => '&lt;', whereas the second order
> would go '&amp;lt;' => '&lt;' => '<'.
>
>> These statements transform my data correctly, but the list of statements
>> grows as I test the data so I thought it made sense to store the
>> replacement mappings in a file, read them into a dict and loop through
>> that to do the cleaning up, like this:
>>
>>     with open(fileName, 'r+t', encoding='utf-8') as mapFile:
>>         for line in mapFile:
>>             line = line.strip()
>>             try:
>>                 if (line) and not line.startswith('#'):
>>                     line = line.split('#')[:1][0].strip() # trim any trailing comments
>>                     name, value = line.split('=')
>>                     name = name.strip()
>>                     self.filterMap[name]=value.strip()
>>             except:
>>                 self.logger.error('exception occurred parsing line [{0}] in file [{1}]'.format(line, fileName))
>>                 raise
>>
>> Elsewhere, I use the following code to do the actual cleaning up:
>>
>>     def filter(self, dataIn):
>>         if dataIn:
>>             for token, replacement in self.filterMap.items():
>>                 dataIn = dataIn.replace(token, replacement)
>>         return dataIn
>>
>>
>> My mapping file contents look like this:
>>
>>     \r = \\n
>>     “ = "
>>     &lt; = <
>>     &gt; = >
>>     ' = '
>>     &#070; = F
>>     &#111; = o
>>     &#102; = f
>>     &#101; = e
>>     &#079; = O
>>
>> This all works "as advertised" */except/* for the '\r' => '\\n'
>> replacement. Debugging the code, I see that my '\r' character is
>> "escaped" to '\\r' and the '\\n' to '\\\\n' when they are read in from
>> the file.
>>
>> I've been googling hard and reading the Python docs, trying to get my
>> head around character encoding, but I just can't figure out how to get
>> these bits of code to do what I want.
>>
>> It seems to me that I need to either:
>>
>>   * change the way I represent '\r' and '\\n' in my mapping file; or
>>   * transform them somehow when I read them in
>>
>> However, I haven't figured out how to do either of these.
>>
> Try ast.literal_eval, although you'd need to make it look like a string
> literal first:
>
> >>> import ast
> >>> line = r'\r = \\n'
> >>> print(line)
> \r = \\n
> >>> old, sep, new = line.partition(' = ')
> >>> print(old)
> \r
> >>> print(new)
> \\n
> >>> ast.literal_eval('"%s"' % old)
> '\r'
> >>> ast.literal_eval('"%s"' % new)
> '\\n'
> >>>

There's also codecs.decode():

>>> import codecs
>>> codecs.decode(r"\r = \\n", "unicode-escape")
'\r = \\n'

> I wouldn't put the &#...; forms into the mappings file (except for the
> ' one) because they can all be recognised and done in code
> ('&#070;' is chr(int('070')), for example).

Or

>>> import html
>>> html.unescape("&lt; &ouml; &#070;")
'< ö F'

Even if you cannot use unescape() directly, you might steal the implementation.
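
As a rough sketch of how the two suggestions above might be combined (this is not code from the thread): the mapping file would hold only the backslash-escape rules, decoded with codecs.decode(), while character references are handled in a single pass by html.unescape(), which also sidesteps the ordering problem MRAB describes. The names parse_mapping_line, clean and mapping_lines are hypothetical, and the sketch assumes ASCII-only mapping lines, since non-ASCII characters would not survive the "unicode-escape" decoding unchanged.

```python
import codecs
import html


def parse_mapping_line(line):
    """Return (old, new) for an 'old = new' line, or None for blanks/comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    old, sep, new = line.partition(" = ")
    if not sep:
        raise ValueError("no ' = ' separator in line: %r" % line)
    # Turn backslash escapes such as \r or \\n into the characters they denote.
    # Assumption: the line is pure ASCII; unicode-escape mangles other text.
    return (codecs.decode(old, "unicode-escape"),
            codecs.decode(new, "unicode-escape"))


def clean(text, mapping):
    """Decode character references once, then apply the mapping rules."""
    # One pass over the text, so '&amp;lt;' becomes '&lt;' and not '<'.
    text = html.unescape(text)
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text


# Hypothetical mapping file contents: backslash-escape rules only.
mapping_lines = [
    "# character references are left to html.unescape()",
    r"\r = \\n",
]
mapping = dict(
    pair for pair in map(parse_mapping_line, mapping_lines) if pair is not None
)

print(repr(mapping))
print(repr(clean("a &lt; b &#070;oo\r", mapping)))
```

Keeping the entity handling out of the mapping file means the file only needs to grow when a genuinely new non-entity rule turns up.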