From: Rob Hills
Reply-To: rhills@medimorphosis.com.au
To: python-list@python.org
Newsgroups: comp.lang.python
Subject: Re: Reading \n unescaped from a file
Date: Fri, 4 Sep 2015 00:12:27 +0800
References: <55E65909.2080507@medimorphosis.com.au> <55E8078F.7090502@bluewin.ch>
In-Reply-To: <55E8078F.7090502@bluewin.ch>

Hi Friedrich,

On 03/09/15 16:40, Friedrich Rentsch wrote:
>
> On 09/02/2015 04:03 AM, Rob Hills wrote:
>> Hi,
>>
>> I am developing code (Python 3.4) that transforms text data from one
>> format to another.
>>
>> As part of the process, I had a set of hard-coded str.replace(...)
>> functions that I used to clean up the incoming text into the desired
>> output format, something like this:
>>
>>     dataIn = dataIn.replace('\r', '\\n') # Tidy up linefeeds
>>     dataIn = dataIn.replace('<','<') # Tidy up < character
>>     dataIn = dataIn.replace('>','>') # Tidy up > character
>>     dataIn = dataIn.replace('o','o') # No idea why but lots of these: convert to 'o' character
>>     dataIn = dataIn.replace('f','f') # .. and these: convert to 'f' character
>>     dataIn = dataIn.replace('e','e') # .. 'e'
>>     dataIn = dataIn.replace('O','O') # .. 'O'
>>
>> These statements transform my data correctly, but the list of statements
>> grows as I test the data so I thought it made sense to store the
>> replacement mappings in a file, read them into a dict and loop through
>> that to do the cleaning up, like this:
>>
>>     with open(fileName, 'r+t', encoding='utf-8') as mapFile:
>>         for line in mapFile:
>>             line = line.strip()
>>             try:
>>                 if (line) and not line.startswith('#'):
>>                     line = line.split('#')[:1][0].strip() # trim any trailing comments
>>                     name, value = line.split('=')
>>                     name = name.strip()
>>                     self.filterMap[name]=value.strip()
>>             except:
>>                 self.logger.error('exception occurred parsing line [{0}] in file [{1}]'.format(line, fileName))
>>                 raise
>>
>> Elsewhere, I use the following code to do the actual cleaning up:
>>
>>     def filter(self, dataIn):
>>         if dataIn:
>>             for token, replacement in self.filterMap.items():
>>                 dataIn = dataIn.replace(token, replacement)
>>         return dataIn
>>
>>
>> My mapping file contents look like this:
>>
>>     \r = \\n
>>     â€œ = "
>>     < = <
>>     > = >
>>     ' = '
>>     F = F
>>     o = o
>>     f = f
>>     e = e
>>     O = O
>>
>> This all works "as advertised" *except* for the '\r' => '\\n'
>> replacement. Debugging the code, I see that my '\r' character is
>> "escaped" to '\\r' and the '\\n' to '\\\\n' when they are read in from
>> the file.
>>
>> I've been googling hard and reading the Python docs, trying to get my
>> head around character encoding, but I just can't figure out how to get
>> these bits of code to do what I want.
>>
>> It seems to me that I need to either:
>>
>>   * change the way I represent '\r' and '\\n' in my mapping file; or
>>   * transform them somehow when I read them in
>>
>> However, I haven't figured out how to do either of these.
>>
>> TIA,
>>
>>
>
> I have had this problem too and can propose a solution ready to run
> out of my toolbox:
>
>
> import re
> from functools import reduce  # reduce is no longer a builtin on Python 3
>
> class editor:
>
>     def compile (self, replacements):
>         targets, substitutes = zip (*replacements)
>         re_targets = [re.escape (item) for item in targets]
>         re_targets.sort (reverse = True)
>         self.targets_set = set (targets)
>         self.table = dict (replacements)
>         regex_string = '|'.join (re_targets)
>         self.regex = re.compile (regex_string, re.DOTALL)
>
>     def edit (self, text, eat = False):
>         hits = self.regex.findall (text)
>         nohits = self.regex.split (text)
>         valid_hits = set (hits) & self.targets_set  # Ignore targets with illegal re modifiers.
>         if valid_hits:
>             substitutes = [self.table [item] for item in hits if item in valid_hits] + []  # Make lengths equal for zip to work right
>             if eat:
>                 output = ''.join (substitutes)
>             else:
>                 zipped = zip (nohits, substitutes)
>                 output = ''.join (list (reduce (lambda a, b: a + b, [zipped][0]))) + nohits [-1]
>         else:
>             if eat:
>                 output = ''
>             else:
>                 output = text  # return the input unchanged when nothing matched
>         return output
>
> >>> substitutions = (
>     ('\r', '\n'),
>     ('<', '<'),
>     ('>', '>'),
>     ('o', 'o'),
>     ('f', 'f'),
>     ('e', 'e'),
>     ('O', 'O'),
> )
>
> Order doesn't matter. Add new ones at the end.
>
> >>> e = editor ()
> >>> e.compile (substitutions)
>
> A simple way of testing is running the substitutions through the editor
>
> >>> print (e.edit (repr (substitutions)))
> (('\r', '\n'), ('<', '<'), ('>', '>'), ('o', 'o'), ('f', 'f'), ('e', 'e'), ('O', 'O'))
>
> The escapes need to be tested separately
>
> >>> print (e.edit ('abc\rdef'))
> abc
> def
>
> Note: This editor's compiler compiles the substitution list to a
> regular expression which the editor uses to find all matches in the
> text passed to edit. There has got to be a limit to the size of a text
> which a regular expression can handle. I don't know what this limit
> is. To be on the safe side, edit a large text line by line or at least
> in sensible chunks.
>
> Frederic
>

Thanks for the suggestion. I had originally done a simple set of
hard-coded str.replace() calls, which worked fine and are fast enough
for me not to have to delve into the complexity and obscurity of regex.

I had also contemplated simply declaring my replacement dict in its own
.py file and then importing it.

I ended up stubbornly pursuing the idea of loading everything from a
text file just because I didn't understand why it wasn't working.

Cheers,

-- 
Rob Hills
Waikiki, Western Australia
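
For the archives, a minimal sketch of the second option from the original question (transforming the entries as the mapping file is read). A text file holds the two characters backslash and r, not a carriage return, so the loader has to translate them itself. This assumes only the escapes \r, \n, \t and \\ need supporting; unescape and parse_mapping are hypothetical names, not code posted in this thread:

```python
import re

# Hypothetical helpers, not part of the code posted in this thread.
# Keys are the two-character sequences as they appear in the text file;
# values are the characters they stand for.
_ESCAPES = {'\\r': '\r', '\\n': '\n', '\\t': '\t', '\\\\': '\\'}
_ESCAPE_RE = re.compile(r'\\[rnt\\]')


def unescape(token):
    # Replace each supported backslash sequence, scanning left to right,
    # so a doubled backslash is consumed before the character after it.
    return _ESCAPE_RE.sub(lambda m: _ESCAPES[m.group(0)], token)


def parse_mapping(lines):
    # Build a {token: replacement} dict from "token = replacement" lines.
    mapping = {}
    for line in lines:
        line = line.split('#')[0].strip()  # drop comments and blank lines
        if not line:
            continue
        name, value = line.split('=', 1)
        mapping[unescape(name.strip())] = unescape(value.strip())
    return mapping


# The list stands in for the lines of a mapping file.
sample = [r'\r = \\n', '# a comment line', 'abc = xyz']
filter_map = parse_mapping(sample)

cleaned = 'one\rtwo'
for token, replacement in filter_map.items():
    cleaned = cleaned.replace(token, replacement)
```

With this, the file entry "\r = \\n" maps a real carriage return to a backslash followed by n, the same as the hard-coded replace('\r', '\\n'). An explicit table was chosen over a general-purpose decoder so that only the listed escapes are ever interpreted.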