| From | Peter Otten <__peter__@web.de> |
|---|---|
| Subject | Re: Reading \n unescaped from a file |
| Date | Thu, 03 Sep 2015 00:54:12 +0200 |
| Newsgroups | comp.lang.python |
| References | <55E65909.2080507@medimorphosis.com.au> <55E778C7.7050802@mrabarnett.plus.com> |
| Message-ID | <mailman.43.1441234465.8327.python-list@python.org> |

MRAB wrote:
> On 2015-09-02 03:03, Rob Hills wrote:
>> Hi,
>>
>> I am developing code (Python 3.4) that transforms text data from one
>> format to another.
>>
>> As part of the process, I had a set of hard-coded str.replace(...)
>> functions that I used to clean up the incoming text into the desired
>> output format, something like this:
>>
>> dataIn = dataIn.replace('\r', '\\n') # Tidy up linefeeds
>> dataIn = dataIn.replace('&lt;','<') # Tidy up < character
>> dataIn = dataIn.replace('&gt;','>') # Tidy up > character
>> dataIn = dataIn.replace('&#111;','o') # No idea why but lots of these: convert to 'o' character
>> dataIn = dataIn.replace('&#102;','f') # .. and these: convert to 'f' character
>> dataIn = dataIn.replace('&#101;','e') # .. 'e'
>> dataIn = dataIn.replace('&#079;','O') # .. 'O'
>>
> The problem with this approach is that the order of the replacements
> matters. For example, changing '&lt;' to '<' and then '&amp;' to '&'
> can give a different result to changing '&amp;' to '&' and then '&lt;'
> to '<'. If you started with the string '&amp;lt;', then the first order
> would go '&amp;lt;' => '&amp;lt;' => '&lt;', whereas the second order
> would go '&amp;lt;' => '&lt;' => '<'.
>
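
The order dependence MRAB describes is easy to reproduce. The following is only a sketch: the helper names replace_sequential() and replace_single_pass() are made up for illustration and appear nowhere in the original code. The single-pass variant, built on re.sub(), is one possible way to avoid the problem, since text that has already been replaced is never scanned again.

```python
import re

def replace_sequential(text, pairs):
    """Apply str.replace() once per (old, new) pair, in the given order."""
    for old, new in pairs:
        text = text.replace(old, new)
    return text

def replace_single_pass(text, mapping):
    """Replace every key of mapping in one scan (hypothetical helper)."""
    # Longest keys first, so a longer token wins over a shorter prefix of it.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)

text = "&amp;lt;"
first = replace_sequential(text, [("&lt;", "<"), ("&amp;", "&")])
second = replace_sequential(text, [("&amp;", "&"), ("&lt;", "<")])
single = replace_single_pass(text, {"&lt;": "<", "&amp;": "&"})
print(first, second, single)
```

With a plain dict driving the loop, the sequential version depends on whatever order the dict happens to iterate in, whereas the single-pass version gives the same result regardless.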
>> These statements transform my data correctly, but the list of statements
>> grows as I test the data so I thought it made sense to store the
>> replacement mappings in a file, read them into a dict and loop through
>> that to do the cleaning up, like this:
>>
>> with open(fileName, 'r+t', encoding='utf-8') as mapFile:
>>     for line in mapFile:
>>         line = line.strip()
>>         try:
>>             if (line) and not line.startswith('#'):
>>                 line = line.split('#')[:1][0].strip() # trim any trailing comments
>>                 name, value = line.split('=')
>>                 name = name.strip()
>>                 self.filterMap[name]=value.strip()
>>         except:
>>             self.logger.error('exception occurred parsing line [{0}] in file [{1}]'.format(line, fileName))
>>             raise
>>
>> Elsewhere, I use the following code to do the actual cleaning up:
>>
>> def filter(self, dataIn):
>>     if dataIn:
>>         for token, replacement in self.filterMap.items():
>>             dataIn = dataIn.replace(token, replacement)
>>     return dataIn
>>
>>
>> My mapping file contents look like this:
>>
>> \r = \\n
>> â = "
>> &lt; = <
>> &gt; = >
>> ' = '
>> &#070; = F
>> &#111; = o
>> &#102; = f
>> &#101; = e
>> &#079; = O
>>
>> This all works "as advertised" */except/* for the '\r' => '\\n'
>> replacement. Debugging the code, I see that my '\r' character is
>> "escaped" to '\\r' and the '\\n' to '\\\\n' when they are read in from
>> the file.
>>
>> I've been googling hard and reading the Python docs, trying to get my
>> head around character encoding, but I just can't figure out how to get
>> these bits of code to do what I want.
>>
>> It seems to me that I need to either:
>>
>> * change the way I represent '\r' and '\\n' in my mapping file; or
>> * transform them somehow when I read them in
>>
>> However, I haven't figured out how to do either of these.
>>
> Try ast.literal_eval, although you'd need to make it look like a string
> literal first:
>
> >>> import ast
> >>> line = r'\r = \\n'
> >>> print(line)
> \r = \\n
> >>> old, sep, new = line.partition(' = ')
> >>> print(old)
> \r
> >>> print(new)
> \\n
> >>> ast.literal_eval('"%s"' % old)
> '\r'
> >>> ast.literal_eval('"%s"' % new)
> '\\n'
> >>>
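
MRAB's session could be wrapped up roughly like this. It is a sketch under assumptions: unescape_literal() and parse_line() are invented names, and the fallback of returning the text unchanged on a malformed escape is just one possible policy. Embedded double quotes are escaped first, because otherwise a value such as `"` would break the wrapped literal.

```python
import ast

def unescape_literal(text):
    """Interpret backslash escapes in text via ast.literal_eval (hypothetical helper)."""
    # Escape embedded double quotes so the wrapped text stays one string literal.
    quoted = '"%s"' % text.replace('"', r'\"')
    try:
        return ast.literal_eval(quoted)
    except (SyntaxError, ValueError):
        # Assumption: leave text that is not a valid literal untouched.
        return text

def parse_line(line):
    """Split 'old = new' and unescape both sides."""
    old, sep, new = line.partition(' = ')
    return unescape_literal(old), unescape_literal(new)

old, new = parse_line(r'\r = \\n')
print(repr(old), repr(new))
```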

There's also codecs.decode():

>>> import codecs
>>> codecs.decode(r"\r = \\n", "unicode-escape")
'\r = \\n'

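
One caveat with "unicode-escape" is that it treats its input as Latin-1, so a non-ASCII character in the mapping file (a curly quote, say) would come out as mojibake if the str is decoded directly. A possible workaround, sketched here with the invented names decode_escapes() and parse_mapping_line(), is to encode with the backslashreplace error handler first:

```python
def decode_escapes(text):
    """Interpret backslash escapes without mangling non-ASCII text (hypothetical helper)."""
    # Characters outside Latin-1 become \uXXXX escapes on the way out,
    # and unicode-escape turns them back into the original characters.
    return text.encode("latin-1", "backslashreplace").decode("unicode-escape")

def parse_mapping_line(line):
    """Split 'old = new' on the first ' = ' and unescape both sides."""
    old, sep, new = line.partition(" = ")
    if not sep:
        raise ValueError("no ' = ' separator in %r" % line)
    return decode_escapes(old), decode_escapes(new)

pair = parse_mapping_line(r"\r = \\n")
print(pair)
```

Decoding each side after the split, instead of the whole line before it, keeps an escaped character from being confused with the separator.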
> I wouldn't put the &#...; forms into the mappings file (except for the
> ' one) because they can all be recognised and done in code
> ('&#070;' is chr(int('070')), for example).

Or

>>> import html
>>> html.unescape("&lt; &ouml; &#070;")
'< ö F'

Even if you cannot use unescape() directly you might steal the
implementation.
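
Put together, the cleaning step might shrink to something like the following. This is a sketch only: clean() and extra_map are made-up names, and it assumes Python 3.4 or later, where html.unescape() is available. The mapping then only needs the entries that are not HTML character references.

```python
import html

def clean(text, extra_map):
    """Decode HTML character references, then apply the remaining replacements."""
    text = html.unescape(text)  # handles &lt;, &gt;, &#070;, &#111;, ...
    for token, replacement in extra_map.items():
        text = text.replace(token, replacement)
    return text

# The carriage return becomes a literal backslash followed by 'n'.
result = clean("&lt;p&gt;&#070;&#111;&#111;\r", {"\r": "\\n"})
print(result)
```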