| Path | csiph.com!usenet.pasdenom.info!weretis.net!feeder1.news.weretis.net!feeder.erje.net!eu.feeder.erje.net!xlned.com!feeder3.xlned.com!newsfeed.xs4all.nl!newsfeed2.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail |
|---|---|
| Return-Path | <rosuav@gmail.com> |
| X-Original-To | python-list@python.org |
| Delivered-To | python-list@mail.python.org |
| MIME-Version | 1.0 |
| X-Received | by 10.52.156.39 with SMTP id wb7mr8595vdb.97.1399666081727; Fri, 09 May 2014 13:08:01 -0700 (PDT) |
| In-Reply-To | <ea305e19-be61-469b-8a15-0753406f8476@googlegroups.com> |
| References | <ea305e19-be61-469b-8a15-0753406f8476@googlegroups.com> |
| Date | Sat, 10 May 2014 06:08:01 +1000 |
| Subject | Re: Why isn't my re.sub replacing the contents of my MS Word file? |
| From | Chris Angelico <rosuav@gmail.com> |
| Cc | "python-list@python.org" <python-list@python.org> |
| Content-Type | text/plain; charset=UTF-8 |
| X-BeenThere | python-list@python.org |
| X-Mailman-Version | 2.1.15 |
| Precedence | list |
| List-Id | General discussion list for the Python programming language <python-list.python.org> |
| List-Unsubscribe | <https://mail.python.org/mailman/options/python-list>, <mailto:python-list-request@python.org?subject=unsubscribe> |
| List-Archive | <http://mail.python.org/pipermail/python-list/> |
| List-Post | <mailto:python-list@python.org> |
| List-Help | <mailto:python-list-request@python.org?subject=help> |
| List-Subscribe | <https://mail.python.org/mailman/listinfo/python-list>, <mailto:python-list-request@python.org?subject=subscribe> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.9829.1399666083.18130.python-list@python.org> |
| Lines | 65 |
| NNTP-Posting-Host | 2001:888:2000:d::a6 |
| X-Trace | 1399666083 news.xs4all.nl 2845 [2001:888:2000:d::a6]:57338 |
| X-Complaints-To | abuse@xs4all.nl |
| Xref | csiph.com comp.lang.python:71188 |
On Sat, May 10, 2014 at 5:51 AM, <scottcabit@gmail.com> wrote:
> But the replaces are not having any effect. Obviously a syntax problem....what silly thing am I doing wrong?
>
> Thanks!
>
> fn = 'z:\Documentation\Software'
> def processdoc(fn,outfile):
>     fStr = open(fn, 'rb').read()
>     re.sub(b'&#x2012','-',fStr)
>     re.sub(b'&#x2013','-',fStr)
>     re.sub(b'&#x2014','-',fStr)
>     re.sub(b'&#x2015','-',fStr)
>     re.sub(b'&#x2E3A','-',fStr)
>     re.sub(b'&#x2E3B','-',fStr)
>     re.sub(b'&#x002D','-',fStr)
>     re.sub(b'&#x00AD','-',fStr)

I can see several things that might be wrong, but it's hard to say what *is* wrong without trying it.

1) Is the file close enough to text that you can even do this sort of parsing? You say it's an MS Word file; that, unfortunately, could mean a lot of things. Some of the newer formats are basically zipped XML, so translations like this won't work. Other forms of Word document may be closer to text, but you majorly risk corrupting the binary content.

2) How are characters represented? Are they actually stored in the file with ampersands, hashes, etc? Your source strings are all seven bytes long, and will look for exactly those bytes. There must be some form of character encoding used; possibly, instead of the &#x notation, you need to UTF-8 or UTF-16LE encode the characters to look for.

3) You're doing simple string replacements using regular expressions. I don't think any of your symbols here is a metacharacter, but I might be wrong. If you're simply replacing one stream of bytes with another, don't use regex at all, just use string replacement.

4) There's nothing in your current code to actually write the contents anywhere. You do all the changes and then do nothing with the result. Or is this just part of the code?

5) Similarly, there's nothing in this fragment that actually calls processdoc(). Did you elide that? The fragment you wrote will do a whole lot of nothing, on its own.

6) There's no file extension on your input file name; be sure you really have the file you want, and not (for instance) a directory. Or if you need to iterate over all the files in a directory, you'll need to do that explicitly.

7) This one isn't technically a problem, but it's a risk. The string 'z:\Documentation\Software' has two backslash escapes, \D and \S, which the parser fails to recognize and therefore passes through literally. So it works, currently. However, if you were to change the path to, say, 'z:\Documentation\backups', then it would suddenly fail, because \b is a recognized escape. There are several solutions to this:

7a) fn = r'z:\Documentation\Software'
7b) fn = 'z:\\Documentation\\Software'
7c) fn = 'z:/Documentation/Software'

Hope that helps some, at least! A fuller program would be easier to work with.

ChrisA
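
As a rough illustration of points 2 to 4 above, here is a minimal sketch of byte-level replacement without regular expressions. It assumes the input really is plain text in a known encoding, which a Word document may well not be. The names replace_dashes and process_file and the UTF-8 default are hypothetical choices for this example and do not come from the original code.

```python
# Sketch only: function names and the encoding default are assumptions
# made for this example, not code from the original thread.

# Dash-like code points from the quoted snippet. U+002D is left out
# because replacing '-' with '-' would be a no-op.
DASH_LIKE = "\u2012\u2013\u2014\u2015\u2e3a\u2e3b\u00ad"


def replace_dashes(data: bytes, encoding: str = "utf-8") -> bytes:
    """Return a copy of data with each dash-like character replaced by '-'."""
    hyphen = "-".encode(encoding)
    for ch in DASH_LIKE:
        # bytes.replace returns a new object and leaves the original
        # untouched, so the name has to be rebound to keep the result.
        data = data.replace(ch.encode(encoding), hyphen)
    return data


def process_file(src: str, dst: str, encoding: str = "utf-8") -> None:
    """Read src as bytes, replace the dashes, and write the result to dst."""
    with open(src, "rb") as infile:
        data = infile.read()
    with open(dst, "wb") as outfile:
        outfile.write(replace_dashes(data, encoding))
```

For UTF-16 data, pass "utf-16-le" rather than "utf-16", since the latter prepends a byte order mark when encoding. A byte-level search can also match at an odd offset in UTF-16 data, so decoding to str, replacing, and re-encoding is probably the safer route for real text files.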