| Path | csiph.com!usenet.pasdenom.info!weretis.net!feeder4.news.weretis.net!rt.uk.eu.org!newsfeed.xs4all.nl!newsfeed3.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail |
|---|---|
| Return-Path | <python.list@tim.thechases.com> |
| X-Original-To | python-list@python.org |
| Delivered-To | python-list@mail.python.org |
| Date | Fri, 9 May 2014 15:09:58 -0500 |
| From | Tim Chase <python.list@tim.thechases.com> |
| To | python-list@python.org |
| Subject | Re: Why isn't my re.sub replacing the contents of my MS Word file? |
| In-Reply-To | <ea305e19-be61-469b-8a15-0753406f8476@googlegroups.com> |
| References | <ea305e19-be61-469b-8a15-0753406f8476@googlegroups.com> |
| X-Mailer | Claws Mail 3.8.1 (GTK+ 2.24.10; x86_64-pc-linux-gnu) |
| Mime-Version | 1.0 |
| Content-Type | text/plain; charset=US-ASCII |
| Content-Transfer-Encoding | 7bit |
| X-AntiAbuse | This header was added to track abuse, please include it with any abuse report |
| X-AntiAbuse | Primary Hostname - boston.accountservergroup.com |
| X-AntiAbuse | Original Domain - python.org |
| X-AntiAbuse | Originator/Caller UID/GID - [47 12] / [47 12] |
| X-AntiAbuse | Sender Address Domain - tim.thechases.com |
| X-Get-Message-Sender-Via | boston.accountservergroup.com: authenticated_id: tim@thechases.com |
| X-BeenThere | python-list@python.org |
| X-Mailman-Version | 2.1.15 |
| Precedence | list |
| List-Id | General discussion list for the Python programming language <python-list.python.org> |
| List-Unsubscribe | <https://mail.python.org/mailman/options/python-list>, <mailto:python-list-request@python.org?subject=unsubscribe> |
| List-Archive | <http://mail.python.org/pipermail/python-list/> |
| List-Post | <mailto:python-list@python.org> |
| List-Help | <mailto:python-list-request@python.org?subject=help> |
| List-Subscribe | <https://mail.python.org/mailman/listinfo/python-list>, <mailto:python-list-request@python.org?subject=subscribe> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.9830.1399666223.18130.python-list@python.org> (permalink) |
| Lines | 42 |
| NNTP-Posting-Host | 2001:888:2000:d::a6 |
| X-Trace | 1399666223 news.xs4all.nl 2977 [2001:888:2000:d::a6]:32867 |
| X-Complaints-To | abuse@xs4all.nl |
| Xref | csiph.com comp.lang.python:71189 |
On 2014-05-09 12:51, scottcabit@gmail.com wrote:
> here is a snippet of code that opens a file (fn contains the
> path\name) and first tried to replace all endash, emdash etc
> characters with simple dash characters, before doing a search. But
> the replaces are not having any effect. Obviously a syntax
> problem....what silly thing am I doing wrong?
>
>   fn = 'z:\Documentation\Software'
>   def processdoc(fn,outfile):
>     fStr = open(fn, 'rb').read()
>     re.sub(b'‒','-',fStr)
>     re.sub(b'–','-',fStr)
>     re.sub(b'—','-',fStr)
>     re.sub(b'―','-',fStr)
>     re.sub(b'⸺','-',fStr)
>     re.sub(b'⸻','-',fStr)
>     re.sub(b'-','-',fStr)
>     re.sub(b'­','-',fStr)

A Word doc (as your subject mentions) is a binary format. There's
the older .doc and the newer .docx (which is actually a .zip file with
a particular content-structure renamed to .docx).

Your example doesn't show the extension, so it's hard to tell whether
you're working with the old format or the new format.

That said, a simple replacement *certainly* won't work for a .docx
file, as you'd have to uncompress the contents, open up the various
files inside, perform the replacements, then zip everything back up,
and save the result back out.

For the older .doc file, it's a binary format, so even if you can
successfully find & swap out sequences of 7 chars for a single char,
it might screw up the internal offsets, breaking your file.

Additionally, I vaguely remember sparring with them using some 16-bit
wide characters in .doc files, so you might have to search for
atrocious things like b"\x00&\x00#\x00x\x002\x000\x001\x002" (each
character being prefixed with "\x00").

-tkc
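
The .docx round trip described above (unzip, edit the parts, zip back up) might look roughly like the sketch below. It is only an illustration under several assumptions that are not in the original message: Python 3, XML parts encoded as UTF-8, and a hypothetical helper name `replace_dashes_in_docx` with a hypothetical `DASHES` pattern covering just four dash variants. Note also that `re.sub` returns a new value instead of changing its argument, so the result has to be kept. The last line shows that the byte string quoted for the old .doc case is what "&#x2012" looks like when encoded as big-endian UTF-16; whether a given .doc file really stores its text that way is not something this sketch checks.

```python
import re
import zipfile

# Hypothetical pattern: figure dash, en dash, em dash, horizontal bar.
DASHES = re.compile("[\u2012\u2013\u2014\u2015]")


def replace_dashes_in_docx(src_path, dst_path):
    """Copy a .docx to dst_path, replacing dash variants in its XML parts."""
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename.endswith(".xml"):
                # Assumption: the XML parts are UTF-8 encoded.
                text = data.decode("utf-8")
                # sub() returns a new string, so the result must be assigned.
                text = DASHES.sub("-", text)
                data = text.encode("utf-8")
            # Non-XML parts (images etc.) are copied through unchanged.
            dst.writestr(info, data)


# The "atrocious" pattern from the message: each character of "&#x2012"
# preceded by a zero byte, i.e. its big-endian UTF-16 encoding.
UTF16_BE_PATTERN = "&#x2012".encode("utf-16-be")
```

Writing to a separate output path rather than overwriting the input keeps the original document intact if something goes wrong partway through.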
Why isn't my re.sub replacing the contents of my MS Word file? scottcabit@gmail.com - 2014-05-09 12:51 -0700
Re: Why isn't my re.sub replacing the contents of my MS Word file? MRAB <python@mrabarnett.plus.com> - 2014-05-09 21:03 +0100
Re: Why isn't my re.sub replacing the contents of my MS Word file? scottcabit@gmail.com - 2014-05-09 13:46 -0700
Re: Why isn't my re.sub replacing the contents of my MS Word file? Chris Angelico <rosuav@gmail.com> - 2014-05-10 06:08 +1000
Re: Why isn't my re.sub replacing the contents of my MS Word file? Tim Chase <python.list@tim.thechases.com> - 2014-05-09 15:09 -0500
Re: Why isn't my re.sub replacing the contents of my MS Word file? scottcabit@gmail.com - 2014-05-09 13:49 -0700
Re: Why isn't my re.sub replacing the contents of my MS Word file? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-10 00:31 +0000
Re: Why isn't my re.sub replacing the contents of my MS Word file? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-10 00:12 +0000
Re: Why isn't my re.sub replacing the contents of my MS Word file? scottcabit@gmail.com - 2014-05-12 10:35 -0700
Re: Why isn't my re.sub replacing the contents of my MS Word file? Rustom Mody <rustompmody@gmail.com> - 2014-05-12 20:00 -0700
Re: Why isn't my re.sub replacing the contents of my MS Word file? Dave Angel <davea@davea.name> - 2014-05-12 17:15 -0400
Re: Why isn't my re.sub replacing the contents of my MS Word file? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-13 13:49 +0000
Re: Why isn't my re.sub replacing the contents of my MS Word file? Chris Angelico <rosuav@gmail.com> - 2014-05-13 23:55 +1000
Re: Why isn't my re.sub replacing the contents of my MS Word file? scottcabit@gmail.com - 2014-05-13 12:01 -0700
Re: Why isn't my re.sub replacing the contents of my MS Word file? MRAB <python@mrabarnett.plus.com> - 2014-05-13 21:26 +0100
Re: Why isn't my re.sub replacing the contents of my MS Word file? wxjmfauth@gmail.com - 2014-05-13 23:12 -0700
Re: Why isn't my re.sub replacing the contents of my MS Word file? alister <alister.nospam.ware@ntlworld.com> - 2014-05-14 13:21 +0000
Re: Why isn't my re.sub replacing the contents of my MS Word file? scottcabit@gmail.com - 2014-05-14 07:40 -0700
Re: Why isn't my re.sub replacing the contents of my MS Word file? Rustom Mody <rustompmody@gmail.com> - 2014-05-09 21:22 -0700
Re: Why isn't my re.sub replacing the contents of my MS Word file? wxjmfauth@gmail.com - 2014-05-10 00:11 -0700
Re: Why isn't my re.sub replacing the contents of my MS Word file? Tim Golden <mail@timgolden.me.uk> - 2014-05-10 09:49 +0100