Re: Why isn't my re.sub replacing the contents of my MS Word file?

Date Fri, 9 May 2014 15:09:58 -0500
From Tim Chase <python.list@tim.thechases.com>
To python-list@python.org
Subject Re: Why isn't my re.sub replacing the contents of my MS Word file?
Newsgroups comp.lang.python
Message-ID <mailman.9830.1399666223.18130.python-list@python.org> (permalink)


On 2014-05-09 12:51, scottcabit@gmail.com wrote:
> here is a snippet of code that opens a file (fn contains the
> path\name) and first tries to replace all endash, emdash etc
> characters with simple dash characters, before doing a search. But
> the replaces are not having any effect. Obviously a syntax
> problem... what silly thing am I doing wrong?
> 
> fn = 'z:\Documentation\Software'
> def processdoc(fn,outfile):
>     fStr = open(fn, 'rb').read()
>     re.sub(b'&#x2012','-',fStr)
>     re.sub(b'&#x2013','-',fStr)
>     re.sub(b'&#x2014','-',fStr)
>     re.sub(b'&#x2015','-',fStr)
>     re.sub(b'&#x2E3A','-',fStr)
>     re.sub(b'&#x2E3B','-',fStr)
>     re.sub(b'&#x002D','-',fStr)
>     re.sub(b'&#x00AD','-',fStr)

A Word doc (as your subject mentions) is a binary format.  There's
the older .doc and the newer .docx (which is actually a .zip file
with a particular content-structure renamed to .docx).

Your example doesn't show the extension, so it's hard to tell whether
you're working with the old format or the new format.

That said, a simple replacement *certainly* won't work for a .docx
file, as you'd have to uncompress the contents, open up the various
files inside, perform the replacements, then zip everything back up,
and save the result back out.
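
As a rough sketch of that unzip/replace/re-zip round trip (not
tested against real Word output): the function name replace_in_docx
is made up for illustration, and it assumes the body text lives in
the "word/document.xml" member as UTF-8 XML with the dash stored as
a literal character rather than an entity.

```python
import io
import zipfile


def replace_in_docx(src_bytes, old, new, member="word/document.xml"):
    """Return new .docx bytes with `old` replaced by `new` in one member.

    Hypothetical helper: `member` is assumed to be the main document
    part and assumed to be UTF-8 encoded XML.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(src_bytes)) as zin, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            data = zin.read(info.filename)
            if info.filename == member:
                # Decode, swap the characters, and re-encode this member only.
                text = data.decode("utf-8")
                data = text.replace(old, new).encode("utf-8")
            # Every other member is copied through unchanged.
            zout.writestr(info, data)
    return out.getvalue()
```

You'd call it along the lines of
replace_in_docx(open(fn, 'rb').read(), '\u2013', '-') and write the
returned bytes to a new file rather than over the original. A
multi-character search string may be split across several XML
elements, so this is only likely to be dependable for single
characters.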

For the older .doc file, it's a binary format, so even if you can
successfully find & swap out sequences of 7 chars for a single char,
it might screw up the internal offsets, breaking your file.
Additionally, I vaguely remember sparring with .doc files that used
16-bit wide characters, so you might have to search for atrocious
things like b"\x00&\x00#\x00x\x002\x000\x001\x002" (each character
being prefixed with "\x00").
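
Rather than typing those byte strings by hand, you can let Python
build them. A small sketch: count_needle is a made-up helper, and
since I'm not certain which byte order a given .doc uses, it just
counts matches for the plain ASCII form and both UTF-16 orders.

```python
needle = "&#x2012"

# The same 7 characters as plain ASCII and as 16-bit wide characters.
as_ascii = needle.encode("ascii")
as_utf16_be = needle.encode("utf-16-be")  # each char prefixed with \x00
as_utf16_le = needle.encode("utf-16-le")  # each char followed by \x00


def count_needle(blob):
    """Count occurrences of each encoded form of `needle` in `blob`."""
    patterns = {
        "ascii": as_ascii,
        "utf-16-be": as_utf16_be,
        "utf-16-le": as_utf16_le,
    }
    return {name: blob.count(pat) for name, pat in patterns.items()}
```

Running count_needle(open(fn, 'rb').read()) would at least tell you
which form (if any) actually occurs in the file before you try to
replace anything.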

-tkc
