Re: Why isn't my re.sub replacing the contents of my MS Word file?

Date 2014-05-09 15:09 -0500
From Tim Chase <python.list@tim.thechases.com>
Subject Re: Why isn't my re.sub replacing the contents of my MS Word file?
References <ea305e19-be61-469b-8a15-0753406f8476@googlegroups.com>
Newsgroups comp.lang.python
Message-ID <mailman.9830.1399666223.18130.python-list@python.org>


On 2014-05-09 12:51, scottcabit@gmail.com wrote:
> Here is a snippet of code that opens a file (fn contains the
> path\name) and first tries to replace all endash, emdash etc
> characters with simple dash characters, before doing a search. But
> the replaces are not having any effect. Obviously a syntax
> problem... what silly thing am I doing wrong?
> 
> fn = 'z:\Documentation\Software'
> def processdoc(fn,outfile):
>     fStr = open(fn, 'rb').read()
>     re.sub(b'&#x2012','-',fStr)
>     re.sub(b'&#x2013','-',fStr)
>     re.sub(b'&#x2014','-',fStr)
>     re.sub(b'&#x2015','-',fStr)
>     re.sub(b'&#x2E3A','-',fStr)
>     re.sub(b'&#x2E3B','-',fStr)
>     re.sub(b'&#x002D','-',fStr)
>     re.sub(b'&#x00AD','-',fStr)

A Word doc (as your subject mentions) is a binary format.  There's
the older .doc and the newer .docx (which is actually a .zip file
with a particular content-structure renamed to .docx).

Your example doesn't show the extension, so it's hard to tell whether
you're working with the old format or the new format.

That said, a simple replacement *certainly* won't work for a .docx
file, as you'd have to uncompress the contents, open up the various
files inside, perform the replacements, then zip everything back up,
and save the result back out.
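
As a rough sketch of that unzip/replace/re-zip round trip, something
like the following might do. The function name and the list of dash
characters are illustrative, and it assumes the XML parts inside the
archive are UTF-8 encoded, which is an assumption rather than
something a given file guarantees:

```python
import re
import zipfile

# Dash-like characters from the original question, as real code points
# (assumption: the file holds the characters, not "&#x2012"-style text).
DASHES = re.compile("[\u2012\u2013\u2014\u2015\u2E3A\u2E3B\u00AD]")

def replace_dashes_in_docx(src_path, dst_path):
    """Copy a .docx, replacing dash variants with '-' in its XML parts."""
    with zipfile.ZipFile(src_path) as src, \
         zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename.endswith(".xml"):
                # Assumes UTF-8; other members are copied byte for byte.
                text = data.decode("utf-8")
                data = DASHES.sub("-", text).encode("utf-8")
            dst.writestr(item, data)
```

Writing to a new file rather than over the original keeps the source
intact if anything goes wrong partway through.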

For the older .doc file, it's a binary format, so even if you can
successfully find & swap out sequences of 7 chars for a single char,
it might screw up the internal offsets, breaking your file.
Additionally, I vaguely remember wrestling with .doc files that
store text as 16-bit wide characters, so you might have to search for
atrocious things like b"\x00&\x00#\x00x\x002\x000\x001\x002" (each
character being prefixed with "\x00").
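
Rather than typing those byte strings by hand, you could let Python
build them. This is only a sketch: wide_pattern is a made-up helper
name, and which byte order (if either) a particular .doc uses is an
assumption you'd have to check against the file, so it may be worth
trying both:

```python
def wide_pattern(text, byte_order="be"):
    """Return `text` as 16-bit code units (UTF-16 without a BOM).

    byte_order is "be" (zero byte first for ASCII, as in the example
    above) or "le" (zero byte second).
    """
    return text.encode("utf-16-" + byte_order)

entity = "&#x2012"

# Zero byte before each ASCII character, matching the pattern above.
big_endian = wide_pattern(entity, "be")

# Zero byte after each ASCII character.
little_endian = wide_pattern(entity, "le")

# If the file stores the actual character instead of the entity text,
# the thing to search for is just its two-byte encoding.
figure_dash = wide_pattern("\u2012", "le")
```

These are plain bytes objects, so they can be passed to bytes.find()
or, after re.escape(), to re.sub() on data read in 'rb' mode.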

-tkc
