

Groups > comp.lang.python > #71185 > unrolled thread

Why isn't my re.sub replacing the contents of my MS Word file?

Started by: scottcabit@gmail.com
First post: 2014-05-09 12:51 -0700
Last post: 2014-05-10 09:49 +0100
Articles: 20 on this page of 21, 10 participants



Contents

  Why isn't my re.sub replacing the contents of my MS Word file? scottcabit@gmail.com - 2014-05-09 12:51 -0700
    Re: Why isn't my re.sub replacing the contents of my MS Word file? MRAB <python@mrabarnett.plus.com> - 2014-05-09 21:03 +0100
      Re: Why isn't my re.sub replacing the contents of my MS Word file? scottcabit@gmail.com - 2014-05-09 13:46 -0700
    Re: Why isn't my re.sub replacing the contents of my MS Word file? Chris Angelico <rosuav@gmail.com> - 2014-05-10 06:08 +1000
    Re: Why isn't my re.sub replacing the contents of my MS Word file? Tim Chase <python.list@tim.thechases.com> - 2014-05-09 15:09 -0500
      Re: Why isn't my re.sub replacing the contents of my MS Word file? scottcabit@gmail.com - 2014-05-09 13:49 -0700
        Re: Why isn't my re.sub replacing the contents of my MS Word file? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-10 00:31 +0000
    Re: Why isn't my re.sub replacing the contents of my MS Word file? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-10 00:12 +0000
      Re: Why isn't my re.sub replacing the contents of my MS Word file? scottcabit@gmail.com - 2014-05-12 10:35 -0700
        Re: Why isn't my re.sub replacing the contents of my MS Word file? Rustom Mody <rustompmody@gmail.com> - 2014-05-12 20:00 -0700
        Re: Why isn't my re.sub replacing the contents of my MS Word file? Dave Angel <davea@davea.name> - 2014-05-12 17:15 -0400
        Re: Why isn't my re.sub replacing the contents of my MS Word file? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-13 13:49 +0000
          Re: Why isn't my re.sub replacing the contents of my MS Word file? Chris Angelico <rosuav@gmail.com> - 2014-05-13 23:55 +1000
          Re: Why isn't my re.sub replacing the contents of my MS Word file? scottcabit@gmail.com - 2014-05-13 12:01 -0700
            Re: Why isn't my re.sub replacing the contents of my MS Word file? MRAB <python@mrabarnett.plus.com> - 2014-05-13 21:26 +0100
              Re: Why isn't my re.sub replacing the contents of my MS Word file? wxjmfauth@gmail.com - 2014-05-13 23:12 -0700
                Re: Why isn't my re.sub replacing the contents of my MS Word file? alister <alister.nospam.ware@ntlworld.com> - 2014-05-14 13:21 +0000
              Re: Why isn't my re.sub replacing the contents of my MS Word file? scottcabit@gmail.com - 2014-05-14 07:40 -0700
    Re: Why isn't my re.sub replacing the contents of my MS Word file? Rustom Mody <rustompmody@gmail.com> - 2014-05-09 21:22 -0700
      Re: Why isn't my re.sub replacing the contents of my MS Word file? wxjmfauth@gmail.com - 2014-05-10 00:11 -0700
        Re: Why isn't my re.sub replacing the contents of my MS Word file? Tim Golden <mail@timgolden.me.uk> - 2014-05-10 09:49 +0100



#71185 — Why isn't my re.sub replacing the contents of my MS Word file?

From: scottcabit@gmail.com
Date: 2014-05-09 12:51 -0700
Subject: Why isn't my re.sub replacing the contents of my MS Word file?
Message-ID: <ea305e19-be61-469b-8a15-0753406f8476@googlegroups.com>
Hi,

 here is a snippet of code that opens a file (fn contains the path\name) and first tries to replace all en dash, em dash, etc. characters with simple dash characters before doing a search.
  But the replaces are not having any effect. Obviously a syntax problem... what silly thing am I doing wrong?

  Thanks!

fn = 'z:\Documentation\Software'
def processdoc(fn,outfile):
    fStr = open(fn, 'rb').read()
    re.sub(b'&#x2012','-',fStr)
    re.sub(b'&#x2013','-',fStr)
    re.sub(b'&#x2014','-',fStr)
    re.sub(b'&#x2015','-',fStr)
    re.sub(b'&#x2E3A','-',fStr)
    re.sub(b'&#x2E3B','-',fStr)
    re.sub(b'&#x002D','-',fStr)
    re.sub(b'&#x00AD','-',fStr)



#71187

From: MRAB <python@mrabarnett.plus.com>
Date: 2014-05-09 21:03 +0100
Message-ID: <mailman.9828.1399666008.18130.python-list@python.org>
In reply to: #71185
On 2014-05-09 20:51, scottcabit@gmail.com wrote:
> Hi,
>
>   here is a snippet of code that opens a file (fn contains the path\name) and first tried to replace all endash, emdash etc characters with simple dash characters, before doing a search.
>    But the replaces are not having any effect. Obviously a syntax problem....wwhat silly thing am I doing wrong?
>
>    Thanks!
>
> fn = 'z:\Documentation\Software'
> def processdoc(fn,outfile):
>      fStr = open(fn, 'rb').read()
>      re.sub(b'&#x2012','-',fStr)
>      re.sub(b'&#x2013','-',fStr)
>      re.sub(b'&#x2014','-',fStr)
>      re.sub(b'&#x2015','-',fStr)
>      re.sub(b'&#x2E3A','-',fStr)
>      re.sub(b'&#x2E3B','-',fStr)
>      re.sub(b'&#x002D','-',fStr)
>      re.sub(b'&#x00AD','-',fStr)
>
re.sub _returns_ its result (strings are immutable).
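
A minimal sketch of this point, using a made-up byte string rather than the poster's file: calling re.sub on its own leaves the input untouched, and only rebinding a name keeps the result. The sample data is hypothetical; pattern, replacement and subject are all kept as bytes so the types agree.

```python
import re

# Hypothetical sample data standing in for the file contents.
data = b'Part 12&#x2013A'

# The return value is discarded here, so `data` is unchanged.
re.sub(b'&#x2013', b'-', data)
unchanged = data

# Binding the return value to a name keeps the substituted copy.
fixed = re.sub(b'&#x2013', b'-', data)
```

Whether the substitution finds anything in a real Word file is a separate question, which the rest of the thread deals with.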



#71191

From: scottcabit@gmail.com
Date: 2014-05-09 13:46 -0700
Message-ID: <8126fa0d-7480-41dd-a4bf-60e2a02ec272@googlegroups.com>
In reply to: #71187
> 
> re.sub _returns_ its result (strings are immutable).

  Ahh....so I tried this for each re.sub

  fStr = re.sub(b'&#x2012','-',fStr)

  No errors running it, but it still does nothing.....



#71188

From: Chris Angelico <rosuav@gmail.com>
Date: 2014-05-10 06:08 +1000
Message-ID: <mailman.9829.1399666083.18130.python-list@python.org>
In reply to: #71185
On Sat, May 10, 2014 at 5:51 AM,  <scottcabit@gmail.com> wrote:
>   But the replaces are not having any effect. Obviously a syntax problem....wwhat silly thing am I doing wrong?
>
>   Thanks!
>
> fn = 'z:\Documentation\Software'
> def processdoc(fn,outfile):
>     fStr = open(fn, 'rb').read()
>     re.sub(b'&#x2012','-',fStr)
>     re.sub(b'&#x2013','-',fStr)
>     re.sub(b'&#x2014','-',fStr)
>     re.sub(b'&#x2015','-',fStr)
>     re.sub(b'&#x2E3A','-',fStr)
>     re.sub(b'&#x2E3B','-',fStr)
>     re.sub(b'&#x002D','-',fStr)
>     re.sub(b'&#x00AD','-',fStr)

I can see several things that might be wrong, but it's hard to say
what *is* wrong without trying it.

1) Is the file close enough to text that you can even do this sort of
parsing? You say it's an MS Word file; that, unfortunately, could mean
a lot of things. Some of the newer formats are basically zipped XML,
so translations like this won't work. Other forms of Word document may
be closer to text, but you majorly risk corrupting the binary content.

2) How are characters represented? Are they actually stored in the
file with ampersands, hashes, etc? Your source strings are all seven
bytes long, and will look for exactly those bytes. There must be some
form of character encoding used; possibly, instead of the &#x
notation, you need to UTF-8 or UTF-16LE encode the characters to look
for.

3) You're doing simple string replacements using regular expressions.
I don't think any of your symbols here is a metacharacter, but I might
be wrong. If you're simply replacing one stream of bytes with another,
don't use regex at all, just use string replacement.

4) There's nothing in your current code to actually write the contents
anywhere. You do all the changes and then do nothing with it. Or is
this just part of the code?

5) Similarly, there's nothing in this fragment that actually calls
processdoc(). Did you elide that? The fragment you wrote will do a
whole lot of nothing, on its own.

6) There's no file extension on your input file name; be sure you
really have the file you want, and not (for instance) a directory. Or
if you need to iterate over all the files in a directory, you'll need
to do that explicitly.

7) This one isn't technically a problem, but it's a risk. The string
'z:\Documentation\Software' has two backslash escapes \D and \S, which
the parser fails to recognize, and therefore passes through literally.
So it works, currently. However, if you were to change the path to,
say, 'z:\Documentation\backups', then it would suddenly fail. There
are several solutions to this:
7a) fn = r'z:\Documentation\Software'
7b) fn = 'z:\\Documentation\\Software'
7c) fn = 'z:/Documentation/Software'
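
To make point 7 concrete, here is a small sketch using the hypothetical 'backups' path from the example. It only compares string literals; nothing is opened on disk.

```python
# '\b' is the backspace escape, so this literal does not contain a
# backslash followed by 'b'; one character is silently swapped.
broken = 'z:\\Documentation\backups'

raw = r'z:\Documentation\backups'        # raw string: backslashes kept as written
doubled = 'z:\\Documentation\\backups'   # every backslash escaped explicitly
forward = 'z:/Documentation/backups'     # forward slashes, no escaping needed

same_string = (raw == doubled)           # the two spellings denote the same path
has_backspace = '\x08' in broken         # the backspace character crept in
```

Any of the three safe spellings avoids the problem; which one to use is mostly a matter of taste.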

Hope that helps some, at least! A more full program would be easier to
work with.

ChrisA



#71189

From: Tim Chase <python.list@tim.thechases.com>
Date: 2014-05-09 15:09 -0500
Message-ID: <mailman.9830.1399666223.18130.python-list@python.org>
In reply to: #71185
On 2014-05-09 12:51, scottcabit@gmail.com wrote:
>  here is a snippet of code that opens a file (fn contains the
> path\name) and first tried to replace all endash, emdash etc
> characters with simple dash characters, before doing a search. But
> the replaces are not having any effect. Obviously a syntax
> problem....wwhat silly thing am I doing wrong?
> 
> fn = 'z:\Documentation\Software'
> def processdoc(fn,outfile):
>     fStr = open(fn, 'rb').read()
>     re.sub(b'&#x2012','-',fStr)
>     re.sub(b'&#x2013','-',fStr)
>     re.sub(b'&#x2014','-',fStr)
>     re.sub(b'&#x2015','-',fStr)
>     re.sub(b'&#x2E3A','-',fStr)
>     re.sub(b'&#x2E3B','-',fStr)
>     re.sub(b'&#x002D','-',fStr)
>     re.sub(b'&#x00AD','-',fStr)

A Word doc (as your subject mentions) is a binary format.  There's
the older .doc and the newer .docx (which is actually a .zip file
with a particular content-structure renamed to .docx).

Your example doesn't show the extension, so it's hard to tell whether
you're working with the old format or the new format.

That said, a simple replacement *certainly* won't work for a .docx
file, as you'd have to uncompress the contents, open up the various
files inside, perform the replacements, then zip everything back up,
and save the result back out.

For the older .doc file, it's a binary format, so even if you can
successfully find & swap out sequences of 7 chars for a single char,
it might screw up the internal offsets, breaking your file.
Additionally, I vaguely remember sparring with them using some 16-bit
wide characters in .doc files so you might have to search for
atrocious things like b"\x00&\x00#\x00x\x002\x000\x001\x002" (each
character being prefixed with "\x00").
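
For illustration, this is what the wide-character forms look like if the text were stored as UTF-16. That is only a guess about the .doc internals, not something the thread confirms. The zero-prefixed pattern corresponds to big-endian order; little-endian puts the zero byte after each character instead.

```python
entity = '&#x2012'        # the seven literal characters from the original snippet
figure_dash = '\u2012'    # an actual FIGURE DASH character

entity_be = entity.encode('utf-16-be')      # each character preceded by a zero byte
entity_le = entity.encode('utf-16-le')      # each character followed by a zero byte
dash_le = figure_dash.encode('utf-16-le')   # just two bytes: 0x12 0x20
```

Note the difference in size: the spelled-out entity is fourteen bytes, while a real dash character is only two.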

-tkc



#71192

From: scottcabit@gmail.com
Date: 2014-05-09 13:49 -0700
Message-ID: <e253b9fe-c65f-4df7-b9b2-aaccb14b2e64@googlegroups.com>
In reply to: #71189
On Friday, May 9, 2014 4:09:58 PM UTC-4, Tim Chase wrote:

> A Word doc (as your subject mentions) is a binary format.  There's
> the older .doc and the newer .docx (which is actually a .zip file
> with a particular content-structure renamed to .docx).
> 
   I am using .doc files only......

> 
> For the older .doc file, it's a binary format, so even if you can
> successfully find & swap out sequences of 7 chars for a single char,
> it might screw up the internal offsets, breaking your file.

   I do not save the file out again, only try to change all en-dash and em-dash to dashes, then search and print things to another file, closing the searched file without writing it.

> 
> Additionally, I vaguely remember sparring with them using some 16-bit
> wide characters in .doc files so you might have to search for
> atrocious things like b"\x00&\x00#\x00x\x002\x000\x001\x002" (each
> character being prefixed with "\x00".

  Hmmm..thought that was what I was doing. Can anyone figure out why the syntax is wrong for Word 2007 document binary file data?



#71206

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2014-05-10 00:31 +0000
Message-ID: <536d736b$0$29980$c3e8da3$5496439d@news.astraweb.com>
In reply to: #71192
On Fri, 09 May 2014 13:49:56 -0700, scottcabit wrote:

> On Friday, May 9, 2014 4:09:58 PM UTC-4, Tim Chase wrote:
> 
>> A Word doc (as your subject mentions) is a binary format.  There's the
>> older .doc and the newer .docx (which is actually a .zip file with a
>> particular content-structure renamed to .docx).
>> 
>    I am using .doc files only......

Ah, my previous email missed the fact that you are operating on Word docs.

>> For the older .doc file, it's a binary format, so even if you can
>> successfully find & swap out sequences of 7 chars for a single char, it
>> might screw up the internal offsets, breaking your file.
> 
>    I do not save the file out again, only try to change all en-dash and
>    em-dash to dashes, then search and print things to another file,
>    closing the searched file without writing it.
> 
> 
>> Additionally, I vaguely remember sparring with them using some 16-bit
>> wide characters in .doc files so you might have to search for atrocious
>> things like b"\x00&\x00#\x00x\x002\x000\x001\x002" (each character
>> being prefixed with "\x00".
> 
>   Hmmm..thought that was what I was doing. Can anyone figure out why the
>   syntax is wrong for Word 2007 document binary file data?

You are searching for the literal "&#x2012", in other words:

    ampersand hash x two zero one two

*not* a FIGURE DASH. Compare:


py> import re
py> source = b'aaaa&#x2012aaaa'
py> print(source)
b'aaaa&#x2012aaaa'
py> re.sub(b'&#x2012', b'Z', source)
b'aaaaZaaaa'

But if the source contains an *actual* FIGURE DASH:

py> source = u'aaaa\u2012aaaa'.encode('utf-8')
py> print(source)
b'aaaa\xe2\x80\x92aaaa'
py> re.sub(b'&#x2012', b'Z', source)
b'aaaa\xe2\x80\x92aaaa'


You're dealing with a binary file format, and I believe it is an 
undocumented binary file format. You don't know which parts of the file 
represent text, metadata, formatting and layout information, or images. 
Even if you identify which parts are text, you don't know what encoding 
is used internally:

py> u'aaaa\u2012aaaa'.encode('utf-8')
b'aaaa\xe2\x80\x92aaaa'
py> u'aaaa\u2012aaaa'.encode('utf-16be')
b'\x00a\x00a\x00a\x00a \x12\x00a\x00a\x00a\x00a'
py> u'aaaa\u2012aaaa'.encode('utf-16le')
b'a\x00a\x00a\x00a\x00\x12 a\x00a\x00a\x00a\x00'

or something else.

You're on *extremely* thin ice here.

If you *must* do this, then you'll need to identify how Word stores 
various dashes in the file. If you're lucky, the textual parts of the doc 
file will be obvious to the eye, so open a few sample files using a hex 
editor and you might be able to identify what Word is using to store the 
various forms of dash.
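
Alongside a hex editor, a rough way to narrow down the candidates is to count how often a given dash would appear under a few plausible encodings. This is a sketch only: the helper name and the list of encodings are assumptions, and a hit can be a coincidence in binary data.

```python
def count_dash_encodings(blob, char='\u2013'):
    """Count occurrences of `char` in `blob` under a few candidate encodings."""
    candidates = ['utf-8', 'utf-16-le', 'utf-16-be', 'cp1252']
    counts = {}
    for name in candidates:
        try:
            needle = char.encode(name)
        except UnicodeEncodeError:
            continue  # this encoding cannot represent the character
        counts[name] = blob.count(needle)
    return counts

# Hypothetical sample: a short string stored as little-endian UTF-16.
sample = 'Part 7\u2013A'.encode('utf-16-le')
result = count_dash_encodings(sample)
```

On a real file the counts are only a hint about where to look, not proof of the encoding.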



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/



#71204

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2014-05-10 00:12 +0000
Message-ID: <536d6f08$0$29980$c3e8da3$5496439d@news.astraweb.com>
In reply to: #71185
On Fri, 09 May 2014 12:51:04 -0700, scottcabit wrote:

> Hi,
> 
>  here is a snippet of code that opens a file (fn contains the path\name)
>  and first tried to replace all endash, emdash etc characters with
>  simple dash characters, before doing a search.
>   But the replaces are not having any effect. Obviously a syntax
>   problem....wwhat silly thing am I doing wrong?

You're making the substitution, then throwing the result away.

And you're using a nuclear-powered bulldozer to crack a peanut. This is 
not a job for regexes, this is a job for normal string replacement.

> fn = 'z:\Documentation\Software'
> def processdoc(fn,outfile):
>     fStr = open(fn, 'rb').read()
>     re.sub(b'&#x2012','-',fStr)

Good:

    fStr = re.sub(b'&#x2012', b'-', fStr)

Better:

    fStr = fStr.replace(b'&#x2012', b'-')


But having said that, you actually can make use of the nuclear-powered 
bulldozer, and do all the replacements in one go:

Best:

    # Untested
    fStr = re.sub(b'&#x(?:201[2-5]|2E3[AB]|00[2A]D)', b'-', fStr)


If you're going to unload the power of regexes, unload them on something 
that makes it worthwhile. Replacing a constant, fixed string with another 
constant, fixed string does not require a regex.
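
One detail to watch when combining alternatives after a common prefix: `|` has the lowest precedence in a regex, so the alternatives need to sit inside a group for the `&#x` prefix to apply to each of them. A sketch with made-up input (the sample bytes are hypothetical):

```python
import re

# Grouped: the '&#x' prefix applies to every alternative.
grouped = re.compile(rb'&#x(?:201[2-5]|2E3[AB]|00[2A]D)')

# Ungrouped: '|' splits the whole pattern, so '2E3A' matches on its own.
ungrouped = re.compile(rb'&#x(201[2-5])|(2E3[AB])|(00[2A]D)')

source = b'a&#x2013b&#x2E3Ac&#x00ADd 2E3A'
result = grouped.sub(b'-', source)           # bare '2E3A' at the end is left alone
leftover = ungrouped.sub(b'-', b'&#x2E3A')   # prefix survives, only '2E3A' is replaced
```

The non-capturing `(?:...)` form is used because nothing needs to be extracted from the match.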



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/



#71396

From: scottcabit@gmail.com
Date: 2014-05-12 10:35 -0700
Message-ID: <6caea381-c765-41e7-9135-d5a0d60b7f42@googlegroups.com>
In reply to: #71204
On Friday, May 9, 2014 8:12:57 PM UTC-4, Steven D'Aprano wrote:

> Good:
>
>     fStr = re.sub(b'&#x2012', b'-', fStr)

  Doesn't work...the document has been verified to contain endash and emdash characters, but this does NOT replace them.

> Better:
>
>     fStr = fStr.replace(b'&#x2012', b'-')

   Still doesn't work

> But having said that, you actually can make use of the nuclear-powered
> bulldozer, and do all the replacements in one go:
>
> Best:
>
>     # Untested
>     fStr = re.sub(b'&#x(201[2-5])|(2E3[AB])|(00[2A]D)', b'-', fStr)

  Still doesn't work.

  Guess whatever the code is for endash and mdash are not the ones I am using....



#71423

From: Rustom Mody <rustompmody@gmail.com>
Date: 2014-05-12 20:00 -0700
Message-ID: <9e710486-eed0-4ae1-a858-895c49881dd8@googlegroups.com>
In reply to: #71396
On Monday, May 12, 2014 11:05:53 PM UTC+5:30, scott...@gmail.com wrote:
> On Friday, May 9, 2014 8:12:57 PM UTC-4, Steven D'Aprano wrote:
> >     fStr = fStr.replace(b'&#x2012', b'-')
> 
>    Still doesn't work
> 
> 
> > Best:
> > 
> > 
> >     # Untested
> > 
> >     fStr = re.sub(b'&#x(201[2-5])|(2E3[AB])|(00[2A]D)', b'-', fStr)
> 
>   Still doesn't work.
> 
>   Guess whatever the code is for endash and mdash are not the ones I am using....

What happens if you divide two strings?
>>> 'a' / 'b'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for /: 'str' and 'str'

Or multiply 2 lists?

>>> [1,2]*[3,3]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't multiply sequence by non-int of type 'list'

Trying to do a text operation like re.sub on a NON-text object like a doc file
is the same.

Yes, Python may not be intelligent enough to give you such useful error messages
outside its territory, i.e. on the contents of random files; logically, however,
it's the same -- an impossible operation.


The options you have:
1. Use doc-specific tools, e.g. MS/Libre Office, to work on doc files, i.e. don't use Python
2. Follow Tim Golden's suggestion, i.e. use win32com, which is a doc-talking
Python API [BTW Thanks Tim for showing how easy it is]
3. Get out of the doc format to txt (export as plain txt) and then try what you
are trying on the txt
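
Option 3 might look roughly like this once the document has been exported and read in as text. The function name and the sample string are made up for illustration; the code points are the ones listed in the original post, and reading the exported file with the right encoding is assumed to have happened already.

```python
# Map each dash-like code point from the original post to an ASCII hyphen.
DASHES = {
    0x2012: '-',  # FIGURE DASH
    0x2013: '-',  # EN DASH
    0x2014: '-',  # EM DASH
    0x2015: '-',  # HORIZONTAL BAR
    0x2E3A: '-',  # TWO-EM DASH
    0x2E3B: '-',  # THREE-EM DASH
    0x00AD: '-',  # SOFT HYPHEN
}

def normalise_dashes(text):
    """Return `text` with the dash variants above replaced by '-'."""
    return text.translate(DASHES)

# Hypothetical exported text.
exported = 'Part 7\u2013A \u2014 see Part 9\u2012B'
cleaned = normalise_dashes(exported)
```

Because this works on decoded text rather than raw bytes, the question of how Word stores the characters on disk no longer matters.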



#71443

From: Dave Angel <davea@davea.name>
Date: 2014-05-12 17:15 -0400
Message-ID: <mailman.9944.1399965443.18130.python-list@python.org>
In reply to: #71396
On 05/12/2014 01:35 PM, scottcabit@gmail.com wrote:
> On Friday, May 9, 2014 8:12:57 PM UTC-4, Steven D'Aprano wrote:
>
>> Good:
>>
>>
>>
>>      # Untested
>>
>>      fStr = re.sub(b'&#x(201[2-5])|(2E3[AB])|(00[2A]D)', b'-', fStr)
>
>    Still doesn't work.
>
>    Guess whatever the code is for endash and mdash are not the ones I am using....
>

More likely, your MSWord document isn't a simple text file.  Some 
encodings don't resemble ASCII or Unicode in the least.

-- 
DaveA



#71489

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2014-05-13 13:49 +0000
Message-ID: <537222d8$0$29980$c3e8da3$5496439d@news.astraweb.com>
In reply to: #71396
On Mon, 12 May 2014 10:35:53 -0700, scottcabit wrote:

> On Friday, May 9, 2014 8:12:57 PM UTC-4, Steven D'Aprano wrote:
> 
>> Good:
>> 
>> 
>> 
>>     fStr = re.sub(b'&#x2012', b'-', fStr)
>> 
>> 
>   Doesn't work...the document has been verified to contain endash and
>   emdash characters, but this does NOT replace them.

You may have missed my follow up post, where I said I had not noticed you 
were operating on a binary .doc file.

The text content of your doc file might look like:

   This – is an n-dash.


when viewed in Microsoft Word, but that is not the contents on disk. 
Word .doc files are a proprietary, secret binary format. Apart from the 
rest of the document structure and metadata, the text itself could be 
stored any old way. We don't know how. Microsoft surely knows how it is 
stored, but are unlikely to tell. A few open source projects like 
OpenOffice, LibreOffice and Abiword have reverse-engineered the file 
format. Taking a wild guess, I think it could be something like:

    This \xe2\x80\x93 is an n-dash.

or possibly:

    \x00T\x00h\x00i\x00s\x00  \x13\x00 \x00i\x00s\x00 \x00a
    \x00n\x00 \x00n\x00-\x00d\x00a\x00s\x00h\x00.

or:

    This {EN DASH} is an n-dash.

or:

    x\x9c\x0b\xc9\xc8,V\xa8v\xf5Spq\x0c\xf6\xa8U\x00r\x12
    \xf3\x14\xf2tS\x12\x8b3\xf4\x00\x82^\x08\xf8


(that last one is the text passed through the zlib compressor), but 
really I'm just making up vaguely conceivable possibilities.

If you're not willing or able to use a full-blown doc parser, say by 
controlling Word or LibreOffice, the other alternative is to do something 
quick and dirty that might work most of the time. Open a doc file, or 
multiple doc files, in a hex editor and *hopefully* you will be able to 
see chunks of human-readable text where you can identify how en-dashes 
and similar are stored.
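
If a hex editor is not to hand, a few lines of Python can show the bytes surrounding a suspected sequence. This is a sketch: the helper name is made up, and the sample assumes the UTF-8 guess from earlier in this post purely for illustration.

```python
def hex_context(blob, needle, width=8):
    """Yield (offset, hex string) for the bytes around each occurrence of needle."""
    start = blob.find(needle)
    while start != -1:
        lo = max(0, start - width)
        hi = min(len(blob), start + len(needle) + width)
        yield start, blob[lo:hi].hex(' ')   # bytes.hex(sep) needs Python 3.8+
        start = blob.find(needle, start + 1)

# Hypothetical sample using the UTF-8 form of an en dash.
sample = b'This \xe2\x80\x93 is an n-dash.'
hits = list(hex_context(sample, b'\xe2\x80\x93', width=2))
```

Each hit gives the offset and a short dump that can be compared against readable text nearby.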



-- 
Steven D'Aprano



#71492

From: Chris Angelico <rosuav@gmail.com>
Date: 2014-05-13 23:55 +1000
Message-ID: <mailman.9968.1399989361.18130.python-list@python.org>
In reply to: #71489
On Tue, May 13, 2014 at 11:49 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
>
>     This {EN DASH} is an n-dash.
>
> or:
>
>     x\x9c\x0b\xc9\xc8,V\xa8v\xf5Spq\x0c\xf6\xa8U\x00r\x12
>     \xf3\x14\xf2tS\x12\x8b3\xf4\x00\x82^\x08\xf8
>
>
> (that last one is the text passed through the zlib compressor)

I had to decompress that just to see what "text" you passed through
zlib, given that zlib is a *byte* compressor :) Turns out it's the
braced notation given above, encoded as ASCII/UTF-8.
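
The bytes-versus-text distinction can be seen in a short round trip. This sketch starts again from the braced text rather than from the quoted bytes, so it shows the principle only: text has to be encoded before compression and decoded after decompression.

```python
import zlib

text = 'This {EN DASH} is an n-dash.'

# zlib works on bytes, so the text is encoded first.
packed = zlib.compress(text.encode('ascii'))

# Decompressing gives bytes back; decoding recovers the text.
restored = zlib.decompress(packed).decode('ascii')
```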

ChrisA



#71507

From: scottcabit@gmail.com
Date: 2014-05-13 12:01 -0700
Message-ID: <63051425-ec42-45b4-8a9e-53001625f32a@googlegroups.com>
In reply to: #71489
On Tuesday, May 13, 2014 9:49:12 AM UTC-4, Steven D'Aprano wrote:
> 
> You may have missed my follow up post, where I said I had not noticed you 
> were operating on a binary .doc file.
> 
> If you're not willing or able to use a full-blown doc parser, say by 
> controlling Word or LibreOffice, the other alternative is to do something 
> quick and dirty that might work most of the time. Open a doc file, or 
> multiple doc files, in a hex editor and *hopefully* you will be able to 
> see chunks of human-readable text where you can identify how en-dashes 
> and similar are stored.

  I created a .doc file and opened it with UltraEdit in binary (Hex) mode. What I see is that there are two characters, one for ndash and one for mdash, each a single byte long. 0x96 and 0x97.
  So I tried this: fStr = re.sub(b'\0x96',b'-',fStr)

  that did nothing in my file. So I tried this: fStr = re.sub(b'0x97',b'-',fStr)

  which also did nothing.
  So, for fun I also tried to just put these wildcards in my re.findall so I added |Part \0x96|Part \0x97    to no avail.

  Obviously 0x96 and 0x97 are NOT being interpreted in a re.findall or re.sub as hex byte values of 96 and 97 hexadecimal using my current syntax.

  So here's my question...if I want to replace all ndash  or mdash values with regular '-' symbols using re.sub, what is the proper syntax to do so?

  Thanks!



#71511

From: MRAB <python@mrabarnett.plus.com>
Date: 2014-05-13 21:26 +0100
Message-ID: <mailman.9980.1400012820.18130.python-list@python.org>
In reply to: #71507
On 2014-05-13 20:01, scottcabit@gmail.com wrote:
> On Tuesday, May 13, 2014 9:49:12 AM UTC-4, Steven D'Aprano wrote:
>>
>> You may have missed my follow up post, where I said I had not noticed you
>> were operating on a binary .doc file.
>>
>> If you're not willing or able to use a full-blown doc parser, say by
>> controlling Word or LibreOffice, the other alternative is to do something
>> quick and dirty that might work most of the time. Open a doc file, or
>> multiple doc files, in a hex editor and *hopefully* you will be able to
>> see chunks of human-readable text where you can identify how en-dashes
>> and similar are stored.
>
>    I created a .doc file and opened it with UltraEdit in binary (Hex) mode. What I see is that there are two characters, one for ndash and one for mdash, each a single byte long. 0x96 and 0x97.
>    So I tried this: fStr = re.sub(b'\0x96',b'-',fStr)
>
>    that did nothing in my file. So I tried this: fStr = re.sub(b'0x97',b'-',fStr)
>
>    which also did nothing.
>    So, for fun I also tried to just put these wildcards in my re.findall so I added |Part \0x96|Part \0x97    to no avail.
>
>    Obviously 0x96 and 0x97 are NOT being interpreted in a re.findall or re.sub as hex byte values of 96 and 97 hexadecimal using my current syntax.
>
>    So here's my question...if I want to replace all ndash  or mdash values with regular '-' symbols using re.sub, what is the proper syntax to do so?
>
>    Thanks!
>
0x96 is a hexadecimal literal for an int. Within a string you need \x96
(it's \x for 2 hex digits, \u for 4 hex digits, \U for 8 hex digits).
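
A short sketch of what each spelling actually contains, followed by the replacement done with the correct escape. The sample data is made up; the byte values 0x96 and 0x97 are the ones reported from the hex editor.

```python
import re

wrong_1 = b'\0x96'   # a NUL byte followed by the characters 'x', '9', '6'
wrong_2 = b'0x97'    # the four ASCII characters '0', 'x', '9', '7'
right = b'\x96'      # a single byte with the value 0x96

# Hypothetical sample data containing both dash bytes.
data = b'Part 7\x96A \x97 rev 2'

one_dash = data.replace(right, b'-')            # plain replacement of one byte value
both_dashes = re.sub(b'[\x96\x97]', b'-', data)  # character class covers both bytes
```

For a single fixed byte, bytes.replace is enough; the character class is only worthwhile when several values are handled at once.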



#71533

From: wxjmfauth@gmail.com
Date: 2014-05-13 23:12 -0700
Message-ID: <220e9313-4887-446f-bf30-81972dfe2c2e@googlegroups.com>
In reply to: #71511
On Tuesday, 13 May 2014 22:26:51 UTC+2, MRAB wrote:
> On 2014-05-13 20:01, scottcabit@gmail.com wrote:
> > On Tuesday, May 13, 2014 9:49:12 AM UTC-4, Steven D'Aprano wrote:
> >>
> >> You may have missed my follow up post, where I said I had not noticed you
> >> were operating on a binary .doc file.
> >>
> >> If you're not willing or able to use a full-blown doc parser, say by
> >> controlling Word or LibreOffice, the other alternative is to do something
> >> quick and dirty that might work most of the time. Open a doc file, or
> >> multiple doc files, in a hex editor and *hopefully* you will be able to
> >> see chunks of human-readable text where you can identify how en-dashes
> >> and similar are stored.
> >
> >    I created a .doc file and opened it with UltraEdit in binary (Hex) mode. What I see is that there are two characters, one for ndash and one for mdash, each a single byte long. 0x96 and 0x97.
> >    So I tried this: fStr = re.sub(b'\0x96',b'-',fStr)
> >
> >    that did nothing in my file. So I tried this: fStr = re.sub(b'0x97',b'-',fStr)
> >
> >    which also did nothing.
> >    So, for fun I also tried to just put these wildcards in my re.findall so I added |Part \0x96|Part \0x97    to no avail.
> >
> >    Obviously 0x96 and 0x97 are NOT being interpreted in a re.findall or re.sub as hex byte values of 96 and 97 hexadecimal using my current syntax.
> >
> >    So here's my question...if I want to replace all ndash  or mdash values with regular '-' symbols using re.sub, what is the proper syntax to do so?
> >
> >    Thanks!
> >
> 0x96 is a hexadecimal literal for an int. Within a string you need \x96
> (it's \x for 2 hex digits, \u for 4 hex digits, \U for 8 hex digits).


----------------

>>> b'0x61' == b'0x61'
True
>>> b'0x96' == b'\x96'
False


- Python and the coding of characters is an unbelievable
mess.
- Unicode is a joke.
- I can make Python fail with any valid sequence of
chars I wish.
- There is a difference between "look, my code works with
my chars" and "this code is safely working with any chars".

jmf



#71558

From: alister <alister.nospam.ware@ntlworld.com>
Date: 2014-05-14 13:21 +0000
Message-ID: <p5Kcv.77902$dT1.7579@fx12.am4>
In reply to: #71533
On Tue, 13 May 2014 23:12:40 -0700, wxjmfauth wrote:

> On Tuesday, 13 May 2014 22:26:51 UTC+2, MRAB wrote:
>> On 2014-05-13 20:01, scottcabit@gmail.com wrote:
>> 
>> > On Tuesday, May 13, 2014 9:49:12 AM UTC-4, Steven D'Aprano wrote:
>> 
>> 
>> >>
>> >> You may have missed my follow up post, where I said I had not
>> >> noticed you
>> 
>> >> were operating on a binary .doc file.
>> 
>> 
>> >>
>> >> If you're not willing or able to use a full-blown doc parser, say by
>> 
>> >> controlling Word or LibreOffice, the other alternative is to do
>> >> something
>> 
>> >> quick and dirty that might work most of the time. Open a doc file,
>> >> or
>> 
>> >> multiple doc files, in a hex editor and *hopefully* you will be able
>> >> to
>> 
>> >> see chunks of human-readable text where you can identify how
>> >> en-dashes
>> 
>> >> and similar are stored.
>> 
>> 
>> >
>> >    I created a .doc file and opened it with UltraEdit in binary (Hex)
>> >    mode. What I see is that there are two characters, one for ndash
>> >    and one for mdash, each a single byte long. 0x96 and 0x97.
>> 
>> >    So I tried this: fStr = re.sub(b'\0x96',b'-',fStr)
>> 
>> 
>> >
>> >    that did nothing in my file. So I tried this: fStr =
>> >    re.sub(b'0x97',b'-',fStr)
>> 
>> 
>> >
>> >    which also did nothing.
>> 
>> >    So, for fun I also tried to just put these wildcards in my
>> >    re.findall so I added |Part \0x96|Part \0x97    to no avail.
>> 
>> 
>> >
>> >    Obviously 0x96 and 0x97 are NOT being interpreted in a re.findall
>> >    or re.sub as hex byte values of 96 and 97 hexadecimal using my
>> >    current syntax.
>> 
>> 
>> >
>> >    So here's my question...if I want to replace all ndash  or mdash
>> >    values with regular '-' symbols using re.sub, what is the proper
>> >    syntax to do so?
>> 
>> 
>> >
>> >    Thanks!
>> 
>> 
>> >
>> 0x96 is a hexadecimal literal for an int. Within a string you need \x96
>> 
>> (it's \x for 2 hex digits, \u for 4 hex digits, \U for 8 hex digits).
> 
> 
> ----------------
> 
>>>> b'0x61' == b'0x61'
> True
>>>> b'0x96' == b'\x96'
> False
> 
> 
> - Python and the coding of characters is an unbelievable mess.
> - Unicode a joke.
> - I can make Python failing with any valid sequence of chars I wish.
> - There is a difference between "look, my code work with my chars" and
> "this code is safely working with any chars".
> 
> jmf


0x96 is not valid ASCII, nor is it a valid Unicode character in any
encoding scheme I am familiar with.
It is therefore no surprise that Python refuses to encode it.
It looks like this file is in ANSI - ISO-8859-1.

Regular expressions are probably overkill for this issue;
loop through the byte array & replace the bytes as needed.
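
As a sketch of the byte-level approach: in Windows-1252, the code page often loosely called "ANSI" on Windows, bytes 0x96 and 0x97 are the en dash and em dash, whereas strict ISO-8859-1 assigns control codes to those values. The sample bytes are made up, and whether a given .doc file really stores its text this way is an assumption.

```python
# Hypothetical sample data with both dash bytes.
raw = b'Part 7\x96A \x97 rev 2'

# Decoding as Windows-1252 turns the two bytes into real dash characters.
text = raw.decode('cp1252')

# Staying with bytes: a translation table maps both dash bytes to '-'.
table = bytes.maketrans(b'\x96\x97', b'--')
plain = raw.translate(table)
```

bytes.translate does the "loop through the byte array" in one call, without involving the re module.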

-- 
Under deadline pressure for the next week.  If you want something, it can 
wait.
Unless it's blind screaming paroxysmally hedonistic...



#71563

From: scottcabit@gmail.com
Date: 2014-05-14 07:40 -0700
Message-ID: <c5d1a38a-21d2-40e9-bc63-46c29a3576ad@googlegroups.com>
In reply to: #71511
On Tuesday, May 13, 2014 4:26:51 PM UTC-4, MRAB wrote:
> 
> 0x96 is a hexadecimal literal for an int. Within a string you need \x96
> 
> (it's \x for 2 hex digits, \u for 4 hex digits, \U for 8 hex digits).

  Yes, that was my problem. Figured it out just after posting my last message. using \x96 works correctly. Thanks!



#71222

From: Rustom Mody <rustompmody@gmail.com>
Date: 2014-05-09 21:22 -0700
Message-ID: <9a340516-5720-4341-8089-8bfab978287f@googlegroups.com>
In reply to: #71185
On Saturday, May 10, 2014 1:21:04 AM UTC+5:30, scott...@gmail.com wrote:
> Hi,
> 
> 
> 
>  here is a snippet of code that opens a file (fn contains the path\name) and first tried to replace all endash, emdash etc characters with simple dash characters, before doing a search.
> 
>   But the replaces are not having any effect. Obviously a syntax problem....wwhat silly thing am I doing wrong?

If you are using MS-Word use that, not python.

Yeah it is possible to script MS with something like this
http://timgolden.me.uk/pywin32-docs/
[no experience myself!]
but it's probably not worth the headache for such a simple job.

The VBA (or whatever is the modern equivalent) will be about as short and simple
as your attempted python and making it work will be far easier.

The way I used to do it with Windows-98 Word:
Start a macro
Do a simple single search and replace by hand
Close the macro
Edit the macro (VBA version)
Replace the single search-n-replace with all the many you require



#71227

From: wxjmfauth@gmail.com
Date: 2014-05-10 00:11 -0700
Message-ID: <a833e5dd-064b-4f3a-a143-6941ec208abe@googlegroups.com>
In reply to: #71222
On Saturday, 10 May 2014 06:22:00 UTC+2, Rustom Mody wrote:
> On Saturday, May 10, 2014 1:21:04 AM UTC+5:30, scott...@gmail.com wrote:
> > Hi,
> >
> >  here is a snippet of code that opens a file (fn contains the path\name) and first tried to replace all endash, emdash etc characters with simple dash characters, before doing a search.
> >
> >   But the replaces are not having any effect. Obviously a syntax problem....wwhat silly thing am I doing wrong?
>
> If you are using MS-Word use that, not python.
>
> Yeah it is possible to script MS with something like this
> http://timgolden.me.uk/pywin32-docs/
> [no experience myself!]
> but its probably not worth the headache for such a simple job.
>
> The VBA (or whatever is the modern equivalent) will be about as short and simple
> as your attempted python and making it work will be far easier.
>
> I way I used to do it with Windows-98 Word.
> Start a macro
> Do a simple single search and replace by hand
> Close the macro
> Edit the macro (VBA version)
> Replace the single search-n-replace with all the many you require

=========

That's a wise recommendation.

Anyway, as Python may fail as soon as one uses an
EN DASH or an EM DASH, I think it's not worth the
effort to spend too much time with it.

LibreOffice could be a solution.

jmf


