| From | ken <ken@spamcop.net> |
|---|---|
| Newsgroups | comp.lang.postscript |
| Subject | Re: [mutool] Save images as independent files + manage paragraphs? |
| Date | 2020-04-24 09:13 +0100 |
| Message-ID | <MPG.390caff3cf3edb419898b0@usenet.plus.net> |
| References | <40460c51-4b58-4219-8331-d1e55fb55efd@googlegroups.com> |
In article <40460c51-4b58-4219-8331-d1e55fb55efd@googlegroups.com>, frdtheman@gmail.com says...

> According to Artifex*, this newsgroup is one of the ways to ask questions.

It is, but essentially for Ghostscript (which is a PostScript interpreter) rather than MuPDF. You may find you get answers more quickly (and indeed better informed ones) by using IRC and joining the #mupdf channel on freenode.net.

> By default*, "mutool draw" saves pictures within the HTML files as base64, and breaks paragraphs into independent lines with <p>?</p>.
>
> mutool draw -F html -o out.%d.html in.pdf
>
> I was wondering if there were a way to?
> 1. Have it keep paragraphs together

OK, you may need to do some more research on the structure of a PDF file. I'm assuming you are more familiar with HTML than PDF, and it may come as a surprise to you to discover that PDF does not have the same kind of metadata that an HTML file would.

This is especially true with text. There is no concept of text structure in a PDF file at all: no lines, no paragraphs, no sentences, nothing. All there is in a PDF file is 'this text' and 'put it here on the page'. The encoding used for the text may even be custom, and there may be no possible method (other than OCR) for determining the actual text content (e.g. the Unicode values).

Sentences don't even have to be contiguous. I could (and PDF files sometimes do) write at the top left of the page "The quick brown", then drop to the bottom of the page and write "Copyright mother goose", then jump back up to the top of the page, but moved along to the right, and write "jumped over the lazy dog". Then move back to the left, between the two existing pieces of text at the top, and write "fox".

So that's why you don't get the paragraphs you expect: there aren't any to start with. So by inference no, you can't have MuPDF keep paragraphs together. If you just look at the text and the order it appears in the PDF file, it won't reliably tell you much.
There is positional information available for the text though, so you can post-process the extracted text and apply your own heuristics to try to decide where paragraphs, columns, tables etc. are.

> 2. Save pictures as external JPG/PNG files instead of including them in the HTML file.

No, currently there is no way to do that. Obviously the code could be altered so that the image data is written to a series of files, and links to those files inserted into the HTML in their place. But it can't be done with the existing code by simply flipping a switch or something.

Caveat: I am not one of the MuPDF developers. The information above regarding image data was provided to me by one of the developers, though; the text information is by me, so if it's wrong I can be blamed.

Regards,

Ken
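
As a rough illustration of the positional post-processing mentioned in the answer to question 1, here is a minimal Python sketch. It is not MuPDF code and does not use any MuPDF API: the `Fragment` type, its fields, and both thresholds (`y_tol`, `max_gap`) are hypothetical, and it assumes you have already extracted each piece of text together with its page coordinates (with `y` increasing down the page). It groups fragments into lines by vertical position, orders each line left to right, then merges lines into paragraphs when the vertical gap is small.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One run of text plus where it was painted (hypothetical structure)."""
    x: float   # left edge of the text
    y: float   # vertical position, increasing down the page
    text: str


def group_lines(frags, y_tol=2.0):
    """Collect fragments whose y positions are within y_tol into one line."""
    lines = []
    for f in sorted(frags, key=lambda f: (f.y, f.x)):
        if lines and abs(f.y - lines[-1][0].y) <= y_tol:
            lines[-1].append(f)
        else:
            lines.append([f])
    # Within a line, reading order is assumed to be left to right.
    return [sorted(line, key=lambda f: f.x) for line in lines]


def group_paragraphs(lines, max_gap=18.0):
    """Join consecutive lines unless the vertical gap exceeds max_gap."""
    paragraphs = []
    prev_y = None
    for line in lines:
        y = line[0].y
        text = " ".join(f.text for f in line)
        if prev_y is not None and y - prev_y <= max_gap:
            paragraphs[-1] += " " + text
        else:
            paragraphs.append(text)
        prev_y = y
    return paragraphs


# The fragments from the example above, in the order the file paints them.
frags = [
    Fragment(72, 100, "The quick brown"),
    Fragment(72, 700, "Copyright mother goose"),
    Fragment(260, 100, "jumped over the lazy dog"),
    Fragment(200, 100, "fox"),
]
paragraphs = group_paragraphs(group_lines(frags))
for p in paragraphs:
    print(p)
```

Fixed thresholds like these are only a starting point. Real documents with multiple columns, tables, varying font sizes or rotated text would need more careful heuristics, and none of this helps if the text encoding cannot be mapped to Unicode in the first place.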
[mutool] Save images as independent files + manage paragraphs? Heck Lennon <frdtheman@gmail.com> - 2020-04-23 15:38 -0700
Re: [mutool] Save images as independent files + manage paragraphs? luser droog <luser.droog@gmail.com> - 2020-04-23 21:32 -0700
Re: [mutool] Save images as independent files + manage paragraphs? ken <ken@spamcop.net> - 2020-04-24 09:13 +0100
Re: [mutool] Save images as independent files + manage paragraphs? Heck Lennon <frdtheman@gmail.com> - 2020-04-24 11:05 -0700
Re: [mutool] Save images as independent files + manage paragraphs? news@zzo38computer.org.invalid - 2020-04-24 22:50 -0700