
Re: [mutool] Save images as independent files + manage paragraphs?

From ken <ken@spamcop.net>
Newsgroups comp.lang.postscript
Subject Re: [mutool] Save images as independent files + manage paragraphs?
Date 2020-04-24 09:13 +0100
Message-ID <MPG.390caff3cf3edb419898b0@usenet.plus.net>
References <40460c51-4b58-4219-8331-d1e55fb55efd@googlegroups.com>


In article <40460c51-4b58-4219-8331-d1e55fb55efd@googlegroups.com>, 
frdtheman@gmail.com says...

> According to Artifex*, this newsgroup is one of the ways to ask
> questions.

It is, but essentially for Ghostscript (which is a PostScript 
interpreter) rather than MuPDF. You may find you get answers more 
quickly (and indeed better-informed ones) by using IRC and joining the 
#mupdf channel on freenode.net.


> By default*, "mutool draw" saves pictures within the HTML files as
> base64, and breaks paragraphs into independent lines with <p>...</p>.
> 
> mutool draw -F html -o out.%d.html in.pdf
> 
> I was wondering if there were a way to...
> 1. Have it keep paragraphs together

OK, you may need to do some more research on the structure of a PDF file. 
I'm assuming you are more familiar with HTML than PDF, and it may come 
as a surprise to you to discover that PDF does not have the same kind of 
metadata that an HTML file would.

This is especially true of text: there is no concept of text structure 
in a PDF file at all. No lines, no paragraphs, no sentences, nothing. 
All there is in a PDF file is 'this text' and 'put it here on the page'.

The encoding used for the text may even be custom, and there may be no 
possible method (other than OCR) for determining the actual text content 
(e.g. the Unicode values).

Sentences don't even have to be contiguous. I could (and PDF files 
sometimes do) write "The quick brown" at the top left of the page, then 
drop to the bottom of the page and write "Copyright mother goose", then 
jump back up to the top of the page, further along to the right, and 
write "jumped over the lazy dog". Then I could move back to the left, 
into the gap between the two existing pieces of text at the top, and 
write "fox".
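
To make that concrete, here is a minimal Python sketch of the example 
above. The fragment list, the coordinates and both function names are 
made up for illustration; nothing here is MuPDF output or MuPDF API. It 
only shows that the order of text in the file and the order on the page 
can differ, and that sorting by position recovers the latter. It assumes 
y grows downward and that fragments on one line share (nearly) the same 
y value.

```python
# Hypothetical fragments as (x, y, text), listed in the order they
# appear in the file. Coordinates are invented; y grows downward.
fragments = [
    (50, 40, "The quick brown"),
    (50, 780, "Copyright mother goose"),
    (220, 40, "jumped over the lazy dog"),
    (170, 40, "fox"),
]


def file_order(frags):
    """Join the text exactly as it is stored, ignoring position."""
    return " ".join(text for _, _, text in frags)


def reading_order(frags, line_tolerance=2.0):
    """Group fragments into lines by y, then sort each line by x."""
    lines = []  # each entry is [line_y, [fragments on that line]]
    for frag in sorted(frags, key=lambda f: f[1]):
        if lines and abs(frag[1] - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag[1], [frag]])
    # Tuples sort by x first, which gives left-to-right order.
    return [" ".join(t for _, _, t in sorted(group)) for _, group in lines]


print(file_order(fragments))
# The quick brown Copyright mother goose jumped over the lazy dog fox
print(reading_order(fragments))
# ['The quick brown fox jumped over the lazy dog', 'Copyright mother goose']
```

Real pages are messier than this (rotated text, multiple columns, 
slightly uneven baselines), so treat the tolerance as something to tune 
rather than a known-good value.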

So that's why you don't get the paragraphs you expect: there aren't any 
to start with. So by inference no, you can't have MuPDF keep paragraphs 
together.

If you just look at the text and the order it appears in the PDF file, 
it won't reliably tell you much. There is positional information 
available for the text though, so you can post-process the extracted 
text and apply your own heuristics to try and decide where paragraphs, 
columns, tables, etc. are.
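
As an illustration of the kind of heuristic I mean, here is a rough 
Python sketch that guesses paragraph breaks from vertical spacing. It 
assumes you already have each line's y position and text from some 
extraction step; the input format, the function name group_paragraphs 
and the 1.5 gap factor are all my own assumptions, not anything MuPDF 
defines.

```python
from statistics import median


def group_paragraphs(lines, gap_factor=1.5):
    """Guess paragraphs from line spacing.

    lines: list of (y, text) sorted top to bottom, y growing downward.
    A gap noticeably larger than the typical (median) line gap is
    treated as a paragraph break.
    """
    if not lines:
        return []
    gaps = [b[0] - a[0] for a, b in zip(lines, lines[1:])]
    typical = median(gaps) if gaps else 0
    paragraphs = [[lines[0][1]]]
    for gap, (_, text) in zip(gaps, lines[1:]):
        if gap > gap_factor * typical:
            paragraphs.append([text])  # unusually large gap: new paragraph
        else:
            paragraphs[-1].append(text)  # normal spacing: same paragraph
    return [" ".join(p) for p in paragraphs]


# Invented sample data: 12-unit line spacing with one 26-unit gap.
sample = [
    (100, "PDF has no"),
    (112, "paragraph markers,"),
    (124, "only positions."),
    (150, "A larger gap"),
    (162, "suggests a break."),
]
for paragraph in group_paragraphs(sample):
    print(paragraph)
```

This will misfire on documents that mark paragraphs by indentation 
instead of spacing, or that mix font sizes, so expect to combine 
several such rules for real files.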


> 2. Save pictures as external JPG/PNG files instead of including them in the HTML file.

No, currently there is no way to do that. Obviously the code could be 
altered so that the image data is written to a series of files, and 
links to those files inserted into the HTML in their place.

But it can't be done with the existing code by simply flipping a switch 
or something.
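
If changing the MuPDF source isn't an option, one workaround is to 
post-process the generated HTML yourself. The Python sketch below 
assumes the images appear as src="data:image/png;base64,..." or 
src="data:image/jpeg;base64,..." attributes; I haven't checked the exact 
markup mutool emits, so the regular expression may need adjusting. The 
function name externalize_images and the file naming scheme are my own 
invention.

```python
import base64
import re
from pathlib import Path

# Assumed markup: a double-quoted src attribute holding a base64 data URI.
DATA_URI = re.compile(r'src="data:image/(png|jpeg);base64,([^"]+)"')


def externalize_images(html, out_dir, prefix="img"):
    """Write each embedded image to out_dir and link to it instead.

    Returns the rewritten HTML. Links are bare file names, so the HTML
    is expected to be saved in out_dir alongside the images.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0

    def replace(match):
        nonlocal count
        count += 1
        ext = "jpg" if match.group(1) == "jpeg" else "png"
        name = f"{prefix}{count}.{ext}"
        # b64decode ignores embedded whitespace by default.
        (out_dir / name).write_bytes(base64.b64decode(match.group(2)))
        return f'src="{name}"'

    return DATA_URI.sub(replace, html)
```

You would read each out.N.html, pass its contents through this function 
with a per-page prefix so the file names don't collide, and write the 
result back out.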


Caveat: I am not one of the MuPDF developers. The information above 
regarding image data was provided to me by one of the developers, 
though; the text information is mine, so if it's wrong I can be blamed.


			Regards,

				Ken
