Re: Re (2): Subject: techniques of extracting the original ASCII?

From	ken <ken@spamcop.net>
Newsgroups	comp.lang.postscript
Subject	Re: Re (2): Subject: techniques of extracting the original ASCII?
Date	2011-08-28 07:24 +0100
Message-ID	<MPG.28c3ef44d4c061cb98985d@usenet.plus.net>
References	<MPG.28b19188fb469bc0989859@usenet.plus.net> <j3b0dm$qma$1@dont-email.me>


In article <j3b0dm$qma$1@dont-email.me>, no.top.post@gmail.com says...

> How many of these <f-something> ligatures are there?

In Latin languages, I think ff, ffi, ffl, are the most common, but see:

http://en.wikipedia.org/wiki/Typographic_ligature

> Do they have a common ID, for the different fonts?

I'm not sure what you mean by a common ID. The glyphs are named things 
like '/ffi'.
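
As a rough illustration of such names: the sketch below maps the f-ligature glyph names used in the Adobe Glyph List to their Unicode code points. The dictionary and the helper `glyph_to_unicode` are hypothetical names made up for this example, and a font is free to use other glyph names, so treat this as an assumption rather than a rule.

```python
# f-ligature glyph names (as in the Adobe Glyph List) and their Unicode
# code points. A particular font may name its glyphs differently.
LIGATURE_GLYPHS = {
    "ff": "\ufb00",
    "fi": "\ufb01",
    "fl": "\ufb02",
    "ffi": "\ufb03",
    "ffl": "\ufb04",
}


def glyph_to_unicode(name):
    """Return the ligature character for a glyph name like '/ffi', else None."""
    # Strip the leading slash of the PostScript name syntax before lookup.
    return LIGATURE_GLYPHS.get(name.lstrip("/"))
```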

> Can they be easily 'extracted' from a *.pdf?

The same as any other text, yes.

> Can the renderer be modified to do:
> IF <fi-ligature> THEN put("f); put("i") ?

NO. Also, why would you care about rendering?

> > You could try using the new experimental 'txtwrite' device in the latest 
> > version of Ghostscript (9.04), which will produce UTF-16 (NOT ASCII) 
> > output from a file. I plan to add UTF-8 later, which would be ASCII 
> > output if the input is ASCII. I'm not planning to add ligature 
> > conversion but you could do it yourself easily enough.
> >  
> Isn't ASCII to UTF-16 a one-to-one-mapping?

The content of the PDF file may have characters encoded in some fashion 
other than ASCII, and almost certainly not UTF-16. Why would it be a 
one-way map? If I know it's ASCII, then I can convert it to something 
else (e.g. UTF-16 or UTF-8) and vice versa.
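
A minimal sketch of that round trip, assuming the input bytes really are pure ASCII (the function name `reencode_ascii` is made up for this example):

```python
def reencode_ascii(data):
    """Re-encode ASCII bytes as UTF-8 and as big-endian UTF-16."""
    # decode() raises UnicodeDecodeError if any byte is outside ASCII.
    text = data.decode("ascii")
    return text.encode("utf-8"), text.encode("utf-16-be")


utf8, utf16 = reencode_ascii(b"office")

# For ASCII input the UTF-8 bytes are identical to the input bytes.
assert utf8 == b"office"

# UTF-16 uses two bytes per character here, and converts back losslessly.
assert len(utf16) == 2 * len(b"office")
assert utf16.decode("utf-16-be").encode("ascii") == b"office"
```

This only holds once the text is known to be ASCII; it says nothing about how the characters were encoded inside the PDF file.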

> Does "ligature conversion" mean eg. converting glyph(fi)
> to chars("fi")

Yes, exactly.
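
A sketch of doing that yourself on already-extracted text, assuming the ligatures arrive as the Unicode code points U+FB00 to U+FB04 (an extractor may hand you something else). The name `expand_ligatures` is invented for this example.

```python
# Translation table from single ligature code points to plain letters.
_LIGATURE_TABLE = str.maketrans({
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
})


def expand_ligatures(text):
    """Replace f-ligature code points with their component letters."""
    return text.translate(_LIGATURE_TABLE)
```

For example, `expand_ligatures("o\ufb03ce")` gives `"office"`. `unicodedata.normalize("NFKC", text)` would also expand these, but it rewrites many other compatibility characters as well, which may not be wanted.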

> and if so why don't the converters do it?

Because it's not actually the same thing.

> Even Win7's adobe <pdf to speech> can't handle "ff".

Well speech output is a little different.

 
> Where's the basic *.pdf renderer algorithm explained?

Everything about PDF is explained in the PDF Reference Manual. It's not 
(IMO) as good a document as the PostScript Language Reference Manual, 
but it isn't too bad. One of the biggest problems is that Adobe Acrobat 
doesn't actually stick to it, and will open many files which are 
technically illegal.

There are few details of rendering, because (with the exception of 
things like pixel coverage) it doesn't matter how you render it; this is 
left up to the rasteriser. If you really want to know more about 
rendering graphical objects in PDF, then you should also read the 
PostScript Language Reference Manual, which has more details.


				Ken


Thread

Subject: techniques of extracting the original ASCII? "NoHtmlMailsPlease" <UsePlainText@dog.edu> - 2011-08-12 20:59 +0200
  Re: Subject: techniques of extracting the original ASCII? ken <ken@spamcop.net> - 2011-08-14 09:03 +0100
    Re (2): Subject: techniques of extracting the original ASCII? no.top.post@gmail.com - 2011-08-27 14:53 +0000
      Re: Re (2): Subject: techniques of extracting the original ASCII? ken <ken@spamcop.net> - 2011-08-28 07:24 +0100
  Re: Subject: techniques of extracting the original ASCII? bugbear <bugbear@trim_papermule.co.uk_trim> - 2011-08-15 10:09 +0100
  Re: Subject: techniques of extracting the original ASCII? John Reiser <jreiserfl@comcast.net> - 2011-08-15 06:10 -0700
