NNTP-Posting-Date: Sun, 28 Aug 2011 01:25:05 -0500
From: ken <ken@spamcop.net>
Newsgroups: comp.lang.postscript
Subject: Re: Re (2): Subject: techniques of extracting the original ASCII?
Date: Sun, 28 Aug 2011 07:24:55 +0100
Message-ID: <MPG.28c3ef44d4c061cb98985d@usenet.plus.net>
References: <MPG.28b19188fb469bc0989859@usenet.plus.net> <j3b0dm$qma$1@dont-email.me>
Reply-To: ken@spamcop.net
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
User-Agent: MicroPlanet-Gravity/3.0.4
Lines: 66

In article <j3b0dm$qma$1@dont-email.me>, no.top.post@gmail.com says...

> How many of these <f-something> ligatures are there?

In Latin languages, I think ff, ffi, ffl, are the most common, but see:

http://en.wikipedia.org/wiki/Typographic_ligature

> Do they have a common ID, for the different fonts?

I'm not sure what you mean by a common ID. The glyphs are named things
like '/ffi'.

> Can they be easily 'extracted' from a *.pdf?

The same as any other text, yes.

> Can the renderer be modified to do:
> IF <fi-ligature> THEN put("f); put("i") ?

NO. Also, why would you care about rendering?

> > You could try using the new experimental 'txtwrite' device in the latest
> > version of Ghostscript (9.04), which will produce UTF-16 (NOT ASCII)
> > output from a file. I plan to add UTF-8 later, which would be ASCII 
> > output if the input is ASCII. I'm not planning to add ligature 
> > conversion but you could do it yourself easily enough.
> >  
> Isn't ASCII to UTF-16 a one-to-one-mapping?

The content of the PDF file may have characters encoded in some fashion
other than ASCII, and almost certainly not UTF-16. Why would it be a
one-way map? If I know it's ASCII, then I can convert it to something
else (e.g. UTF-16 or UTF-8) and vice versa.
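
As a rough illustration of that round trip, here is a minimal Python sketch.
The function names are made up for this example, and it assumes the input
really is pure ASCII; anything else raises an error rather than being
silently converted.

```python
def ascii_to_utf16(data: bytes) -> bytes:
    """Re-encode ASCII bytes as UTF-16 (little-endian, no BOM)."""
    # decode() raises UnicodeDecodeError if the input is not pure ASCII
    return data.decode("ascii").encode("utf-16-le")


def utf16_to_ascii(data: bytes) -> bytes:
    """Convert UTF-16LE back to ASCII; fails on non-ASCII characters."""
    return data.decode("utf-16-le").encode("ascii")


original = b"office"
wide = ascii_to_utf16(original)

# Each ASCII character becomes two bytes in UTF-16.
assert len(wide) == 2 * len(original)

# The conversion is reversible, so it is not a one-way map.
assert utf16_to_ascii(wide) == original

# UTF-8 output of ASCII input is byte-for-byte the same as the input.
assert original.decode("ascii").encode("utf-8") == original
```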

> Does "ligature conversion" mean eg. converting glyph(fi)
> to chars("fi")

Yes, exactly.
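
For what it's worth, a minimal sketch of doing that yourself on extracted
text, in Python. It assumes the text has already been decoded to Unicode
and that the ligatures show up as the Unicode presentation forms U+FB00 to
U+FB04; a particular PDF may encode them differently. The table and the
function name are only illustrative.

```python
# Unicode Latin f-ligature presentation forms and their plain-letter
# expansions. Only the five common f-ligatures are covered here.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# str.maketrans accepts a dict of single characters to replacement strings.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace each ligature character with its component letters."""
    return text.translate(_TABLE)


print(expand_ligatures("o\ufb03ce \ufb01le"))  # prints "office file"
```

Text without any of these characters passes through unchanged.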

> and if so why don't the converters do it?

Because it's not actually the same thing.

> Even Win7's adobe <pdf to speech> can't handle "ff".

Well, speech output is a little different.

 
> Where's the basic *.pdf renderer algorithm explained?

Everything about PDF is explained in the PDF Reference Manual. It's not
(IMO) as good a document as the PostScript Language Reference Manual, 
but it isn't too bad. One of the biggest problems is that Adobe Acrobat 
doesn't actually stick to it, and will open many files which are 
technically illegal.

There are few details of rendering, because (with the exception of
things like pixel coverage) it doesn't matter how you render it; this is
left up to the rasteriser. If you really want to know more about
rendering graphical objects in PDF, then you should also read the 
PostScript Language Reference Manual, which has more details.


				Ken