From: ken
Newsgroups: comp.lang.postscript
Subject: Re: Re (2): Subject: techniques of extracting the original ASCII?
Date: Sun, 28 Aug 2011 07:24:55 +0100
Reply-To: ken@spamcop.net

In article , no.top.post@gmail.com says...

> How many of these ligatures are there?

In Latin languages, I think ff, ffi and ffl are the most common, but see:

http://en.wikipedia.org/wiki/Typographic_ligature

> Do they have a common ID, for the different fonts?

I'm not sure what you mean by a common ID. The glyphs are named things like '/ffi'.

> Can they be easily 'extracted' from a *.pdf?

The same as any other text, yes.

> Can the renderer be modified to do:
> IF THEN put("f); put("i") ?

No. Also, why would you care about rendering?

> > You could try using the new experimental 'txtwrite' device in the latest
> > version of Ghostscript (9.04), which will produce UTF-16 (NOT ASCII)
> > output from a file. I plan to add UTF-8 later, which would be ASCII
> > output if the input is ASCII. I'm not planning to add ligature
> > conversion but you could do it yourself easily enough.
> >
> Isn't ASCII to UTF-16 a one-to-one-mapping?

The content of the PDF file may have characters encoded in some fashion other than ASCII, and is almost certainly not UTF-16.

Why would it be a one-way map? If I know it's ASCII, then I can convert it to something else (e.g. UTF-16 or UTF-8) and vice versa.

> Does "ligature conversion" mean eg. converting glyph(fi)
> to chars("fi")

Yes, exactly.

> and if so why don't the converters do it?

Because it's not actually the same thing.

> Even Win7's adobe can't handle "ff".

Well, speech output is a little different.

> Where's the basic *.pdf renderer algorithm explained?

Everything about PDF is explained in the PDF Reference Manual. It's not (IMO) as good a document as the PostScript Language Reference Manual, but it isn't too bad. One of the biggest problems is that Adobe Acrobat doesn't actually stick to it, and will open many files which are technically illegal.

There are few details of rendering because, with the exception of things like pixel coverage, it doesn't matter how you render it; this is left up to the rasteriser.

If you really want to know more about rendering graphical objects in PDF, then you should also read the PostScript Language Reference Manual, which has more details.

Ken
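
PS: on "you could do it yourself easily enough", here is a rough sketch in Python of what the ligature conversion might look like. It assumes the extracted text arrives as UTF-16 bytes carrying a byte order mark, and that any ligatures in it are the Unicode presentation-form code points U+FB00 to U+FB06; whether a given PDF actually yields those code points depends on its fonts and encodings. The names expand_ligatures and utf16_to_ascii are made up for illustration and are not part of Ghostscript.

```python
# Latin ligatures from the Unicode "Alphabetic Presentation Forms" block,
# mapped to the plain letters they stand for.
LIGATURES = {
    "\uFB00": "ff",
    "\uFB01": "fi",
    "\uFB02": "fl",
    "\uFB03": "ffi",
    "\uFB04": "ffl",
    "\uFB05": "st",  # long s + t
    "\uFB06": "st",
}

# str.translate wants a table keyed by code point; maketrans builds it.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace each ligature character with its component letters."""
    return text.translate(_TABLE)


def utf16_to_ascii(data: bytes) -> str:
    """Decode UTF-16 bytes, expand ligatures, and force the result to ASCII.

    Anything still outside ASCII after expansion becomes '?', so this
    direction is lossy unless the text was ASCII plus ligatures to begin with.
    """
    text = data.decode("utf-16")  # the BOM selects the byte order
    text = expand_ligatures(text)
    return text.encode("ascii", errors="replace").decode("ascii")


if __name__ == "__main__":
    sample = "e\uFB03cient \uFB02ow".encode("utf-16")
    print(utf16_to_ascii(sample))
```

The explicit table is deliberate: a blanket compatibility normalisation would also rewrite characters that have nothing to do with ligatures, which is part of why this isn't "the same thing" as the original text.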