| Date | 2011-08-27 00:21 +0100 |
|---|---|
| From | RedGrittyBrick <RedGrittyBrick@SpamWeary.invalid> |
| Newsgroups | comp.lang.postscript |
| Subject | Re: techniques of extracting the original ASCII? |
| References | <j38pmg$mvp$1@dont-email.me> |
| Message-ID | <p7udnZW0IYHrt8XTnZ2dnUVZ7tSdnZ2d@bt.com> |
On 26/08/2011 19:46, no.top.post@gmail.com wrote:
> I previously asked here:
> Why is ps/pdf quirky with "ff" ?
>
> and got the answers that "it's rendered by a glyph".
> Well of course,but WHY? Why then isn't "a"
> "rendered by a glyph"?
See http://en.wikipedia.org/wiki/Portable_Document_Format#Text
Letters and other symbols in PS and PDF files are often represented in
the file in much the same way as they are in a text file edited with
Notepad, except that the PS and PDF files specify the font to be used.
This requires that the font be found in the reader's operating system or
that the font (or a subset of it) be embedded in the document.
When a font is embedded in a document, there is no need for it to have
the "normal" encoding. For example, "A" is character 65 in ASCII and so
is often represented in files by a byte with a decimal value of 65.
However, the software that produced the PDF may have created a custom,
private encoding for the embedded font. This makes it harder for a
plain-text extraction program to know what character might be
represented by a particular byte value in a PDF.
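A minimal Python sketch of why a private encoding defeats naive extraction. The encoding below is invented for illustration (glyphs numbered in order of first use) and is not taken from any real PDF; `decode_with` is a hypothetical helper. A PDF may carry a ToUnicode map that plays the role of `custom_to_unicode`, but many files do not.

```python
# Hypothetical example: a producer embeds a font subset and numbers the
# glyphs in order of first use, so byte values no longer match ASCII.
text = "hat"

# Build the invented private encoding: first distinct character -> code 1, etc.
custom_from_char = {}
for ch in text:
    if ch not in custom_from_char:
        custom_from_char[ch] = len(custom_from_char) + 1

# Bytes as they might be stored in the file under the private encoding.
stored = bytes(custom_from_char[ch] for ch in text)

def decode_with(mapping, data):
    """Decode bytes using a code -> character table; unknown codes become '?'."""
    return "".join(mapping.get(b, "?") for b in data)

# A naive extractor that takes the bytes at face value gets control
# characters, not letters.
naive = stored.decode("latin-1")

# With the reverse table (the role a ToUnicode map plays when present),
# the original text is recoverable.
custom_to_unicode = {code: ch for ch, code in custom_from_char.items()}
recovered = decode_with(custom_to_unicode, stored)
```

Here `stored` holds the bytes 1, 2, 3 rather than 104, 97, 116, so only a program that has the reverse table can turn them back into "hat".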
An application that creates PDF files could also represent a specific
letter as a sequence of vector-graphics commands or as a bitmap image. A
document produced by an image scanner might naturally do the latter
unless the scanning software includes OCR capabilities.
> Since char("f") was originally entered by a keyboard as
> ASCII, why should *IT*, and not other chars be transformed?
Because it is part of a ligature. Programs that care about good
typography will make use of available ligatures to provide a more
readable and aesthetically pleasing result.
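When an extractor does hand back a ligature as a single Unicode character, it can be expanded into plain letters afterwards. A Python sketch, assuming the extracted text contains the Unicode ligature code points (such as U+FB00 for "ff"); if the font used a private encoding with no Unicode mapping, this step cannot help. `expand_ligatures` is a name made up for this example.

```python
import unicodedata

def expand_ligatures(s):
    """Replace compatibility characters such as U+FB00 (ff) with plain letters."""
    # NFKC applies Unicode compatibility decompositions, which turn the
    # Latin ligature code points into their component letters.
    return unicodedata.normalize("NFKC", s)

# Text as an extractor might return it: ff, ffi and fi are single characters.
extracted = "di\ufb00erent e\ufb03cient \ufb01le"
plain = expand_ligatures(extracted)
```

Note that NFKC also folds other compatibility characters (superscript digits, for instance), so it changes more than ligatures.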
> ------------
> I'm trying to absorb the contents of [230069 bytes]
> http://www.cogsci.rpi.edu/~rsun/sun.clarion2005.pdf
> which is frustrating since I can't extract any of the text to
> my own notes.
> I'm surprised that it's a: %PDF-1.2
> because it's a newish document.
> And what's not understandable, is that uless the original
> 'typed script' was given to a Chinese wood carver who
> treated each char as an individual piece of art, why can't
> linux-tools nor Win7-adobe extract the original text
> [except for that of one diagram] ?!
Perhaps because the text has not been encoded in the simplest way.
> And although close examination of the rendering does
> show that the 'same ascii-wise chars' DO have slightly
> difference appearances, if the commonality of eg. all
> char("N")s had not been factored-out, the file would
> be massively increased in size.
Perhaps.
> How do you solve this problem of not being able to get
> as ascii version of such 'texts' ?
Ask the author for a text version?
Use OCR?
> Does ps& pdf render characters sequentially
PS and PDF don't really do rendering; they are just file formats. The
rendering is done by applications such as Ghostscript, Adobe Acrobat
Reader, Foxit Reader and so on.
> , or pixels, or columns or glyphs sequentially;
Any of the above can be represented in PS or PDF, sequentially or not.
> and if by glyphs: do
> they have variable positions on the screen?
Yes, you can create a PS or PDF file in which the order of the letters
(however represented) in the file bears no relationship to the order in
which they appear once rendered on screen or on paper.
Consider the word "hat". I could write the following PostScript file to
write that word when passed to a PostScript interpreter (such as the
ones in my laser printers):
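%!PS
% The letters are drawn in the order a, t, h, not in reading order.
/Helvetica 12 selectfont
110 100 moveto (a) show   % second letter, drawn first
120 100 moveto (t) show   % third letter, drawn second
100 100 moveto (h) show   % first letter, drawn last
showpage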
To extract the word "hat" you pretty much have to use a PostScript
interpreter (equivalent to the one in the program that does the
rendering to a visible form).
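As a rough illustration of what an extractor then has to do, here is a Python sketch. It assumes an interpreter has already been run and has reported each `show` as an (x, y, string) record; the `shown` list is typed in by hand from the PostScript example above, and `reading_order` is a hypothetical helper, not part of any real tool.

```python
# Each record is (x, y, string), listed in the order the file issues them.
# These values are copied by hand from the PostScript example above.
shown = [
    (110, 100, "a"),
    (120, 100, "t"),
    (100, 100, "h"),
]

def reading_order(records):
    """Group fragments by baseline (y), then sort each line left to right."""
    lines = {}
    for x, y, s in records:
        lines.setdefault(y, []).append((x, s))
    result = []
    # PostScript's default y axis points up, so larger y means higher on the page.
    for y in sorted(lines, reverse=True):
        fragments = sorted(lines[y], key=lambda frag: frag[0])
        result.append("".join(s for _, s in fragments))
    return result

file_order = "".join(s for _, _, s in shown)  # "ath": order in the file
page_order = reading_order(shown)             # ["hat"]: order on the page
```

A real extractor has much more to cope with: baselines that differ slightly, gaps that may or may not be word spaces, rotated text, and glyphs whose character identity is unknown in the first place.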
A program might replace the simplistic way of representing "t" with a
procedure call that invokes a sequence of line-drawing commands that
draw a special version of that letter for some purpose (artistic or
whatever). The complexity is limited only by the imagination of the
programmer who wrote the program that produced the PS or PDF file.
PDF is in some ways simpler - it isn't a full-blown programming language
in its own right - but I think many of the same difficulties apply.
--
RGB