| Started by | no.top.post@gmail.com |
|---|---|
| First post | 2011-08-26 18:46 +0000 |
| Last post | 2011-08-27 00:21 +0100 |
| Articles | 2 — 2 participants |
techniques of extracting the original ASCII? no.top.post@gmail.com - 2011-08-26 18:46 +0000
Re: techniques of extracting the original ASCII? RedGrittyBrick <RedGrittyBrick@SpamWeary.invalid> - 2011-08-27 00:21 +0100
| From | no.top.post@gmail.com |
|---|---|
| Date | 2011-08-26 18:46 +0000 |
| Subject | techniques of extracting the original ASCII? |
| Message-ID | <j38pmg$mvp$1@dont-email.me> |
I previously asked here:
Why is ps/pdf quirky with "ff" ?
and got the answers that "it's rendered by a glyph".
Well of course, but WHY? Why then isn't "a"
"rendered by a glyph"?
Since char("f") was originally entered by a keyboard as
ASCII, why should *IT*, and not other chars be transformed?
------------
I'm trying to absorb the contents of [230069 bytes]
http://www.cogsci.rpi.edu/~rsun/sun.clarion2005.pdf
which is frustrating since I can't extract any of the text to
my own notes.
I'm surprised that it's a: %PDF-1.2
because it's a newish document.
And what I can't understand is: unless the original
'typed script' was given to a Chinese wood carver who
treated each char as an individual piece of art, why can
neither linux-tools nor Win7-adobe extract the original text
[except for that of one diagram] ?!
And although close examination of the rendering does
show that the 'same ascii-wise chars' DO have slightly
different appearances, if the commonality of e.g. all
char("N")s had not been factored out, the file would
be massively increased in size.
How do you solve this problem of not being able to get
an ASCII version of such 'texts'?
Do PS and PDF render characters sequentially, or pixels,
or columns or glyphs sequentially; and if by glyphs: do
they have variable positions on the screen?
== TIA.
| From | RedGrittyBrick <RedGrittyBrick@SpamWeary.invalid> |
|---|---|
| Date | 2011-08-27 00:21 +0100 |
| Message-ID | <p7udnZW0IYHrt8XTnZ2dnUVZ7tSdnZ2d@bt.com> |
| In reply to | #309 |
On 26/08/2011 19:46, no.top.post@gmail.com wrote:
> I previously asked here:
> Why is ps/pdf quirky with "ff" ?
>
> and got the answers that "it's rendered by a glyph".
> Well of course, but WHY? Why then isn't "a"
> "rendered by a glyph"?
See http://en.wikipedia.org/wiki/Portable_Document_Format#Text
Letters and other symbols in PS and PDF files are often represented in
the file in much the same way they are in a text file edited by Notepad,
except that the PS and PDF files specify the font to be used. This
requires that the font be found in the reader's operating system or that
the font (or a subset of it) be embedded in the document.
When a font is embedded in a document, there is no need for it to have
the "normal" encoding. For example, "A" is character 65 in ASCII and so
is often represented in files by a byte with a decimal value of 65.
However, the software that produced the PDF may have given the embedded
font a private, custom encoding. This makes it harder for a
plain-text extraction program to know what character might be
represented by a particular byte value in a PDF.
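
A minimal Python sketch of that problem, not taken from any real PDF. The
byte values, the `custom_encoding` table and the function name
`decode_with_table` are all invented for illustration; a real extractor
would have to recover such a table from the PDF's font data, if it is
there at all.

```python
# Hypothetical private encoding, of the kind a PDF producer might assign
# to an embedded font subset: codes handed out in order of first use.
custom_encoding = {1: "h", 2: "a", 3: "t"}


def decode_with_table(data: bytes, table: dict) -> str:
    """Map each byte to a character using the font's own code table."""
    return "".join(table.get(b, "?") for b in data)


raw = bytes([1, 2, 3])  # what the file would store for the word "hat"

# A naive extractor treats the bytes as ordinary character codes.
naive = raw.decode("latin-1")
# An extractor that knows the private table recovers the text.
correct = decode_with_table(raw, custom_encoding)

print(repr(naive))
print(correct)
```

Without the table, the naive result is three control characters rather
than anything readable, which is roughly what copy-and-paste gives you
from such a file.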
An application that creates PDF files could also represent a specific
letter as a sequence of vector-graphics commands or as a bitmap image. A
document produced by an image scanner might naturally do the latter
unless the scanning software includes OCR capabilities.
> Since char("f") was originally entered by a keyboard as
> ASCII, why should *IT*, and not other chars be transformed?
Because it is part of a ligature. Programs that care about good
typography will make use of available ligatures to provide a more
readable and aesthetically pleasing result.
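
If an extraction tool does hand back the ligature as a single Unicode
character, it can be turned back into plain letters. A small Python
sketch, assuming the extractor produced U+FB00 (the "ff" ligature); the
function name `expand_ligatures` is made up here, and note that NFKD
normalization also rewrites other compatibility characters, not only
ligatures.

```python
import unicodedata

LIGATURE_FF = "\ufb00"  # U+FB00 LATIN SMALL LIGATURE FF, one code point


def expand_ligatures(text: str) -> str:
    """Replace compatibility characters such as ligatures by plain letters."""
    return unicodedata.normalize("NFKD", text)


# What an extractor might return for the word "different".
extracted = "di" + LIGATURE_FF + "erent"

print(len(extracted), expand_ligatures(extracted))
```

This does not help when the font uses a private encoding and the tool
cannot tell which glyph is the ligature in the first place.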
> ------------
> I'm trying to absorb the contents of [230069 bytes]
> http://www.cogsci.rpi.edu/~rsun/sun.clarion2005.pdf
> which is frustrating since I can't extract any of the text to
> my own notes.
> I'm surprised that it's a: %PDF-1.2
> because it's a newish document.
> And what I can't understand is: unless the original
> 'typed script' was given to a Chinese wood carver who
> treated each char as an individual piece of art, why can
> neither linux-tools nor Win7-adobe extract the original text
> [except for that of one diagram] ?!
Perhaps because the text has not been encoded in the most simple way.
> And although close examination of the rendering does
> show that the 'same ascii-wise chars' DO have slightly
> different appearances, if the commonality of e.g. all
> char("N")s had not been factored out, the file would
> be massively increased in size.
Perhaps.
> How do you solve this problem of not being able to get
> an ASCII version of such 'texts'?
Ask the author for a text version?
Use OCR?
> Do PS and PDF render characters sequentially
PS and PDF don't really do rendering, they are just file formats. The
rendering is done by applications such as Ghostscript, Adobe Acrobat
Reader, Foxit Reader and so on.
> , or pixels, or columns or glyphs sequentially;
Any of the above can be represented in PS or PDF. Sequentially or not.
> and if by glyphs: do
> they have variable positions on the screen?
Yes, you can create a PS or PDF file in which the order of the letters
(however represented) in the file bears no relationship to the order in
which they appear once rendered on screen or on paper.
Consider the word "hat". I could write the following PostScript file to
write that word when passed to a PostScript interpreter (such as the
ones in my laser printers):
%!PS
/Helvetica 12 selectfont   % 12-point Helvetica
110 100 moveto (a) show    % second letter, drawn first
120 100 moveto (t) show    % third letter
100 100 moveto (h) show    % first letter, drawn last
showpage
To extract the word "hat" you pretty much have to use a PostScript
interpreter (equivalent to the one in the program that does the
rendering to a visible form).
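
Once an interpreter has told you where each piece of text landed, putting
it back in reading order is the easy part. A rough Python sketch,
assuming the positions have already been obtained and that everything
sits on simple horizontal lines; the `fragments` list is copied by hand
from the PostScript example, and `reading_order` is a made-up helper.

```python
# (x, y, text) for each "moveto ... show" pair, in file order.
fragments = [(110, 100, "a"), (120, 100, "t"), (100, 100, "h")]


def reading_order(frags):
    """Sort top to bottom (larger y is higher in PostScript), then left to right."""
    return sorted(frags, key=lambda f: (-f[1], f[0]))


file_order = "".join(text for _, _, text in fragments)
page_order = "".join(text for _, _, text in reading_order(fragments))

print(file_order, page_order)
```

The hard part, which this sketch skips entirely, is getting those
coordinates out of an arbitrary PS program.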
A program might replace the simplistic way of representing "t" with a
procedure call that runs a sequence of line-drawing commands that will
draw a special version of that letter for some purpose (artistic,
whatever). The complexity is only limited by the imagination of the
programmer who wrote the program that produced the PS or PDF file.
PDF is in some ways simpler - it isn't a full-blown programming language
in its own right - but I think many of the same difficulties apply.
--
RGB