xml-general mailing list archives

From "Thomas B. Passin" <tpas...@mitretek.org>
Subject Re: Cocoon the other way???
Date Wed, 31 May 2000 14:21:48 GMT
Srinivasan Ramaswamy wrote:

>"anything" -> XML translation is not for translating the
>formatting properties into tags. But more for extracting the content
>from the pdf. Out in the field are numerous content in HTML and PDF,
>i.e. that is their final presentation form. How do you get to make
>them XML? I think
>is the question. Right now in the world, a large percentage of web
>is in static files (HTML/PDF etc.) If we want to achieve a WWW that
>of XML data everywhere, don't you need a tool to retro-generate the
>out of the presentation format.

A nice goal.  But you cannot go from PostScript or PDF to a marked-up
document in general.  Sometimes you can, but that's a matter of luck.
Basically, these are programs.  You run the program, and when it is done
you have marks on paper (or a screen).  Humans can read the marks.  The
program can even redefine the meaning of some of its standard features,
so that you cannot be sure of the meaning of pieces of code you are
looking at unless you are interpreting the code as you go along.  Even
text may be broken up, even inside a word, so you cannot just look for
characters in parentheses and be sure you have gotten something
meaningful.
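
To make the point about parentheses concrete, here is a minimal sketch in Python. The content fragment is hand-written for illustration (it is not taken from any real file), and the helper name naive_strings is hypothetical. The fragment uses a PDF-style TJ array, in which one word is emitted as several kerned pieces.

```python
import re

# Hand-written, PDF-style fragment (an assumption for illustration only).
# The word "Transformation" is drawn as four kerned pieces in a TJ array,
# so it never appears as a single parenthesized string.
stream = "BT /F1 12 Tf 72 700 Td [(Tr) -20 (ans) 15 (for) -10 (mation)] TJ ET"


def naive_strings(content):
    """Return every simple parenthesized literal, ignoring all operators."""
    # Matches (...) with no nested parentheses and no backslash escapes.
    return re.findall(r"\(([^()\\]*)\)", content)


pieces = naive_strings(stream)
print(pieces)            # ['Tr', 'ans', 'for', 'mation']
print(" ".join(pieces))  # Tr ans for mation
```

Joining the pieces without spaces happens to rebuild the word here, but the same rule applied across a whole page would also glue separate words together wherever the space between them is produced by positioning rather than by a space character.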

And how would you mark it up, anyway?  You wouldn't know what the author
intended unless you read and understood the text (and sometimes the
images).  Try, for example, to take a PDF file that has a table
surrounded by text.  Open it in Acrobat Reader and select the text and
the table.  Now copy and paste it to a text document.  Usually, the
order of text and table contents gets mixed up and becomes useless.
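
The table problem can be sketched with a toy model in Python. Everything here is an assumption made for illustration: the page contents, the coordinates, and the two helper functions. Each fragment is just a piece of text drawn at a position, which is all the page description records.

```python
# Hypothetical page: each tuple is (x, y, text), listed in the order the
# producing program happened to draw it. y grows upward, as in PDF.
fragments = [
    (72, 700, "Sales rose in both regions."),
    (72, 640, "See the table for details."),
    (300, 680, "Region"),
    (300, 660, "North"),
    (300, 640, "South"),
    (380, 680, "Units"),
    (380, 660, "120"),
    (380, 640, "95"),
]


def stream_order(frags):
    """Text in drawing order: what a naive copy might produce."""
    return [text for _, _, text in frags]


def top_down_order(frags):
    """Heuristic: sort by descending y, then ascending x."""
    return [text for _, _, text in sorted(frags, key=lambda f: (-f[1], f[0]))]


print(stream_order(fragments))
print(top_down_order(fragments))
```

In drawing order the table comes out column by column, so "North" is separated from "120". Sorting by position keeps the rows together but drops the second sentence into the middle of the last table row. Neither ordering tells you where the table starts or ends, because nothing in the fragments says "this is a table".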

So, you could get text out of some particular PS/PDF document.  You
can't automatically reverse engineer the documents in general, because
the PS/PDF program creates a non-reversible transformation.

Tom Passin
