xml-general mailing list archives

From "Srinivasan Ramaswamy" <sramasw...@pyramidci.com>
Subject Re: Cocoon the other way???
Date Wed, 31 May 2000 09:41:38 GMT

> >"Anything" -> XML translation is not about translating the formatting
> >properties into tags, but about extracting the content from the PDF.
> >Out in the field there is a great deal of content in HTML and PDF, i.e.
> >in its final presentation form. How do you turn it into XML? I think
> >that is the question. Right now, a large percentage of web content is
> >in static files (HTML, PDF, etc.). If we want to achieve a WWW that
> >consists of XML data everywhere, don't we need a tool to retro-generate
> >the content out of the presentation format?
> >
> A nice goal.  But you can not go from Postscript or PDF to a marked-up
> document in general.  Sometimes you can, but that's a matter of luck.
> Basically, these are programs.  You run the program, and when it is done
> you have marks on paper (or a screen).  Humans can read the marks.  The
> program can even redefine the meaning of some of its standard features,
> so that you cannot be sure of the meaning of pieces of code you are
> looking at unless you are interpreting the code as you go along.  Even
> text may be broken up, even inside a word, so you cannot just look for
> characters in parentheses and be sure you have gotten something
> intelligible.
> And how would you mark it up, anyway?  You wouldn't know what the author
> intended unless you read and understood the text (and sometimes the
> images).  Try, for example, to take a PDF file that has a table
> surrounded by text.  Open it in Acrobat Reader and select the text and
> the table.  Now copy and paste it to a text document.  Usually, the
> order of text and table contents gets mixed up and becomes useless.
> So, you could get text out of some particular PS/PDF document.  You
> can't automatically reverse engineer the documents in general, because
> the PS/PDF program creates a non-reversible transformation.

I agree that one cannot have a general program to fish the data out of
PDF. But there are consumer companies that send out credit card, insurance,
bank, and order confirmation statements in HTML, PDF, or maybe even plain text
to the consumer. The business logic exists already: proven, tested, and
running. No one wants to tinker with it. So you come up with a
program, restricted to ONE PARTICULAR output format ONLY, that will convert
this format to meaningful XML. Of course, you need some kind of parameters
to locate each field, for example:
* text present at a particular location at the beginning of the page, or
* text that follows "From:" at the beginning of the document,
and so on.
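
To make the idea concrete, here is a minimal sketch in Python of such a
format-specific converter. It assumes the statement has already been reduced
to plain text. The field labels ("Account No:", "Balance Due:"), the element
names, the RULES table, and the sample statement are all hypothetical; a real
statement layout would need its own set of parameters.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical parameters for ONE particular statement layout:
# element name -> pattern whose single capture group holds the value.
RULES = {
    "from": re.compile(r"^From:\s*(.+)$", re.MULTILINE),
    "account": re.compile(r"^Account No\.?:\s*(\S+)$", re.MULTILINE),
    "balance": re.compile(r"^Balance Due:\s*\$?([\d,]+\.\d{2})$", re.MULTILINE),
}


def statement_to_xml(text, rules=RULES):
    """Convert one known plain-text statement layout to an XML string."""
    root = ET.Element("statement")

    # Location-based parameter: the first non-empty line is the title.
    for line in text.splitlines():
        if line.strip():
            ET.SubElement(root, "title").text = line.strip()
            break

    # Label-based parameters: the text that follows a known label.
    for name, pattern in rules.items():
        match = pattern.search(text)
        if match:
            ET.SubElement(root, name).text = match.group(1).strip()

    return ET.tostring(root, encoding="unicode")


# Invented sample input in the assumed layout.
SAMPLE = """\
ACME CARD MONTHLY STATEMENT
From: Acme Card Services
Account No: 1234-5678
Balance Due: $1,024.50
"""

if __name__ == "__main__":
    print(statement_to_xml(SAMPLE))
```

Fields whose pattern does not match are simply left out, so a statement that
drifts from the expected layout produces visibly incomplete XML rather than
wrong data.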


 Srinivasan Ramaswamy
 Senior System Analyst                   e-mail : sramaswamy@pyramidci.com
 Pyramid Consulting Inc.                 phone  : 770-248-0024 Ext. 501
 5335 Triangle Parkway Suite 510         fax    : 770-248-9560
 Norcross, GA 30092                      web    : www.pyramidci.com

     "If two men agree on everything, you may be sure that
      one of them is doing the thinking."
