xml-general mailing list archives

From Paul.Wa...@wdr.com
Subject Re: PDF to XML - LOL!
Date Fri, 28 Jan 2000 13:59:32 GMT
     It has been really interesting following the threads on this 
     item, and they give me another perspective on PDF -> XML.
     
     My perspective in posting the item was that this system has legacy 
     docs in PDF, and that from an architectural standpoint, if I can get 
     them into XML then I can react to the business a lot quicker. 
     
     Really all I want to do is put together a framework where the PDF 
     docs can be mixed with associated data from other systems and then 
     served to the relevant user service, e.g. WWW, WAP, B2B, PDA, eBook(?), 
     other messaging systems, anything else that comes along.
     
     As I see it, whatever I build now should not be a quick fix to get PDF 
     mixed in with some other stuff to deliver just to the WWW.
     
     To pick up a question in Dan's note, I think I might be able to get 
     the source of a few documents, but I would like to point out that we 
     are talking about tens of thousands of documents in this particular 
     case. :-( not good.
     
     
     thanks Paul


______________________________ Reply Separator _________________________________
Subject: Re: PDF to XML - LOL!
Author:  dman (dman@es.co.nz) at unix,mime
Date:    28/01/00 12:52


Pierpaolo Fumagalli wrote:
> ... You can't "recontextualize"
> information that was extracted from its context...
     
Indeed.
I accept that someone may take it upon themselves to inline a 
representation of binary or proprietary data (I still think of PDF 
as proprietary, in comparison to XML anyway).
I guess you're welcome to introduce a <UUENCODE> block or whatever 
suits.
     
The thing is, it's a bit beyond XML translators (at the moment) to look 
at a magazine page and break it up into its constituent bits with 
meaningful tag names. Heck even translating from Word->HTML is a mess 
unless the original has been crafted using style templates 100% of the 
time. In my experience PDF (with its eye on a completely different ball) 
tends to obfuscate the STRUCTURE and the CONTENT (yay XML!) of the 
document even more. 
     
Honestly, if you really need to proceed in this direction, the best 
you're going to achieve is a parcel of <TEXT></TEXT> nodes, similar to a 
'save as plain text' function in the DTP packages.
OK, possibly you can tune it to recognise titles & bylines - but only 
for your select group of identically structured source docs. There will 
be no push-button solution for a while.
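
To make the "parcel of <TEXT> nodes" concrete, here is a minimal Python sketch of roughly what that best case might look like. It assumes the plain text has already been pulled out of the PDF by some other tool, which is not shown. The names text_to_xml, DOCUMENT, TITLE and title_max_len are made up for illustration, and the title guess is the kind of tuning that would only hold for a group of identically laid out source docs.

```python
import re
import xml.etree.ElementTree as ET


def text_to_xml(plain_text, title_max_len=60):
    """Wrap blank-line-separated blocks of extracted text in <TEXT> nodes.

    Hypothetical heuristic: if the first block is a single short line,
    tag it <TITLE>. Nothing here recovers the real document structure.
    """
    root = ET.Element("DOCUMENT")
    # Split on blank lines; that is all the structure plain text preserves.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", plain_text) if b.strip()]
    for i, block in enumerate(blocks):
        is_title = i == 0 and "\n" not in block and len(block) <= title_max_len
        node = ET.SubElement(root, "TITLE" if is_title else "TEXT")
        # Collapse the hard line breaks left over from page layout.
        node.text = " ".join(block.split())
    return root


if __name__ == "__main__":
    sample = "PDF to XML\n\nFirst paragraph of\nbody text.\n\nSecond paragraph."
    print(ET.tostring(text_to_xml(sample), encoding="unicode"))
```

Everything beyond the first block comes out as an undifferentiated TEXT node, which is the point: columns, captions, bylines and pull quotes all look the same once the layout is gone.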
     
Seeing as you're looking into this field, have you ever tried to train 
HTML-Transit to do its translations? It'd be like that only worse & less 
accurate.
     
Do you have access to the source documents that the PDFs were distilled 
from? Get hold of them and you _may_ find a better packaged solution 
available.
     
- trying to be constructive this time - 
...dan.



