xml-general mailing list archives

From Dan Morrison <d...@es.co.nz>
Subject Re: PDF to XML - LOL!
Date Fri, 28 Jan 2000 12:52:39 GMT
Pierpaolo Fumagalli wrote:
> ... You cant "recontextualize"
> those informations that were extracted from their context...

Indeed.
I accept that someone may take it upon themselves to inline a
representation of binary or proprietary data (I still think of PDF
as proprietary, in comparison to XML anyway).
I guess you're welcome to introduce a <UUENCODE> block or whatever
suits.
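
For what it's worth, here is a rough sketch of what inlining might look
like. It uses base64 rather than uuencode, purely because base64 is the
more common choice today; the <BINARY> element name, the "encoding"
attribute, and both helper functions are made up for illustration.

```python
import base64
import xml.etree.ElementTree as ET


def inline_binary(data: bytes, tag: str = "BINARY") -> ET.Element:
    """Wrap raw bytes in a (hypothetical) element as base64 text."""
    elem = ET.Element(tag, {"encoding": "base64"})
    elem.text = base64.b64encode(data).decode("ascii")
    return elem


def extract_binary(elem: ET.Element) -> bytes:
    """Recover the original bytes from an element built by inline_binary."""
    return base64.b64decode(elem.text)


# Placeholder bytes standing in for a real PDF file.
pdf_bytes = b"%PDF-1.3 not a real PDF, just placeholder bytes"

node = inline_binary(pdf_bytes)
xml_text = ET.tostring(node, encoding="unicode")

# Parse the serialised XML again and decode the payload.
round_trip = extract_binary(ET.fromstring(xml_text))
```

The bytes survive the round trip, but note that nothing about the
document's structure or content has become any more visible to an XML
tool; the payload is still an opaque blob.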

The thing is, it's a bit beyond XML translators (at the moment) to look
at a magazine page and break it up into its constituent bits with
meaningful tag names. Heck, even translating from Word->HTML is a mess
unless the original has been crafted using style templates 100% of the
time. In my experience PDF (with its eye on a completely different ball)
tends to obfuscate the STRUCTURE and the CONTENT (yay XML!) of the
document even more.

Honestly, if you really need to proceed in this direction, the best
you're going to achieve is a parcel of <TEXT></TEXT> nodes, similar to a
'save as plain text' function in the DTP packages.
OK, possibly you can tune it to recognise titles & bylines - but only
for your select group of identically structured source docs. There will
be no push-button solution for a while.
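
To make the "parcel of <TEXT> nodes" idea concrete, here is a minimal
sketch that starts from text already saved as plain text (it does not
read PDF at all). The <DOCUMENT>, <TITLE> and <BYLINE> names, the
text_to_xml function, and the two heuristics (a short first block is a
title, a block starting with "By " is a byline) are all assumptions for
illustration, and would only hold for one family of identically laid
out documents.

```python
import xml.etree.ElementTree as ET


def text_to_xml(plain: str) -> ET.Element:
    """Turn blank-line-separated text blocks into flat XML nodes."""
    root = ET.Element("DOCUMENT")
    blocks = [b.strip() for b in plain.split("\n\n") if b.strip()]
    for i, block in enumerate(blocks):
        # Collapse hard line wraps inside a block into single spaces.
        one_line = " ".join(block.split())
        if i == 0 and len(one_line) < 80:
            tag = "TITLE"    # assumed: a short first block is the title
        elif one_line.lower().startswith("by "):
            tag = "BYLINE"   # assumed: bylines start with "By "
        else:
            tag = "TEXT"     # everything else is undifferentiated text
        ET.SubElement(root, tag).text = one_line
    return root


sample = (
    "PDF to XML\n\n"
    "By A. Writer\n\n"
    "First paragraph of\nbody text.\n\n"
    "Second paragraph."
)
doc = text_to_xml(sample)
tags = [child.tag for child in doc]
```

Anything the two rules miss just falls through to <TEXT>, which is
exactly the limitation: sidebars, captions and pull quotes all come out
looking the same.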

Seeing as you're looking into this field, have you ever tried to train
HTML-Transit to do its translations? It'd be like that only worse & less
accurate.

Do you have access to the source documents that the PDFs were distilled
from? Get hold of them and you _may_ find a better packaged solution
available.

- trying to be constructive this time -
.dan.
