xml-general mailing list archives

From "Philipp Knirck" <p...@maas.de>
Subject RE: PDF to XML - LOL!
Date Fri, 28 Jan 2000 14:22:51 GMT

If you care about AFP --> PDF:
MAAS High Tech has developed an AFP2Web converter that does just that.

www.afp2web.de


Check it out!


Kind regards,

Philipp Knirck

MAAS High Tech Software GmbH
Siemensweg 4
70794 Filderstadt - Bonlanden
Tel.  0711 - 77 91 7(0) - 39

Mobile 0177 - 34 02 113

Email: mailto:phil@maas.de


> -----Original Message-----
> From: Dan Morrison [mailto:dman@es.co.nz]
> Sent: Friday, 28 January 2000 12:53
> To: general@xml.apache.org
> Subject: Re: PDF to XML - LOL!
>
>
> Pierpaolo Fumagalli wrote:
> > ... You can't "recontextualize"
> > information that was extracted from its context...
>
> Indeed.
> I accept that someone may take it upon themselves to inline a
> representation of binary or proprietary data (I still think of PDF
> as proprietary, in comparison to XML anyway).
> I guess you're welcome to introduce a <UUENCODE> block or whatever
> suits.
>
> The thing is, it's a bit beyond XML translators (at the moment) to look
> at a magazine page and break it up into its constituent bits with
> meaningful tag names. Heck even translating from Word->HTML is a mess
> unless the original has been crafted using style templates 100% of the
> time. In my experience PDF (with its eye on a completely different ball)
> tends to obfuscate the STRUCTURE and the CONTENT (yay XML!) of the
> document even more.
>
> Honestly, if you really need to proceed in this direction, the best
> you're going to achieve is a parcel of <TEXT></TEXT> nodes, similar to a
> 'save as plain text' function in the DTP packages.
> OK, possibly you can tune it to recognise titles & bylines - but only
> for your select group of identically structured source docs. There will
> be no push-button solution for a while.
>
> Seeing as you're looking into this field, have you ever tried to train
> HTML-Transit to do its translations? It'd be like that only worse & less
> accurate.
>
> Do you have access to the source documents that the PDFs were distilled
> from? Get hold of them and you _may_ find a better packaged solution
> available.
>
> - trying to be constructive this time -
> .dan.
