xml-general mailing list archives

From skoech...@n-soft.com (Sebastien Koechlin I-VISION)
Subject Re: Cocoon the other way???
Date Wed, 31 May 2000 09:46:04 GMT
Srinivasan Ramaswamy wrote:
> "anything" -> XML translation is not for translating the
> formatting properties into tags, but more for extracting the content from the
> PDF. Out in the field there is numerous content in HTML and PDF, i.e. content
> that is in its final presentation form. How do you get to make it XML? I think
> that is the question. Right now in the world, a large percentage of web content
> is in static files (HTML/PDF etc.). If we want to achieve a WWW that consists
> of XML data everywhere, don't you need a tool to retro-generate the content
> out of the presentation format?

On Linux you can use ps2ascii (it came with my RedHat 6.1).
It is Aladdin Ghostscript's PostScript or PDF to ASCII translator
and ships with Ghostscript. You can probably find a Win32 version.

With xpdf, you will find:
pdfimages	PDF image extractor
pdftopbm	PDF to Portable Bitmap (PBM) converter
pdftotext	PDF to text converter

Then you will have to re-create all the XML markup yourself.
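
To give an idea of that last step, here is a minimal sketch in Python. It assumes xpdf's pdftotext is on the PATH and that it writes to stdout when the output file is given as "-". The <document> and <para> element names and both function names are invented for this example; a real conversion would need a proper vocabulary and much more than paragraph detection.

```python
import re
import subprocess
import xml.etree.ElementTree as ET


def extract_text(pdf_path):
    """Run xpdf's pdftotext on pdf_path and return the extracted text.

    Assumes pdftotext is installed; "-" asks it to write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    # The output encoding depends on the tool's configuration, so decode
    # leniently instead of failing on an unexpected byte.
    return result.stdout.decode("utf-8", errors="replace")


def text_to_xml(text):
    """Wrap blank-line-separated blocks of text in <para> elements.

    The <document>/<para> vocabulary is hypothetical.
    """
    root = ET.Element("document")
    # Form feeds (page breaks in the extracted text) are treated as
    # paragraph breaks.
    text = text.replace("\f", "\n\n")
    for block in re.split(r"\n\s*\n", text):
        # Collapse line breaks and runs of whitespace inside one block.
        words = block.split()
        if words:
            ET.SubElement(root, "para").text = " ".join(words)
    # ElementTree escapes &, < and > in the text for us.
    return ET.tostring(root, encoding="unicode")
```

Something like text_to_xml(extract_text("report.pdf")) would then give one flat list of paragraphs (report.pdf is only a placeholder name). Headings, lists, tables and so on are not recovered at all, which is where the manual work comes in.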

But some PDF files contain fonts whose encodings have been
mangled beyond recognition. There is no way (short of OCR)
to extract text from these files.

I don't know what happens with non-bitmap images.

It's probably a lot of work; most of it has to be
done by hand.

Sebastien Koechlin
