cocoon-dev mailing list archives

From Sam Coward <>
Subject Re: [RT] Views for readers
Date Wed, 13 Aug 2003 11:29:30 GMT

> Frederic's question about search engine integration led me to 
> questioning myself at how Cocoon's Lucene integration could be able to 
> transparently index Word & PDF documents along with XML-produced 
> documents.

I have been wondering that too. At my company, we put together a simple 
web management tool to put small collections of documents into a web 
frame for a client. Pretty useless, but it's what he wanted.

At the time I had thought it may be possible to just improve Lucene so 
it could understand binary files by introducing mime-type triggerable 
filter modules that converted to text on the input stream. After all, if 
the text were only going to be used for indexing, it wouldn't matter if 
the text wasn't available within Cocoon itself. In any case he's happy 
with what he has and we're happily doing other stuff.
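
A rough sketch of what I mean by MIME-type-triggered filters, in plain
Java. Every name here (TextExtractor, ExtractorRegistry,
PlainTextExtractor) is made up for illustration; none of it is Lucene or
Cocoon API, and real Word/PDF extractors would have to be plugged in
behind the same interface.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical: turns a document's raw bytes into plain text for indexing. */
interface TextExtractor {
    String extract(InputStream in) throws IOException;
}

/** Hypothetical registry that picks an extractor by MIME type. */
class ExtractorRegistry {
    private final Map<String, TextExtractor> byMimeType = new HashMap<>();

    void register(String mimeType, TextExtractor extractor) {
        byMimeType.put(normalize(mimeType), extractor);
    }

    /** Returns the extracted text, or null if no extractor handles the type. */
    String extractText(String mimeType, InputStream in) throws IOException {
        TextExtractor extractor = byMimeType.get(normalize(mimeType));
        return extractor == null ? null : extractor.extract(in);
    }

    /** Drops parameters such as "; charset=utf-8" and lower-cases the type. */
    private static String normalize(String mimeType) {
        int semi = mimeType.indexOf(';');
        String base = semi >= 0 ? mimeType.substring(0, semi) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }
}

/** Trivial extractor for text/plain; assumes the bytes are UTF-8. */
class PlainTextExtractor implements TextExtractor {
    @Override
    public String extract(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int read;
        while ((read = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

The indexer would call extractText() with whatever content type the
document source reports, and skip the document (or index only its
metadata) when it gets null back.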

Perhaps if the individual extractors are part of specialised readers for 
specific types of documents, then you could configure the label for the 
XML they return? That would allow for the duality of that behaviour to 
be mostly concealed and managed from within Cocoon with little effect on 
the sitemap.

I personally find it tempting to think that it may be possible to rip 
out XML from any of these formats and do with it as we wish, 
particularly since I saw that programs like catdoc could recognize 
tables even in Word 2k documents. But I often find myself arguing 
against that: maybe I should represent all content (even 
document content) semantically in XML, let rendering technologies 
(PDFSerializer, POI) handle binary output, and perhaps leverage document 
importers that map those documents back to XML (they all seem to be 
proprietary, big-buck solutions from what I see currently, though). In 
any case, that does seem to be a ways off in the future *sigh*

Hmm, an OCR extractor would be way cool for faxes too!

Just my 2c; I never say anything most of the time, anyway.
