cocoon-dev mailing list archives

From Sylvain Wallez <sylvain.wal...@anyware-tech.com>
Subject [RT] Views for readers
Date Wed, 13 Aug 2003 10:02:04 GMT
Frederic's question about search engine integration led me to wonder 
how Cocoon's Lucene integration could transparently index Word & PDF 
documents along with XML-produced documents.

There exist some text-extraction libraries for Word & PDF (e.g. 
http://www.textmining.org/). Now how can we integrate this as 
transparently as possible into Cocoon's search functionality?

The Lucene indexer crawls a website and asks for a particular view 
("content") which is used to fill the index. But since Word and PDF 
documents are binary files, they are handled by a <map:read> statement, 
which does not support views. On the other hand, this use case shows 
that views on binary content may make sense: a "normal" request just 
sends back the binary content, while a view can run a text/XML 
extraction on these binary files.

So the question is: how could views be plugged into readers? I must say 
that I don't have an answer, as views contain transformers and a 
serializer, but no generator. So how could we express in the sitemap 
that a particular view on a reader should "replace" that reader with a 
particular generator? Or should this go through some special readers 
that could also act as generators?

Or maybe these are silly thoughts and we should use a <map:select> 
directing to a <map:read> or <map:generate> depending on the view. But 
this introduces explicit view management in the pipelines, which doesn't 
seem nice to me.
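
To make the explicit alternative concrete, here is a rough sitemap 
sketch, not a tested configuration. It assumes the view is requested 
through the "cocoon-view" request parameter and tested with the 
request-parameter selector, and that a "content" view is declared in 
<map:views>. The "pdf-text" generator is purely hypothetical: no such 
component exists, it stands in for a wrapper around a text-extraction 
library. The fragment would sit inside a <map:pipeline>.

```xml
<!-- Sketch only: "pdf-text" is a hypothetical generator type. -->
<map:match pattern="docs/*.pdf">
  <!-- Branch on the requested view, assumed to arrive as the
       "cocoon-view" request parameter. -->
  <map:select type="request-parameter">
    <map:parameter name="parameter-name" value="cocoon-view"/>
    <map:when test="content">
      <!-- Indexer request: extract the text as XML. The label is meant
           to let the "content" view (assumed to be declared in
           map:views) take over from here. -->
      <map:generate type="pdf-text" src="docs/{1}.pdf" label="content"/>
      <map:serialize type="xml"/>
    </map:when>
    <map:otherwise>
      <!-- Normal request: send the binary file back unchanged. -->
      <map:read src="docs/{1}.pdf" mime-type="application/pdf"/>
    </map:otherwise>
  </map:select>
</map:match>
```

Every binary document type would need a similar block, which is exactly 
the explicit view management I would like to avoid.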

Any thoughts?

Sylvain

-- 
Sylvain Wallez                                  Anyware Technologies
http://www.apache.org/~sylvain           http://www.anyware-tech.com
{ XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects }
Orixo, the opensource XML business alliance  -  http://www.orixo.com


