Date: Thu, 14 Aug 2003 21:38:02 +1000
From: Jeff Turner
To: dev@cocoon.apache.org
Subject: Re: [RT] Views for readers
Message-ID: <20030814113802.GA4824@expresso.localdomain>
References: <3F3A0C9C.3090003@anyware-tech.com>
In-Reply-To: <3F3A0C9C.3090003@anyware-tech.com>

On Wed, Aug 13, 2003 at 12:02:04PM +0200, Sylvain Wallez wrote:
> Frederic's question about search engine integration led me to wonder
> how Cocoon's Lucene integration could transparently index Word & PDF
> documents along with XML-produced documents.
>
> There exist some text-extraction libraries for Word & PDF (e.g.
> http://www.textmining.org/). Now how can we integrate this as
> transparently as possible in Cocoon's search functionality?
>
> The Lucene indexer crawls a website and asks for a particular view
> ("content") which is used to fill the index.
> But Word and PDF documents being binary files, they're handled by a
> statement, which does not handle views. On the other hand, this use
> case shows that having views on binary content may make sense: the
> "normal" request just sends back the binary content, while a view can
> use a text/XML extraction on these binary files.
>
> So the question is: how could views be plugged into readers? I must say
> that I don't have an answer, as views contain transformers and a
> serializer, but no generator. So how could we express in the sitemap
> that a particular view on a reader should "replace" that reader by a
> particular generator? Or should this go through some special readers
> that could also act as generators?
>
> Or maybe these are silly thoughts and we should use a
> directing to a or depending on the view. But
> this introduces explicit view management in the pipelines, which doesn't
> seem nice to me.

Solution: strongly typed pipelines! :)

Imagine if, at each node in the sitemap, we knew what type of content we
were dealing with (usually some flavour of XML). Then we could write a
single view that behaves differently depending on the _type_ of data:

So http://mycocoonsite.com/foo.doc?cocoon_view=indexablecontent would
return XML representing the content of the .doc file.

I described the same thing in a mail with subject 'Type-aware Views (Re:
Link view goodness)'. Same need, different context, same proposed
solution.

--Jeff

> Any thoughts?
>
> Sylvain
>
> --
> Sylvain Wallez                                  Anyware Technologies
> http://www.apache.org/~sylvain           http://www.anyware-tech.com
> { XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects }
> Orixo, the opensource XML business alliance - http://www.orixo.com
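
For illustration only, the type-aware view idea above could be sketched
roughly like this. It is Python rather than sitemap syntax, and every name
in it (VIEWS, view, render, the two extractors) is made up for the example;
none of it is Cocoon API, and the Word extractor is just a placeholder for
a real text-extraction library.

```python
from typing import Callable, Dict, Tuple, Union
from xml.sax.saxutils import escape

# Hypothetical registry: (view name, content type) -> function producing XML.
# These names are assumptions for the sketch, not part of Cocoon.
Extractor = Callable[[bytes], str]
VIEWS: Dict[Tuple[str, str], Extractor] = {}


def view(name: str, content_type: str) -> Callable[[Extractor], Extractor]:
    """Register an extractor for one (view, content type) pair."""
    def register(func: Extractor) -> Extractor:
        VIEWS[(name, content_type)] = func
        return func
    return register


@view("indexablecontent", "text/xml")
def xml_passthrough(data: bytes) -> str:
    # XML content is already indexable, so hand it on unchanged.
    return data.decode("utf-8")


@view("indexablecontent", "application/msword")
def word_to_xml(data: bytes) -> str:
    # Placeholder for a real text-extraction library: treat the bytes as text.
    text = data.decode("latin-1")
    return "<content>" + escape(text) + "</content>"


def render(data: bytes, content_type: str, view_name: str = "") -> Union[bytes, str]:
    """Return raw bytes for a normal request, XML text for a known view."""
    if not view_name:
        # The "normal" request: send back the binary content as-is.
        return data
    try:
        extractor = VIEWS[(view_name, content_type)]
    except KeyError:
        raise LookupError(f"no {view_name!r} view for {content_type}")
    return extractor(data)
```

The point of the sketch is that one view name ("indexablecontent") is
resolved against the content type at that node, so the pipeline itself
needs no explicit per-view branching.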