Mailing-List: contact dev-help@cocoon.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cocoon.apache.org
Date: Thu, 14 Aug 2003 21:38:02 +1000
From: Jeff Turner <jefft@apache.org>
To: dev@cocoon.apache.org
Subject: Re: [RT] Views for readers
Message-ID: <20030814113802.GA4824@expresso.localdomain>
References: <3F3A0C9C.3090003@anyware-tech.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <3F3A0C9C.3090003@anyware-tech.com>
User-Agent: Mutt/1.5.4i

On Wed, Aug 13, 2003 at 12:02:04PM +0200, Sylvain Wallez wrote:
> Frederic's question about search engine integration led me to wonder 
> how Cocoon's Lucene integration could transparently index Word & PDF 
> documents along with XML-produced documents.
> 
> There are some text-extraction libraries for Word & PDF (e.g. 
> http://www.textmining.org/). Now how can we integrate this as 
> transparently as possible in Cocoon's search functionality?
> 
> The Lucene indexer crawls a website and asks for a particular view 
> ("content") which is used to fill the index. But Word and PDF documents 
> being binary files, they're handled by a <map:read> statement, which 
> does not handle views. On the other hand, this use case shows that 
> having views on binary content may make sense: a "normal" request 
> just sends back the binary content, while a view can use text/XML 
> extraction on these binary files.
> 
> So the question is: how could views be plugged into readers? I must say 
> that I don't have an answer, as views contain transformers and a 
> serializer, but no generator. So how could we express in the sitemap 
> that a particular view on a reader should "replace" that reader with a 
> particular generator? Or should this go through some special readers 
> that could also act as generators?
> 
> Or maybe these are silly thoughts and we should use a <map:select> 
> directing to a <map:read> or <map:generate> depending on the view. But 
> this introduces explicit view management in the pipelines, which doesn't 
> seem nice to me.

Solution: strongly typed pipelines! :)

Imagine if, at each node in the sitemap, we knew what type of content we
were dealing with (usually some flavour of XML).  Then we could write a
single view that behaves differently depending on the _type_ of data:

<map:view name="indexablecontent" from-position="first">
  <map:select type="xml-type">
    <map:when test="docbook">
      <map:transform src="docbook2whatever.xsl"/>
    </map:when>
    <map:when test="tei">
      <map:transform src="tei2whatever.xsl"/>
    </map:when>
    <map:when test="msword">
      <map:transform src="word2whatever.xsl"/>
    </map:when>
  </map:select>
</map:view>

So http://mycocoonsite.com/foo.doc?cocoon_view=indexablecontent would
return XML representing the content of the .doc file.

I described the same thing in a mail with subject 'Type-aware Views (Re:
Link view goodness)'.  Same need, different context, same proposed
solution.
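
A minimal sketch of how the "xml-type" selector used in that view might be
declared, assuming someone wrote one.  Cocoon does not ship such a
selector; the class name org.example.XmlTypeSelector is made up for
illustration, and only the <map:selectors> declaration syntax is standard
sitemap:

```xml
<map:components>
  <map:selectors>
    <!-- Hypothetical: no such selector exists in Cocoon. It would need the
         sitemap to track a content type (docbook, tei, msword, ...) at each
         node and compare it against the test attribute of each map:when. -->
    <map:selector name="xml-type" src="org.example.XmlTypeSelector"/>
  </map:selectors>
</map:components>
```

The declaration is the easy part.  The real work is making the sitemap
know the type at the point where the view branches off, which is what
"strongly typed pipelines" would have to provide.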


--Jeff


> Any thoughts?
> 
> Sylvain
> 
> -- 
> Sylvain Wallez                                  Anyware Technologies
> http://www.apache.org/~sylvain           http://www.anyware-tech.com
> { XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects }
> Orixo, the opensource XML business alliance  -  http://www.orixo.com