cocoon-dev mailing list archives

From Sylvain Wallez <sylvain.wal...@anyware-tech.com>
Subject Re: [RT] Views for readers
Date Thu, 14 Aug 2003 11:41:55 GMT
Jeff Turner wrote:

>On Wed, Aug 13, 2003 at 12:02:04PM +0200, Sylvain Wallez wrote:
>  
>
>>Frederic's question about search engine integration led me to 
>>wonder how Cocoon's Lucene integration could transparently index 
>>Word & PDF documents along with XML-produced documents.
>>
>>There exist some text-extraction libraries for Word & PDF (e.g. 
>>http://www.textmining.org/). Now how can we integrate this as 
>>transparently as possible in Cocoon's search functionality ?
>>
>>The Lucene indexer crawls a website and asks for a particular view 
>>("content") which is used to fill the index. But Word and PDF documents 
>>being binary files, they're handled by a <map:read> statement, which 
>>does not handle views. On the other hand, this use case shows that 
>>having views on binary content may make sense : a "normal" request 
>>just sends back the binary content, while a view can use a text/XML 
>>extraction on these binary files.
>>
>>So the question is : how could views be plugged into readers ? I must say 
>>that I don't have an answer, as views contain transformers and a 
>>serializer, but no generator. So how could we express in the sitemap 
>>that a particular view on a reader should "replace" that reader by a 
>>particular generator ? Or should this go through some special readers 
>>that could also act as generators ?
>>
>>Or maybe these are silly thoughts and we should use a <map:select> 
>>directing to a <map:read> or <map:generate> depending on the view. But 
>>this introduces explicit view management in the pipelines, which doesn't 
>>seem nice to me.
>>    
>>
>
>Solution: strongly typed pipelines! :)
>
>Imagine if, at each node in the sitemap, we knew what type of content we
>were dealing with (usually some flavour of XML).  Then we could write a
>single view that behaves differently depending on the _type_ of data:
>
><map:view name="indexablecontent" from-position="first">
>  <map:select type="xml-type">
>    <map:when test="docbook">
>      <map:transform src="docbook2whatever.xsl"/>
>    </map:when>
>    <map:when test="tei">
>      <map:transform src="tei2whatever.xsl"/>
>    </map:when>
>    <map:when test="msword">
>      <map:transform src="word2whatever.xsl"/>
>    </map:when>
>  </map:select>
></map:view>
>

Ah, ok, the "strongly typed pipelines" are a different wording for 
"content-aware selectors" !

>So http://mycocoonsite.com/foo.doc?cocoon_view=indexablecontent would
>return XML representing the content of the .doc file.
>
>I described the same thing in a mail with subject 'Type-aware Views (Re:
>Link view goodness)'.  Same need, different context, same proposed
>solution.
>

Not exactly : the use case here is that we have a binary file which is 
normally sent as is to the browser using a reader. It is _not_ parsed as 
an XML stream. So we can't attach a view to these kinds of URLs since 
views provide a different _ending_ to a pipeline, meaning there must 
exist at least a generator and optionally one or more transformers at 
the point where processing is directed to the view.

So even content-aware selectors don't solve this problem...
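
For comparison, the explicit <map:select> workaround mentioned in my 
first mail might look roughly like the sketch below. It is only an 
illustration: the "msword-text" generator is hypothetical (no such 
component exists today), the match pattern and paths are made up, and 
I'm assuming that the view name arrives in the "cocoon-view" request 
parameter and that a "request-parameter" selector is declared in the 
sitemap.

  <map:match pattern="**.doc">
    <map:select type="request-parameter">
      <map:parameter name="parameter-name" value="cocoon-view"/>
      <!-- View requested: produce XML so that the view can take over -->
      <map:when test="content">
        <!-- "msword-text" is a hypothetical text-extracting generator -->
        <map:generate type="msword-text" src="docs/{1}.doc"/>
        <map:serialize type="xml"/>
      </map:when>
      <!-- Normal request: send the binary file as is -->
      <map:otherwise>
        <map:read src="docs/{1}.doc" mime-type="application/msword"/>
      </map:otherwise>
    </map:select>
  </map:match>

This would probably work, but every binary document type needs the same 
duplication, which is exactly the explicit view management in the 
pipelines that I'd like to avoid.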

Sylvain

-- 
Sylvain Wallez                                  Anyware Technologies
http://www.apache.org/~sylvain           http://www.anyware-tech.com
{ XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects }
Orixo, the opensource XML business alliance  -  http://www.orixo.com


