cocoon-dev mailing list archives

From "Bernhard Huber" <berni_hu...@a1.net>
Subject [Status] Searching XML in Cocoon
Date Sat, 15 Dec 2001 17:39:45 GMT
  Hi,
I'd like to commit Searching XML in Cocoon.
I must confess that I have not cleared the CVS SSH hurdle yet.
I would also like to know which branch I should check in to, and whether 
it should go into src or scratchpad.
As this is not final, I think putting it into scratchpad would be better, 
so that people can use and try it first.
I think a sitemap would be fine for using the searching and indexing, 
and for demonstrating the usage of these components.
Oops, and I think I have violated the coding conventions by indenting 
only 2 spaces, so I need to reformat before submitting.
Is there any tool for that?

Any comments?

Some documentation about the feature...

Abstract
Searching XML in Cocoon using Lucene as search engine.

Overview
Lucene ( http://jakarta.apache.org/lucene ) is an indexing & searching API.
Several new Cocoon components utilize this API to provide "Searching 
XML in Cocoon".

There are two services provided by these components:
Indexing
Searching

Indexing is realized by crawling starting from a base URI, and 
generating a lucene index.
Searching uses the generated lucene index. The index is searched for a 
requested query.

The crawling component is packaged in org.apache.cocoon.components.crawler.
Indexing and searching are packaged in org.apache.cocoon.components.search. 
A Cocoon generator using the searching components is packaged in 
org.apache.cocoon.generation.

A GUI for searching is implemented both as an XSP and as a generator. 
The two implementations can be used independently.

Description

Since an existing index is a precondition for searching, crawling and 
indexing are described first; a description of searching follows.

The crawling component provides all links of a requested URI. The links of 
a URI are requested by using the Cocoon feature of views. A URI which is 
allowed to be crawled must provide such a view. By default the crawling 
component requests the view named links.
A link view must provide a response of content type 
application/x-cocoon-links. Using a serializer of type links with src 
org.apache.cocoon.serialization.LinkSerializer will guarantee the 
correct content type.
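
A minimal sketch of how a crawler might build the request for the links 
view and read the answer. The class and method names are hypothetical. 
The cocoon-view parameter name follows the crawler description in the 
bill of materials; the one-link-per-line response format is an 
assumption, not something this mail specifies.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: hypothetical helper, not part of the submitted components. */
final class LinkViewSketch {
    // Request parameter used to select a view (see the crawler description).
    static final String VIEW_PARAM = "cocoon-view";

    /** Appends the view parameter, respecting an existing query string. */
    static String linkViewUri(String uri, String viewName) {
        String separator = uri.contains("?") ? "&" : "?";
        return uri + separator + VIEW_PARAM + "=" + viewName;
    }

    /** Splits a response body into links, ASSUMING one link per line. */
    static List<String> parseLinks(String body) {
        List<String> links = new ArrayList<>();
        for (String line : body.split("\\r?\\n")) {
            String link = line.trim();
            if (!link.isEmpty()) {
                links.add(link);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(linkViewUri(
                "http://localhost:8080/cocoon/documents/hosting.html", "links"));
        System.out.println(parseLinks("hosting.html\nmail-archives.html\n"));
    }
}
```

The sketch ignores URI fragments and does not check the response content 
type; a real crawler would probably want to verify 
application/x-cocoon-links before trusting the body.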

The indexing component crawls in depth, starting from a given base URI. 
The indexing component uses a crawler component to receive all links of 
a page. The indexing component filters the response of the crawler.
Filtering enforces the following conditions:
Index only resources which have not been indexed already.
Index only resources which are indexable, like documents; ignore images 
and non-XML documents.
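
The two filtering conditions above can be sketched as a small traversal. 
This is an illustration under assumptions, not the submitted code: 
linksOf stands in for the crawler component and indexable for whatever 
check decides that a resource is an XML document; both are hypothetical 
placeholders.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;

/** Sketch only: hypothetical traversal, not the submitted indexer. */
final class CrawlFilterSketch {

    /** Returns every indexable URI reachable from baseUri, each one once. */
    static List<String> collectIndexable(String baseUri,
                                         Function<String, List<String>> linksOf,
                                         Predicate<String> indexable) {
        Set<String> seen = new HashSet<>();
        List<String> toIndex = new ArrayList<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(baseUri);
        while (!pending.isEmpty()) {
            String uri = pending.pop();
            if (!seen.add(uri)) {
                continue; // condition 1: already handled
            }
            if (!indexable.test(uri)) {
                continue; // condition 2: skip images and non-XML resources
            }
            toIndex.add(uri);
            // Only indexable resources are asked for further links.
            for (String link : linksOf.apply(uri)) {
                if (!seen.contains(link)) {
                    pending.push(link);
                }
            }
        }
        return toIndex;
    }
}
```

Using a stack gives a depth-first walk; swapping it for a queue would 
visit the same set of resources in breadth-first order.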

Indexing parses an XML document, and produces a lucene document. A 
lucene document may have several fields, which act like columns of a 
database table.
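
To illustrate the idea of fields, here is a sketch that turns an XML 
document into a name/value map with the JDK SAX parser. Mapping each 
element name to the text directly inside it is only one plausible 
choice, made up for this example; the actual mapping used by the 
indexing components is not described in this mail. No Lucene classes 
are used here.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch only: hypothetical element-name-to-text mapping. */
final class FieldExtractionSketch {

    /** Maps each element name to the normalized text directly inside it. */
    static Map<String, String> extractFields(String xml) {
        final Map<String, StringBuilder> raw = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            private final Deque<String> open = new ArrayDeque<>();

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                open.push(qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (!open.isEmpty()) {
                    raw.computeIfAbsent(open.peek(), k -> new StringBuilder())
                       .append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                StringBuilder text = raw.get(open.pop());
                if (text != null) {
                    text.append(' '); // separate text of repeated elements
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("not parseable as XML", e);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> entry : raw.entrySet()) {
            String value = entry.getValue().toString().trim().replaceAll("\\s+", " ");
            if (!value.isEmpty()) {
                fields.put(entry.getKey(), value);
            }
        }
        return fields;
    }
}
```

Each entry of the resulting map would correspond to one field of the 
document, in the same way a row has one value per column.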

Indexing writes the lucene index into a directory; by default the Cocoon 
working directory is used. In addition, a lucene analyzer and the lucene 
writing mode must be defined.

The searching component uses an existing lucene index. The index may be 
created by any lucene indexer.
The searching component must have access to an index directory, and it 
should use the same lucene analyzer as the indexer used when the index 
directory was created.
The searching component returns all hits of a search; the XSP and the 
generator reduce these to the hits displayed on a page.
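
Reducing all hits to the hits of one page could look roughly like this. 
The names are hypothetical, and the sketch assumes a positive page 
length and a start index counted from 0.

```java
import java.util.List;

/** Sketch only: hypothetical paging helper. */
final class HitPagingSketch {

    /** Returns the hits shown on the page that starts at startIndex. */
    static <T> List<T> pageOf(List<T> allHits, int startIndex, int pageLength) {
        // Clamp both bounds so an out-of-range start yields an empty page.
        int from = Math.max(0, Math.min(startIndex, allHits.size()));
        int to = Math.min(allHits.size(), from + pageLength);
        return allHits.subList(from, to);
    }
}
```

The returned list is a view of the original hit list, so nothing is 
copied for a page.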

The search generator searches the lucene index by using the searching 
components, and generates XML content.
A sample of the XML content produced by the search generator:

<?xml version="1.0" encoding="UTF-8"?>
<search:results date="1008437081064" query-string="cocoon"
  start-index="0" page-length="10"
  xmlns:search="http://apache.org/cocoon/search/1.0"
  xmlns:xlink="http://www.w3.org/1999/xlink">
  <search:hits total-count="125" count-of-pages="13">
    <search:hit rank="0" score="1.0"
      uri="http://localhost:8080/cocoon/documents/hosting.html"/>
    <search:hit rank="1" score="1.0"
      uri="http://localhost:8080/cocoon/documents/hosting.html"/>
    <search:hit rank="2" score="1.0"
      uri="http://localhost:8080/cocoon/documents/hosting.html"/>
    <search:hit rank="3" score="0.93121004"
      uri="http://localhost:8080/cocoon/documents/userdocs/actions/actions.html"/>
    <search:hit rank="4" score="0.93121004"
      uri="http://localhost:8080/cocoon/documents/userdocs/actions/actions.html"/>
    <search:hit rank="5" score="0.7112235"
      uri="http://localhost:8080/cocoon/documents/mail-archives.html"/>
    <search:hit rank="6" score="0.70967746"
      uri="http://localhost:8080/cocoon/documents/userdocs/serializers/link-serializer.html"/>
    <search:hit rank="7" score="0.6881721"
      uri="http://localhost:8080/cocoon/documents/userdocs/serializers/text-serializer.html"/>
    <search:hit rank="8" score="0.6881721"
      uri="http://localhost:8080/cocoon/documents/userdocs/serializers/vrml-serializer.html"/>
    <search:hit rank="9" score="0.6666666"
      uri="http://localhost:8080/cocoon/documents/userdocs/serializers/svgpng-serializer.html"/>
  </search:hits>
  <search:navigation total-count="125" count-of-pages="13"
    has-next="true" has-previous="false" next-index="10" previous-index="0">
    <search:navigation-page start-index="0"/>
    <search:navigation-page start-index="10"/>
    <search:navigation-page start-index="20"/>
    <search:navigation-page start-index="30"/>
    <search:navigation-page start-index="40"/>
    <search:navigation-page start-index="50"/>
    <search:navigation-page start-index="60"/>
    <search:navigation-page start-index="70"/>
    <search:navigation-page start-index="80"/>
    <search:navigation-page start-index="90"/>
    <search:navigation-page start-index="100"/>
    <search:navigation-page start-index="110"/>
    <search:navigation-page start-index="120"/>
  </search:navigation>
</search:results>

The navigation elements make it easy to handle navigation in an XSLT 
stylesheet.
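
The navigation attributes in the sample can be derived from three 
numbers: the total hit count, the start index and the page length. The 
following sketch shows one way to compute them. The class is 
hypothetical, and the value chosen for nextIndex when there is no next 
page is an assumption, since the sample only shows the first page.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: hypothetical holder for the navigation values. */
final class NavigationSketch {
    final int countOfPages;
    final boolean hasNext;
    final boolean hasPrevious;
    final int nextIndex;
    final int previousIndex;
    final List<Integer> pageStartIndexes = new ArrayList<>();

    NavigationSketch(int totalCount, int startIndex, int pageLength) {
        // Integer ceiling of totalCount / pageLength.
        countOfPages = (totalCount + pageLength - 1) / pageLength;
        hasNext = startIndex + pageLength < totalCount;
        hasPrevious = startIndex > 0;
        // Assumption: stay on the current index when there is no next page.
        nextIndex = hasNext ? startIndex + pageLength : startIndex;
        previousIndex = Math.max(0, startIndex - pageLength);
        // One entry per search:navigation-page element.
        for (int i = 0; i < totalCount; i += pageLength) {
            pageStartIndexes.add(i);
        }
    }
}
```

With the values of the sample (125 hits, start index 0, page length 10) 
this gives 13 pages, has-next true, has-previous false, next-index 10, 
previous-index 0, and page start indexes 0, 10, ..., 120.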

Bill of Materials:

New packages:
org.apache.cocoon.components.crawler,
org.apache.cocoon.components.search

New avalon components:
org.apache.cocoon.components.crawler.CocoonCrawler
org.apache.cocoon.components.crawler.SimpleCocoonCrawlerImpl:
  external http crawler for Cocoon. This crawler generates a list of links
  received from a URI request, enhancing it with a cocoon-view query.

org.apache.cocoon.components.search.IndexHelperField
org.apache.cocoon.components.search.LuceneCocoonHelper
org.apache.cocoon.components.search.LuceneCocoonIndexer
org.apache.cocoon.components.search.LuceneCocoonPager
org.apache.cocoon.components.search.LuceneCocoonSearcher
org.apache.cocoon.components.search.LuceneIndexContentHandler
org.apache.cocoon.components.search.LuceneXMLIndexer
org.apache.cocoon.components.search.SimpleLuceneCocoonIndexerImpl
org.apache.cocoon.components.search.SimpleLuceneCocoonSearcherImpl
org.apache.cocoon.components.search.SimpleLuceneXMLIndexerImpl

New sitemap components:
org.apache.cocoon.generation.SearchGenerator

New JUnit testcase:
org.apache.cocoon.generation.test.SearchGeneratorTestCase

New webapp resources:
sitemap.xmap
search-index.xsp
welcome-index.xsp
create-index.xsp
stylesheets/search2html.xsl
lucene_green_300.gif

Compiling & Installing:

For compiling, and at runtime, a lucene.jar is necessary. This requires 
changing build.xml as well, to check for its availability, and modifying 
the webapp sitemap to include the search demo.

Installing the avalon components requires changing the cocoon.xconf 
file, inserting the avalon components
org.apache.cocoon.components.search.LuceneXMLIndexer
org.apache.cocoon.components.search.SimpleLuceneCocoonIndexerImpl
org.apache.cocoon.components.search.SimpleLuceneCocoonSearcherImpl
org.apache.cocoon.components.search.SimpleLuceneXMLIndexerImpl.

A sitemap or subsitemap has to be adapted to use the XSP and the generator.


bye bernhad


