cocoon-dev mailing list archives

From "Bernhard Huber" <berni_hu...@a1.net>
Subject Re: Ant: Re: Adding XML searching with Lucene
Date Sun, 09 Dec 2001 17:02:28 GMT
Hi,

Using the Avalon components might help to speed up searching, as I
changed the classes to Recyclable and fixed a bug in the
IndexReaderCache that was giving me a TooManyOpenedFiles exception.
As there will be a lot of clients searching, it is important to have
a fast search, hence:
The IndexReader is like a JDBC connection: pooling it would speed
things up. It is only necessary to recreate the IndexReader when the
index changes.
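
A minimal sketch of that reuse idea, assuming nothing about the real IndexReaderCache: the class name CachedReader, the opener and the indexVersion stamp are all hypothetical, and R stands in for Lucene's IndexReader so the example needs only the JDK.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/**
 * Hypothetical sketch: keeps one shared reader and reopens it only when the
 * index version stamp changes. R stands in for Lucene's IndexReader.
 */
final class CachedReader<R extends Closeable> {
    private final Supplier<R> opener;         // assumed helper that opens a fresh reader
    private final LongSupplier indexVersion;  // assumed stamp, e.g. last-modified time of the index
    private R reader;
    private long openedVersion;

    CachedReader(Supplier<R> opener, LongSupplier indexVersion) {
        this.opener = opener;
        this.indexVersion = indexVersion;
    }

    /** Returns the cached reader, reopening it first if the index has changed. */
    synchronized R get() {
        long current = indexVersion.getAsLong();
        if (reader == null || current != openedVersion) {
            if (reader != null) {
                try {
                    // Closing the stale reader releases its file handles.
                    // A real cache would wait until no search still uses it.
                    reader.close();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
            reader = opener.get();
            openedVersion = current;
        }
        return reader;
    }
}
```

Closing stale readers promptly is what keeps the number of open files bounded; the sketch leaves out the reference counting that concurrent searches would need.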

>Why don't you throw in your skeleton ideas here and we discuss then in
>the open? 
>
Okay, perhaps I have misunderstood something.

>>* I will implement some paging for the search result, if there are too
>>many search results to display on a single page.
>>
>
>Yep, this is a must do.
>
I have done this, but it still uses the old package names.

I added a LuceneCocoonPager (I know the names...) class, doing the hits 
per page calculation, and wrapping the Hits class. You will find it in 
the attachment plus the modified searchindex.xsp.
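
As a rough illustration of the hits-per-page calculation (this is not the attached LuceneCocoonPager; the class SimplePager is hypothetical and wraps a plain List instead of Lucene's Hits):

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of paging over a result list; not the attached class. */
final class SimplePager<T> {
    private final List<T> hits;
    private final int hitsPerPage;

    SimplePager(List<T> hits, int hitsPerPage) {
        if (hitsPerPage <= 0) {
            throw new IllegalArgumentException("hitsPerPage must be positive");
        }
        this.hits = hits;
        this.hitsPerPage = hitsPerPage;
    }

    /** Number of pages, rounding up so a partial last page counts. */
    int pageCount() {
        return (int) ((hits.size() + (long) hitsPerPage - 1) / hitsPerPage);
    }

    /** Hits of the zero-based page, or an empty list if the page is out of range. */
    List<T> page(int pageIndex) {
        long from = (long) pageIndex * hitsPerPage;
        if (pageIndex < 0 || from >= hits.size()) {
            return Collections.emptyList();
        }
        int to = (int) Math.min(from + hitsPerPage, hits.size());
        return hits.subList((int) from, to);
    }
}
```

With 25 hits and 10 hits per page this yields three pages, the last one holding 5 hits.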

If searchindex.xsp stays, I'd like to have some XSP stylesheet for
rendering the paging stuff.
Is there an easy way to avoid declaring the logicsheet in
cocoon.xconf? During development I'd like to declare the logicsheet
inside the XSP itself.

This paging stuff should go into
org.apache.cocoon.generator.SearchGenerator, too.
This way the generator is able to generate only the search results
that will be displayed.


>>* I will study the Main class for the internal crawling..
>>
>
>Great
>
Okay, I got an overview using the environment.commandline.* classes.
Now I have a question about crawling & indexing:

As it is now, I have an XSP to trigger the crawling & indexing. It
uses HTTP URLs to access the XML content for indexing.
To speed this up, I see the following possibilities:

First, still staying in a servlet-context environment:
* For Servlet 2.3 something like this might work:
RequestDispatcher rd = servletContext.getRequestDispatcher(
"/cocoon/documents/index.html?cocoon-view=content" );
rd.include( new_request_wrapper, new_response_wrapper );
new_response_wrapper should then hold the XML content.

For Cocoon in Servlet 2.2 and higher:
I want to access the Cocoon instance of the current servlet context. I
don't want to create another Cocoon instance, for the sake of
performance and memory consumption.

If I have to create a new Cocoon instance, I see the following choices:

* create a Cocoon instance like org.apache.cocoon.Main does, and try to
grab the same configs as the servlet-engine Cocoon instance. How
could I make sure I get the right configs?
* create a Cocoon instance simulating a servlet environment.
Can you give some hints about which solution is easiest to implement?

For command-line-only crawling and indexing I see the following choices:
* Implement something like org.apache.cocoon.Main for the crawling
and indexing. Here, too, I would grab the same config as the
servlet-engine Cocoon instance.
* Additionally, add an Ant wrapper:
<taskdef name="cocoon-index" 
classname="org.apache.cocoon.optional.ant.CocoonIndexTask"/>
<cocoon-index
  index-directory="/a/c/index"
  create="yes"
  analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer"
  uri="index.html"
  contextDir="${build.context}"
  destDir="${build.dir}/ant-test/docs"
  workDir="${build.dir}/ant-test/work"
  logLevel="INFO">
</cocoon-index>

* Should there be a Cocoon Ant datatype to make it easier to create a
Cocoon instance? For example:
  <cocoon-index
    index-directory="/a/c/index"
    create="yes"
    analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer"
    uri="index.html">
  <cocoon
    contextDir="${build.context}"
    destDir="${build.dir}/ant-test/docs"
    workDir="${build.dir}/ant-test/work"
    logLevel="INFO"/>
  </cocoon-index>
 
* Speaking of Ant wrappers: I implemented an Ant wrapper for the Main
class by extending the Ant Java task. It works fine, calling
Main.main() in a forked JVM and thus creating the Cocoon documents:
...
    <taskdef name="cocoon" 
classname="org.apache.cocoon.optional.ant.CocoonJavaTask">
      <classpath>
        <path refid="classpath"/>
      </classpath>
    </taskdef>
   
    <cocoon
      contextDir="${build.context}"
      destDir="${build.dir}/ant-test/docs"
      workDir="${build.dir}/ant-test/work"
      logLevel="INFO"
      uri="index.html"
    >
      <classpath>
        <path refid="classpath"/>
      </classpath>
    </cocoon>
...
But calling it with fork=false failed with a ClassNotFoundException.
Now I wonder how the servlet engine has solved this...

* With command-line or Ant-wrapped indexing and crawling in place, the
last open issue is how to invoke it via some timer service. Some
application servers like WLS offer that, and I think there is a cron
service in the Avalon system. Does it make sense to add the Avalon
cron service to a simple servlet engine?
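
For comparison only, periodic re-indexing can also be sketched with the plain JDK scheduler instead of a cron service. This is not the Avalon service; the class IndexScheduler and the indexJob callback are hypothetical stand-ins for whatever triggers the crawl.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: runs an indexing job repeatedly on one background thread. */
final class IndexScheduler {
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();

    /** Runs indexJob immediately, then again after each pause of the given length. */
    ScheduledFuture<?> schedule(Runnable indexJob, long pause, TimeUnit unit) {
        Runnable guarded = () -> {
            try {
                indexJob.run();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all later runs, so log and go on.
                System.err.println("indexing run failed: " + e);
            }
        };
        return executor.scheduleWithFixedDelay(guarded, 0, pause, unit);
    }

    /** Stops the background thread, e.g. when the servlet context is destroyed. */
    void shutdown() {
        executor.shutdownNow();
    }
}
```

A fixed delay between runs, rather than a fixed rate, avoids overlapping crawls when one run takes longer than the interval.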

>searching for 'cocoon' would result in something like:
>
> <search:results>
>  <search:hit rank="1" score="89%" uri="...">
>   <xhtml:p>
>    <search:highlight>Cocoon</search:highlight> now offers semantic <search:highlight>search</search:highlight>
>   </xhtml:p>
>  </search:hit>
>  ...
> </search:results>
>
>As you can see, this also includes part of the "context" where the
>textual information is found. This follows the Google model and I think
>it would be a *great* feature to have.
>
This is possible if you change the Lucene API a bit.
There was a posting on the Lucene mailing list regarding highlighting;
I don't know the current state of that proposal. Anyway, highlighting
needs some changes to the Lucene API; I have modified "my" Lucene to
be able to do highlighting.

Moreover, if you want something like highlighting, the question is
whether the summary should be stored in the index too, or whether we
should ask for the Cocoon view again at search time to get the summary.

I have implemented the LuceneIndexContentHandler to generate only
non-stored fields: the body field and all the element and attribute
fields are indexed but not stored.
Adding a summary might make it worthwhile to store the body field.
But what about <s1 title="Introduction">? The "Introduction" is not
contained in the body field.
How should we summarize this?
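
To make the trade-off concrete: once the body text is available at search time (stored in the index or fetched again), the markup itself is simple to produce without touching the Lucene API. The sketch below assumes exactly that; the class Highlighter is hypothetical, and it only wraps literal, case-insensitive matches of one term in the search:highlight element from the quoted example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: wraps literal matches of a term in search:highlight tags. */
final class Highlighter {
    private Highlighter() {
    }

    static String highlight(String text, String term) {
        if (term.isEmpty()) {
            return escape(text); // nothing to highlight
        }
        Pattern pattern = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            // Copy the text before the match, then the match inside the tag.
            out.append(escape(text.substring(last, matcher.start())));
            out.append("<search:highlight>")
               .append(escape(matcher.group()))
               .append("</search:highlight>");
            last = matcher.end();
        }
        out.append(escape(text.substring(last)));
        return out.toString();
    }

    /** Escapes the characters that would break the surrounding XML. */
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

This does not use the analyzer, so stemmed or tokenized matches would not be highlighted. Attribute text such as the s1 title would also have to be added to the stored text for it to show up in a summary.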

>
>But this requires more thinking, I'd say let's ignore it for now, so you
>can come up with
>
> <search:results xmlns:search="http://apache.org/cocoon/search/1.0">
>  <search:hit rank="1" score="89%" uri="..."/>
>  ...
> </search:results>
>
>which is good enough for now but could be easily improved later on.
>
bye bernhard

