forrest-user mailing list archives

From Ross Gardler <rgard...@apache.org>
Subject Re: Forrest-Lucene raw files search
Date Fri, 25 Nov 2005 15:22:13 GMT
Karthik Manimaran wrote:
> Hi,
>  
> I followed the following approach to make the raw files searchable using 
> Lucene.

Thanks for this info. The problem I see with this solution is that it 
relies on external scripts and programs to generate the data. It would 
be better to have Forrest itself generate the necessary indexes. 
How about something like this:

> Forrest uses site.xml to pass the documents to the Lucene index 
> transformer. site.xml will not have the list of all the raw files as 
> entries. In my case I wanted javadocs for a component library to be 
> placed as raw HTML files and be searchable. Hence updating site.xml 
> every time the raw HTML files change is out of the question. Hence a new 
> file site-lucene.xml that contains both site.xml and entries 
> corresponding to all the raw HTML files was created. Steps are as follows:
>  
> 1. Write a batch file (UpdateLuceneSearchList.bat) that gets the 
> recursive list of all the HTML files and writes it to a file jupd.txt. 
> Place it in the root of the folder containing the raw HTML files.
> Contents of UpdateLuceneSearchList.bat >>
> dir *.htm* /n /b /s >jupd.txt

Replace this with a sitemap entry that uses the directoryGenerator [1] 
to create an XML list of the raw files you want to index.
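
A minimal sketch of such a sitemap entry, assuming Cocoon's directory 
generator is declared in the sitemap components under the name 
"directory". The match pattern "raw-files.xml" and the source path are 
hypothetical placeholders, and the depth value is only an example:

```xml
<!-- Hypothetical pipeline: lists the raw files as XML. -->
<map:match pattern="raw-files.xml">
  <!-- src is a placeholder; point it at the folder holding the raw HTML. -->
  <map:generate type="directory" src="path/to/raw/html">
    <!-- How many directory levels to descend (example value). -->
    <map:parameter name="depth" value="10"/>
  </map:generate>
  <map:serialize type="xml"/>
</map:match>
```

The generator emits nested dir:directory and dir:file elements in the 
http://apache.org/cocoon/directory/2.0 namespace, one per directory and 
file found, so no batch file or jupd.txt is needed.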

> 2. Write a java program that takes site.xml and jupd.txt and produces a 
> new xml file site-lucene.xml. Source attached.

Replace this with a pipeline that aggregates the above XML with site.xml.
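
Roughly, and only as a sketch: the pattern names, the path to site.xml 
and the stylesheet name below are all assumptions, and 
"cocoon:/raw-files.xml" stands for whatever internal pipeline produces 
the directory listing from step 1.

```xml
<!-- Hypothetical pipeline: combines the file listing with site.xml. -->
<map:match pattern="site-lucene.xml">
  <!-- Wrap both parts in a single (hypothetical) root element. -->
  <map:aggregate element="site-lucene">
    <!-- Placeholder path to the project's site.xml. -->
    <map:part src="path/to/site.xml"/>
    <!-- Internal request for the directory listing pipeline. -->
    <map:part src="cocoon:/raw-files.xml"/>
  </map:aggregate>
  <!-- Hypothetical stylesheet that turns the listing into site entries. -->
  <map:transform src="path/to/raw-files-to-site.xsl"/>
  <map:serialize type="xml"/>
</map:match>
```

The aggregate only concatenates the two documents under one root, so a 
transformation step is still needed to turn the file listing into 
entries that look like those in site.xml.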

> 3. Update search.xmap to enable our new site-lucene.xml to be used to 
> obtain the input

This step stays the same.

> 4. Add an entry for abs-linkmap-lucene to the pipeline in linkmap.xmap

This step stays the same.

> 5. Comment the following lines in site2book.xsl (as we generate the tags 
> in site-lucene.xml without labels)
> <!--
>       <xsl:when test="not(@label)">
>       </xsl:when>
> -->

This is a bad idea. Those entries are there for a reason, and commenting 
them out will affect the "normal" use of site2book.xsl in some sites 
(i.e. ones with site entries without labels).

Instead, give each entry in site-lucene.xml a label.
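
One way to get labels, sketched as an XSLT that converts a directory 
generator listing into labelled entries. The element names raw-files 
and raw-file are hypothetical, the file name is used as the label only 
for illustration, and the output namespace would have to be aligned 
with whatever your site.xml uses:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:dir="http://apache.org/cocoon/directory/2.0"
    exclude-result-prefixes="dir">

  <!-- Wrap all generated entries in one hypothetical container element. -->
  <xsl:template match="/">
    <raw-files label="Raw files">
      <xsl:apply-templates select="//dir:file"/>
    </raw-files>
  </xsl:template>

  <!-- One labelled entry per file found by the directory generator. -->
  <xsl:template match="dir:file">
    <!-- Rebuild the relative path from the enclosing directories,
         skipping the top-level directory that was listed. -->
    <xsl:variable name="path">
      <xsl:for-each select="ancestor::dir:directory[parent::dir:directory]">
        <xsl:value-of select="@name"/>
        <xsl:text>/</xsl:text>
      </xsl:for-each>
    </xsl:variable>
    <raw-file label="{@name}" href="{$path}{@name}"/>
  </xsl:template>
</xsl:stylesheet>
```

Because every generated entry carries a label attribute, site2book.xsl 
can stay untouched.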

> 6. Create a batch file that calls UpdateLuceneSearchList.bat and 
> executes the java program to update the index.

...

> This batch file can be scheduled to call every time there are updates to 
> the raw files to keep the index updated. If this is of any help and the 
> search related info on Forrest documentation could be updated, will be 
> glad to do so.

This step is no longer needed, as the site-lucene.xml file would now be 
generated dynamically when required.

If you decide to implement this, patches are welcome. If you need some 
more pointers, we'll do our best.

Ross
