cocoon-users mailing list archives

From Jeremy Quinn
Subject Re: Lucene index building
Date Tue, 18 Mar 2003 10:32:39 GMT

On Monday, March 17, 2003, at 11:47 PM, Upayavira wrote:

> I have built a site which I want to index with Lucene.
> I am using the create-index.xsp file in the $COCOON-ROOT/search
> directory to build my index.
> I have added the following to cocoon.xconf:
>   <cocoon-crawler logger="">
>     <exclude>.*/search/.*</exclude>
>     <link-view-query>cocoon-view=lucene-links</link-view-query>
>   </cocoon-crawler>
>   <lucene-xml-indexer logger="">
>     <store-fields>body</store-fields>
>     <content-view-query>cocoon-view=lucene-content</content-view-query>
>   </lucene-xml-indexer>

This all looks fine.
My exclude string looks like this, though:


I believe that as soon as you specify an exclude string, the default
exclude values (for images etc.) are no longer used.

> I've set up a view lucene-links which works, giving back just links  
> from a page.
> I've set up a view lucene-content just giving back the content. The  
> content is like:
> <page>
>   <links>....list of links</links>
>   <body>... the body content ...</body>
> </page>
> I have had it partially working (indexing both links and body), but
> now whenever I run create-index, it fails with a
> Cannot parse!: org.xml.sax.SAXParseException: Premature end of file.
> Any ideas what I might be doing wrong?

I had problems like this; they turned out to be caused by pages that did
not return valid XML. Look in your logs to see if indexing stops on a
particular page.
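
If the logs do not make the culprit obvious, one option is to request each
page's content view yourself and test whether it parses. The following is
only a rough sketch in Python, not part of Cocoon: the function names
fetch and find_malformed are made up for illustration, and the URLs you
pass in are whatever your own site serves.

```python
import urllib.request
import xml.etree.ElementTree as ET


def fetch(url):
    """Download one page and return its raw bytes (needs a running server)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def find_malformed(pages):
    """Map each URL whose body is not well-formed XML to the parser's message.

    `pages` maps a URL to the bytes returned for it (hypothetical input shape).
    """
    problems = {}
    for url, body in pages.items():
        try:
            ET.fromstring(body)
        except ET.ParseError as exc:
            # An empty or truncated response ends up here too.
            problems[url] = str(exc)
    return problems
```

With a list of page URLs in urls, something like
find_malformed({u: fetch(u + "?cocoon-view=lucene-content") for u in urls})
would list the pages whose content view is not well-formed. An empty
response is reported as well, which is the kind of page I would expect to
produce a "Premature end of file" error.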

I also found that I could overcome the need to provide more memory by
stripping unneeded tags from the 'content' XML being indexed.

My content for indexing looks like this:

	<title>title gets stored, then displayed with hit</title>
	<summary>summary gets stored, then displayed with hit</summary>
	all of my body content with tags stripped out
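
To illustrate the idea of flattening the body, here is a small sketch in
Python. It is not how my site does it (in Cocoon you would normally do
this in the pipeline that serves the content view); the function name
slim_for_indexing and the page/body element names are assumptions,
borrowed from the structure quoted earlier in this thread.

```python
import xml.etree.ElementTree as ET


def slim_for_indexing(xml_text, keep=("title", "summary"),
                      body_tag="body", root_tag="page"):
    """Return a reduced document: the kept fields plus the body as plain text.

    All element names here are assumptions for illustration only.
    """
    src = ET.fromstring(xml_text)
    out = ET.Element(root_tag)
    for name in keep:
        found = src.find(".//" + name)
        if found is not None:
            ET.SubElement(out, name).text = "".join(found.itertext()).strip()
    body = src.find(".//" + body_tag)
    if body is not None:
        # Join text nodes with spaces so words from adjacent block elements
        # do not run together, then collapse runs of whitespace.
        text = " ".join(" ".join(body.itertext()).split())
        ET.SubElement(out, body_tag).text = text
    return ET.tostring(out, encoding="unicode")
```

Joining with spaces can split a word that is interrupted by inline markup,
which seemed an acceptable trade-off for a search index in this sketch.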

Hope this helps

regards Jeremy
