lucene-java-user mailing list archives
From Karl Øie <k...@gan.no>
Subject Re: indexing performance of little documents
Date Fri, 01 Apr 2005 13:47:37 GMT
This might sound a bit lame, but it has worked for me. I have had the 
same problem, where the sheer number of small Lucene documents slows down 
the building of large indexes.

Search is pretty fast, and read-only, so in my case I just created 
three indexes and spread the Lucene documents across them, one to each 
index in turn. Then, upon a search, I merge the results from the three 
smaller indexes. The only thing to watch is to store all parts of a source 
document in the same index so that boolean queries still work. I have even 
threaded out the searching, so the searches on the three indexes are 
performed in parallel.
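
Roughly, the setup looks something like this (just a sketch against the 
1.4-style API; the index paths, the "line" field name and the plain 
round-robin split are only for illustration, not exactly what I run):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class ShardedIndex {

    static final String[] PATHS = { "/tmp/idx0", "/tmp/idx1", "/tmp/idx2" };

    // Spread the small documents over three indexes, round-robin.
    public static void index(String[] lines) throws Exception {
        IndexWriter[] writers = new IndexWriter[PATHS.length];
        for (int i = 0; i < writers.length; i++) {
            writers[i] = new IndexWriter(PATHS[i], new StandardAnalyzer(), true);
        }
        for (int n = 0; n < lines.length; n++) {
            Document doc = new Document();
            doc.add(Field.Text("line", lines[n]));        // 1.4-style field factory
            // note: in real code, keep all parts of one source document in the same index
            writers[n % writers.length].addDocument(doc);
        }
        for (int i = 0; i < writers.length; i++) {
            writers[i].optimize();
            writers[i].close();
        }
    }

    // Search all three indexes in parallel and get one merged Hits back.
    public static Hits search(String queryString) throws Exception {
        Searchable[] searchers = new Searchable[PATHS.length];
        for (int i = 0; i < PATHS.length; i++) {
            searchers[i] = new IndexSearcher(PATHS[i]);
        }
        Query query = QueryParser.parse(queryString, "line", new StandardAnalyzer());
        return new ParallelMultiSearcher(searchers).search(query);
    }
}

The ParallelMultiSearcher takes care of merging the hits from the 
sub-searchers into one result set.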

By the way: stop word filters can also do wonders for an index full of 
text...
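
For example, you can swap the analyzer in the sketch above for one built 
with a stop word list (the word list below is only an example):

// StandardAnalyzer accepts a custom stop word list (words here are just examples).
String[] stopWords = { "the", "a", "an", "and", "or", "of", "to" };
StandardAnalyzer analyzer = new StandardAnalyzer(stopWords);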

Mvh Karl Øie

On 1. apr. 2005, at 11.43, Fabien Le Floc'h wrote:

> Hello,
>
> I want to index a 1 GB file that contains a list of lines of
> approximately 100 characters each, so that I can later retrieve lines
> containing some particular text. The natural way of doing it with Lucene
> would be to create one Lucene Document per line. It works well, except it
> is too slow for my needs, even after tweaking all possible parameters of
> IndexWriter and using the CVS version of Lucene.
>
> I can get 10x the indexing performance by indexing the file as one Lucene
> Document. Lucene builds a good index with all the terms, and I am able to
> get the number of terms matching a query, but not the absolute position
> in the original file (I only get the relative token position). A minor
> quirk with this approach is that I need to split the document in order
> to avoid an out-of-memory exception when the document is too big. It would
> probably be possible for me to customize Lucene for my needs (create a more
> flexible Term class), but that would just be a hack. So I was wondering why
> there should be such a performance difference.
>
> I see that plenty of work is done for each document, but that seems
> necessary, and then there is even more work while merging segments.
> Things could probably be faster if documents were first aggregated and
> the work then done on them, but I think this would imply huge changes in
> Lucene. Any advice for indexing millions of tiny docs?
>
>
>
> Regards,
>
> Fabien.
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

