lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Indexing large sets of documents?
Date Thu, 27 Jul 2006 19:41:02 GMT
Michael, 
 
Certainly parallelizing on a set of servers would work (hmm... hadoop?), but if you want to
do this on a single machine you should tune some of the IndexWriter params.  You didn't mention
them, so I assume you didn't tune anything yet.  If you have Lucene in Action, check out
 2.7.1  : Tuning indexing performance        starts on page 42  under section 2.7 (Controlling
the indexing process) in chapter 2 (Indexing)  
 (found via: http://lucenebook.com/search?query=index+tuning )

If not, check maxBufferedDocs and mergeFactor in IndexWriter javadocs.  This is likely in
the FAQ, too, but I didn't check.

Otis

----- Original Message ---- 
From: Michael J. Prichard  
To: java-user@lucene.apache.org 
Sent: Thursday, July 27, 2006 12:29:31 PM 
Subject: Indexing large sets of documents? 
 
I built an indexer that runs through email and its attachments, rips out  
content and what not and then creates a Document and adds it to an  
index.  It works w/ no problem.  The issue is that it takes around 3-5  
seconds per email and I have seen up to 10-15 seconds for email w/  
attachments.  I need to index 750k emails and at those times it will  
take FOREVER!  I am trying to find places to cut a second or two here or  
there but are there any suggestions as to what I can do?  Should I look  
into parallelizing indexing?  Help?! 
 
Thanks, 
Michael 
 
--------------------------------------------------------------------- 
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org 
For additional commands, e-mail: java-user-help@lucene.apache.org 
 
 
 



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message