lucene-solr-user mailing list archives

From "Burton-West, Tom" <tburt...@umich.edu>
Subject RE: indexing best practices
Date Mon, 19 Jul 2010 19:43:36 GMT
Hi Ken,

This is all very dependent on your documents, your indexing setup and your hardware. Just
as an extreme data point, I'll describe our experience.  

We run 5 clients on each of 6 machines to send documents to Solr using the standard HTTP XML
update process.  Our documents contain about 10 fields, but one field contains the OCR for the
full text of a book.  The documents are about 700KB in size.
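
For readers who have not used the XML update format, here is a minimal sketch of how one such
document could be wrapped in Solr's <add><doc> update message. The field names ("id", "title",
"ocr") and the values are assumptions made up for illustration; they are not the schema
described in this message.

```python
import xml.etree.ElementTree as ET


def build_add_message(fields):
    """Wrap one document's fields in a Solr XML <add><doc> update message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes &, < and > when serializing
    return ET.tostring(add, encoding="unicode")


# Hypothetical document: field names and values are illustrative only.
message = build_add_message({
    "id": "book-0001",
    "title": "An Example Title",
    "ocr": "full OCR text of the book goes here",
})
print(message)
```

The resulting string would be POSTed to a shard's update handler with a text/xml content type.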

Each client sends Solr documents to one of 10 Solr shards on a round-robin basis.  We are
running 5 shards on each of two dedicated indexing machines, each with 144GB of memory and
2 x Quad Core Intel Xeon E5540 2.53GHz processors (Nehalem).  What we generally see is that
once the index gets large enough for significant merging, our producers can send documents
to Solr faster than it can index them.
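
The round-robin distribution described above could be sketched roughly as follows. The shard
URLs, host names and document ids are hypothetical placeholders, and this is only an
illustration of the rotation, not the actual client code.

```python
from itertools import cycle

# Hypothetical update URLs: 10 shards, 5 on each of two indexing machines.
SHARD_URLS = [
    f"http://indexer{1 + i // 5}:8983/solr/shard{i}/update" for i in range(10)
]


def assign_round_robin(doc_ids, shard_urls):
    """Pair each document id with the next shard URL in rotation."""
    shards = cycle(shard_urls)
    return [(doc_id, next(shards)) for doc_id in doc_ids]


# Illustrative document ids only.
assignments = assign_round_robin([f"doc{n}" for n in range(25)], SHARD_URLS)
for doc_id, url in assignments[:3]:
    print(doc_id, "->", url)
```

Each client would keep its own rotation, so over many documents every shard receives roughly
the same number of documents.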

We suspect that our bottleneck is simply disk I/O for index merging on the Solr build machines.
We are currently experimenting with changing the ramBufferSizeMB setting and various merge
policies/merge factors to see if we can speed up the Solr end of the indexing process.  Since
we optimize our index down to two segments, we are also planning to experiment with using
the "nomerge" merge policy. I hope to have some results to report on our blog sometime in
the next month or so.
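
To show why merge settings matter for disk I/O, here is a toy model of level-based merging:
every time merge_factor segments accumulate at one level, they are rewritten into a single
segment at the next level. It is an idealized assumption made for illustration (Lucene's real
merge policies work on segment sizes and have more knobs), and the numbers it produces are not
measurements from our setup.

```python
def simulate_merges(num_flushes, merge_factor):
    """Return (segments_left, flush_units_rewritten) for an idealized level merge.

    Each flush adds one level-0 segment of size 1 unit. When merge_factor
    segments exist at a level, they are merged into one segment at the next
    level, rewriting all of their data once.
    """
    levels = [0]  # levels[i] = number of segments currently at level i
    rewritten = 0  # total flush-sized units written by merges
    for _ in range(num_flushes):
        levels[0] += 1
        level = 0
        while levels[level] == merge_factor:
            levels[level] = 0
            rewritten += merge_factor ** (level + 1)
            if level + 1 == len(levels):
                levels.append(0)
            levels[level + 1] += 1
            level += 1
    return sum(levels), rewritten


for factor in (2, 10):
    segments, rewritten = simulate_merges(100, factor)
    print(f"merge_factor={factor}: {segments} segments, {rewritten} units rewritten")
```

In this model a lower merge factor leaves fewer segments but rewrites the same data more
often, which is the trade-off being tuned when experimenting with merge factors.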

Tom Burton-West
www.hathitrust.org/blogs

-----Original Message-----
From: kenf_nc [mailto:ken.foster@realestate.com] 
Sent: Sunday, July 18, 2010 8:18 AM
To: solr-user@lucene.apache.org
Subject: Re: indexing best practices


No one has done performance analysis? Or has a link to anywhere where it's
been done?

Basically, what is the fastest way to get documents into Solr? So many options available,
what's the fastest:
1) file import (xml, csv)  vs  DIH  vs POSTing
2) number of concurrent clients   1   vs 10 vs 100 ...is there a point of diminishing
returns?

I have 16 million small (8 to 10 fields, no large text fields) docs that get
updated monthly and 2.5 million largish (20 to 30 fields, a couple HTML text
fields) that get updated monthly. It currently takes about 20 hours to do a
full import. I would like to cut that down as much as possible.
Thanks,
Ken
-- 
View this message in context: http://lucene.472066.n3.nabble.com/indexing-best-practices-tp973274p976313.html
Sent from the Solr - User mailing list archive at Nabble.com.
