lucene-solr-user mailing list archives

From Otis Gospodnetic <>
Subject Re: Some advice on scalability
Date Thu, 15 May 2008 19:54:25 GMT

Quick feedback:

1) use 1.3-dev or 1.3 when it comes out, not 1.2

2) you did not mention Solr's distributed search functionality explicitly, so I get the feeling
you are not aware of it.  See the DistributedSearch page on the Solr wiki
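
For reference, distributed search in Solr 1.3 is driven by the "shards" request parameter:
you send the query to any one node and list every shard in that parameter, and that node
merges the results.  A sketch of assembling such a query URL (hostnames and ports here are
placeholders, not anything from this thread):

```python
from urllib.parse import urlencode

# Hypothetical shard hosts -- substitute your own host:port/core values.
shards = [
    "shard1.example.com:8983/solr",
    "shard2.example.com:8983/solr",
]

# Any one node can coordinate; it fans the query out to every shard
# listed in the 'shards' parameter and merges the ranked results.
params = urlencode({
    "q": "harry potter",
    "shards": ",".join(shards),
})
url = "http://shard1.example.com:8983/solr/select?" + params
print(url)
```

The coordinating node does the score-based merge itself, so no extra
weighting layer is needed on top.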

3) you definitely don't want a single 500M-document index that's 2 TB in size - think about
the ratio of index size to available RAM
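
As back-of-the-envelope arithmetic (the 2 TB total comes from the thread below; the 64 GB
RAM per box is purely an assumed example, not a recommendation):

```python
import math

# Rough sizing sketch: how many shards would it take for each shard's
# index to fit within the RAM available on its search box?
total_index_bytes = 2 * 1024**4    # ~2 TB total index, per the thread
ram_per_box_bytes = 64 * 1024**3   # assumed 64 GB RAM per box

min_shards = math.ceil(total_index_bytes / ram_per_box_bytes)
print(min_shards)  # -> 32 shards just to fit each shard's index in RAM
```

In practice you need headroom for the OS, the JVM, and caches on top of the
index itself, so the real shard count would be higher still.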

4) you can try logically sharding your index, but I suspect that will result in uneven term
distribution that will not yield optimal relevancy-based ordering.  Instead, you may have
to assign records/documents to shards in some more random fashion (see the ML archives for
recent discussion on this - search for MD5 and SHA-1.  Lance, want to put that on the Wiki?)
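
The hash-based assignment mentioned above can be sketched like this (MD5 here, as in the
list discussion; the shard count of 8 is an arbitrary illustration):

```python
import hashlib

NUM_SHARDS = 8  # illustrative shard count, not a recommendation

def shard_for(doc_id: str) -> int:
    """Map a document ID to a shard by hashing it, so documents spread
    evenly across shards regardless of merchant or product type."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same ID always hashes to the same shard, so updates and deletes
# can be routed without a lookup table.
print(shard_for("merchant42-sku0001"))
print(shard_for("merchant42-sku0002"))
```

The trade-off is that every query must fan out to all shards, but term
statistics stay roughly uniform, which keeps relevancy ranking sane.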

5) Hardware recommendations are hard to do.  While people may make suggestions, the only way
to know how *your* hardware works with *your* data and *your* shards and *your* type of queries
is by benchmarking.

Sematext -- -- Lucene - Solr - Nutch

----- Original Message ----
> From: William Pierce <>
> To:
> Sent: Thursday, May 15, 2008 12:23:03 PM
> Subject: Some advice on scalability
> Folks:
> We are building a search capability into our web site and plan to use Solr.  While we 
> have the initial prototype version up and running on Solr 1.2,  we are now 
> turning our attention to sizing/scalability.  
> Our app in brief:  We get merchant sku files (in either xml/csv) which we 
> process and index and make available to our site visitors to search.   Our 
> current plan calls for us to support approx 10,000 merchants, each with an 
> average of 50,000 SKUs.   This makes a total of approx 500 million SKUs.  
> In addition,  we assume that on a daily basis approx 5-10% of the SKUs need to 
> be updated (either added/deleted/modified).   (Assume each sku will be approx 
> 4K)
> Here are a few questions that we are thinking about and would value any insights 
> you all may have:
> a) Should we have just one giant master index (containing all the SKUs) and 
> then have multiple slaves to handle the search queries?    In this case, the 
> master index will be approx 2 TB in size.  Not being an expert in solr/lucene,  
> I am thinking that this may be a bad idea to let one index become so large.  
> What size limit should we assume for each index?
> b) Or, should we partition the 10,000 merchants into N buckets and have a master 
> index for each of the N buckets?   We could partition the merchants depending on 
> their type or some other simple algorithm.   Then,  we could have slaves setup 
> for each of the N masters.  The trick here will be to partition the merchants 
> carefully.  Ideally we would like a search for any product type to hit only one 
> index but this may not be possible always.   For example, a search for "Harry 
> Potter" may result in hits in "books", "dvds", "memorabilia", etc etc.  
> With N masters we will have to plan for having a distributed search across the N 
> indices (and then some mechanism for weighting the results across the results 
> that come back).   Any recommendations for a distributed search solution?   I 
> saw some references to Katta.  Is this viable?
> In the extreme case, we could have one master for each of the merchants (if 
> there are 10,000 merchants there will be 10,000 master indices).   The advantage 
> here is that indices will have to be updated only for every merchant who submits 
> a new data file.  The others remain unchanged.
> c) By the way,  for those of you who have deployed solr on a production 
> environment can you give me your hardware configuration and the rough number of 
> search queries that can be handled per second by a single solr instance -- 
> assuming a dedicated box?
> d) Our plan is to release a beta version in Spring 2009.  Should we plan on using 
> Solr 1.2, or move to Solr 1.3 now?
> Any insights/thoughts/whitepapers will be greatly appreciated!
> Cheers,
> Bill
