lucene-solr-user mailing list archives

From Shawn Heisey <apa...@elyograg.org>
Subject Re: solr multicore vs sharding vs 1 big collection
Date Sun, 02 Aug 2015 02:06:08 GMT
On 8/1/2015 6:49 PM, Jay Potharaju wrote:
> I currently have a single collection with 40 million documents and an index
> size of 25 GB. The collection gets updated every n minutes, and as a result
> the number of deleted documents is constantly growing. The data in the
> collection is an amalgamation of records from more than 1000 customers. The
> number of documents per customer is around 100,000 on average.
> 
> That being said, I'm trying to get a handle on the growing number of deleted
> documents. Because of the growing index size, both disk space and memory
> are being used up, and I would like to reduce it to a manageable size.
> 
> I have been thinking of splitting the data into multiple cores, one for each
> customer. This would let me manage the smaller collections easily and
> create/update them quickly. My concern is that the number of
> collections might become an issue. Any suggestions on how to address this
> problem? What are my other alternatives to moving to multiple
> cores?
> 
> Solr: 4.9
> Index size:25 GB
> Max doc: 40 million
> Doc count:29 million
> 
> Replication:4
> 
> 4 servers in solrcloud.

Creating 1000+ collections in SolrCloud is definitely problematic.  If
you need to choose between a lot of shards and a lot of collections, I
would definitely go with a lot of shards.  I would also want a lot of
servers for an index with that many pieces.

https://issues.apache.org/jira/browse/SOLR-7191
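
As a rough illustration of the "one collection, many shards" layout, here is
a minimal sketch that builds a Collections API CREATE request. The host, the
collection name "customers", and the shard/replica counts are assumptions
picked for a four-server cluster, not values from the original setup.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards, replication_factor,
                          max_shards_per_node):
    """Build a Collections API CREATE URL for one multi-shard collection."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        # Needed when shards * replicas exceeds the number of servers.
        "maxShardsPerNode": max_shards_per_node,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Hypothetical values: 8 shards x 2 replicas = 16 cores, 4 per server.
url = create_collection_url("http://localhost:8983/solr", "customers", 8, 2, 4)
print(url)
```

The sketch only builds the URL; sending it with a browser or curl would
create the collection.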

I don't think it would matter how many collections or shards you have
when it comes to how many deleted documents are in your index.  If you
want to clean up a large number of deletes in an index, the best option
is an optimize.  An optimize requires a large amount of disk I/O, so it
can be extremely disruptive if the query volume is high.  It should be
done when the query volume is at its lowest.  For the index you
describe, a nightly or weekly optimize seems like a good option.
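
To put a number on it: with a max doc of 40 million and a doc count of 29
million, about 27.5% of the index is deleted documents. The sketch below
computes that fraction and builds the update-handler URL that triggers an
optimize. The host and the collection name "collection1" are assumptions,
and run_optimize is a hypothetical helper that is defined but not called
here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def deleted_fraction(max_doc, num_docs):
    """Fraction of the index occupied by deleted documents."""
    return (max_doc - num_docs) / max_doc


def optimize_url(base_url, collection):
    """Build the update-handler URL that asks Solr to optimize."""
    return f"{base_url}/{collection}/update?{urlencode({'optimize': 'true'})}"


def run_optimize(base_url, collection, timeout=3600):
    """Send the optimize request; this can take a long time on a big index."""
    with urlopen(optimize_url(base_url, collection), timeout=timeout) as resp:
        return resp.status


# Figures from the original message: max doc 40 million, doc count 29 million.
fraction = deleted_fraction(40_000_000, 29_000_000)
print(f"{fraction:.1%} of the index is deleted documents")
print(optimize_url("http://localhost:8983/solr", "collection1"))
```

A cron job that calls run_optimize during the quietest hours would be one
way to get the nightly or weekly schedule.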

Aside from having a lot of deleted documents in your index, what kind of
problems are you trying to solve?

Thanks,
Shawn

