lucene-solr-user mailing list archives

From Erik Hatcher <erik.hatc...@gmail.com>
Subject Re: Speeding up indexing
Date Tue, 28 Feb 2012 20:33:59 GMT
30 million - that's feasible on a single (beefy) Solr server, but whether it's advisable
to go distributed depends on other factors, such as query speed issues you may have with
that many docs on a single server, expected collection growth, and so on.

As for your questions further below -

1. Having multiple processes send docs to Solr simultaneously can definitely improve indexing
throughput, up to the limits of what your environment can handle, of course. There are many
variables here: how many connections your server can handle at once, how much effort/memory
is involved in indexing your documents (parsing- and analysis-wise), etc.

2. There are Solr configuration tweaks for sure (see Solr's example solrconfig.xml for details)
that affect indexing performance, but whether any of those settings would be an improvement
or a detriment depends on where your bottlenecks are.

3. If you're going to index in parallel on multiple Solr servers, which of course is
architecturally possible, then you're basically setting up for distributed search. It's
possible to merge indexes (on the same server), but for your particular case that isn't an
architecture I'd recommend.

I'd stick to a single server and see if/where that has issues, parallelize your indexing,
and consider Solr's distributed search as needed from there.

	Erik
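
A minimal sketch of the "several clients pushing batches at one Solr server" idea from
point 1, not something taken from the thread itself. Assumptions: Python with only the
standard library, a hypothetical update URL (`SOLR_UPDATE_URL`) that accepts a JSON array
of documents (the actual path depends on your Solr version and core name), and helper names
(`batches`, `post_batch`, `index_parallel`) invented for illustration. Commits and error
handling are left out.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

# Assumption: adjust to your own Solr host, core, and update handler path.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"


def batches(docs, size):
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def post_batch(batch, url=SOLR_UPDATE_URL):
    """POST one batch as a JSON array and return the number of docs sent."""
    request = urllib.request.Request(
        url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        response.read()  # drain the response; urlopen raises on HTTP errors
    return len(batch)


def index_parallel(docs, batch_size=500, workers=4, send=post_batch):
    """Send batches from `workers` threads; return the total docs sent.

    Note: pool.map queues every batch up front, so for very large inputs
    you would want to feed the pool in bounded chunks instead.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(send, batches(docs, batch_size)))
```

The `send` parameter exists so the batching and threading can be tried without a running
server (pass any function that takes a list of docs and returns a count). Reasonable values
for `batch_size` and `workers` depend on the bottlenecks mentioned above, so measure rather
than guess.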



On Feb 27, 2012, at 14:36 , Memory Makers wrote:

> A quick add on to this -- we have over 30 million documents.
> 
> I take it that we should be looking @ Distributed Solr?
>  as in
> http://www.lucidimagination.com/content/scaling-lucene-and-solr#d0e344
> 
> Thanks.
> 
> On Mon, Feb 27, 2012 at 2:33 PM, Memory Makers <memmakersorg@gmail.com> wrote:
> 
>> Many thanks for the response.
>> 
>> Here are the revised questions:
>> 
>> For example, if I have N processes that are producing documents to index:
>> 1. Should I have them simultaneously submit documents to Solr (will this
>> improve the indexing throughput)?
>> 2. Is there anything I can do Solr configuration-wise that will allow me
>> to speed up indexing?
>> 3. Is there an architecture where I can have two (or more) Solr servers do
>> indexing in parallel?
>> 
>> Thanks.
>> 
>> On Mon, Feb 27, 2012 at 1:46 PM, Erik Hatcher <erik.hatcher@gmail.com> wrote:
>> 
>>> Yes, absolutely.  Parallelizing indexing can make a huge difference.  How
>>> you do so will depend on your indexing environment.  Most crudely, running
>>> multiple indexing scripts on different subsets of data, up to the
>>> limitations of your operating system and hardware, is how many do it.
>>> SolrJ has some multithreaded facility, as does DataImportHandler.
>>> Distributing the indexing to multiple machines, but pointing all to the
>>> same Solr server, is effectively the same as multi-threading it.... push
>>> documents into Solr from wherever as fast as it can handle it.  This is
>>> definitely how many do this.
>>> 
>>>       Erik
>>> 
>>> On Feb 27, 2012, at 13:24 , Memory Makers wrote:
>>> 
>>>> Hi,
>>>> 
>>>> Is there a way to speed up indexing by increasing the number of threads
>>>> doing the indexing or perhaps by distributing indexing on multiple
>>>> machines?
>>>> 
>>>> Thanks.
>>> 
>>> 
>> 

