lucene-solr-user mailing list archives

From Anshum Gupta <ans...@anshumgupta.net>
Subject Re: Do all SolrCloud nodes communicate with the database when indexing a collection?
Date Thu, 18 Feb 2016 07:01:03 GMT
Hi Colin,

As of when I last checked, DIH works with SolrCloud but has its
limitations. It was designed for non-cloud mode and is single-threaded.
It runs only on the node you start it on, so only that node connects to your
database. That node might not host the leader for the shard a document
belongs to, which adds an extra hop for those documents.

SolrCloud is designed for multi-threaded indexing, and I'd highly recommend
using SolrJ to speed up your indexing. Yes, that would involve writing
some code, but it would speed things up considerably.
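
To give a rough idea of the shape of that code, here is a minimal sketch of
the threading pattern only, using nothing but the JDK. It assumes your table
has a numeric primary key that can be split into ranges. The names
ParallelIndexSketch, BatchSink, splitRange and indexInParallel are made up
for illustration and are not part of SolrJ. In a real indexer, each worker
would run its own SQL query for its id range, and the sink would build
SolrInputDocument objects and pass them to CloudSolrClient.add(...), which is
ZooKeeper-aware and routes documents to the right shard leaders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

class ParallelIndexSketch {

    /** Hypothetical stand-in for the Solr client; a real version would wrap SolrJ. */
    interface BatchSink {
        void send(List<Long> ids) throws Exception;
    }

    /** Splits the inclusive id range [minId, maxId] into at most `parts` contiguous slices. */
    static List<long[]> splitRange(long minId, long maxId, int parts) {
        List<long[]> slices = new ArrayList<>();
        long total = maxId - minId + 1;
        long size = (total + parts - 1) / parts; // ceiling division
        for (long start = minId; start <= maxId; start += size) {
            slices.add(new long[] {start, Math.min(start + size - 1, maxId)});
        }
        return slices;
    }

    /** Runs one worker per slice; each worker sends its ids to the sink in batches. */
    static long indexInParallel(long minId, long maxId, int threads,
                                int batchSize, BatchSink sink) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong sent = new AtomicLong();
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (long[] slice : splitRange(minId, maxId, threads)) {
                futures.add(pool.submit(() -> {
                    // A real worker would query the database for this id range here.
                    List<Long> batch = new ArrayList<>(batchSize);
                    for (long id = slice[0]; id <= slice[1]; id++) {
                        batch.add(id);
                        if (batch.size() == batchSize) {
                            sink.send(batch);
                            sent.addAndGet(batch.size());
                            batch = new ArrayList<>(batchSize);
                        }
                    }
                    if (!batch.isEmpty()) {
                        sink.send(batch);
                        sent.addAndGet(batch.size());
                    }
                    return null;
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // rethrows any worker failure
            }
        } finally {
            pool.shutdown();
        }
        return sent.get();
    }

    public static void main(String[] args) throws Exception {
        AtomicLong received = new AtomicLong();
        // The sink only counts ids here; it is where the Solr call would go.
        long sent = indexInParallel(1, 100_000, 4, 1_000,
                batch -> received.addAndGet(batch.size()));
        System.out.println("sent=" + sent + " received=" + received.get());
    }
}
```

The thread count and batch size above are placeholders; what works best
depends on your database and your Solr nodes, so it is worth measuring.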


On Wed, Feb 17, 2016 at 10:51 PM, Colin Freas <cfreas@stsci.edu> wrote:

>
> I just set up a SolrCloud instance with 2 Solr nodes & another machine
> running zookeeper.
>
> I’ve imported 200M records from a SQL Server database, and those records
> are split nicely between the 2 nodes.  Everything seems ok.
>
> I did the data import via the admin ui.  It took not quite 8 hours, which
> I guess is fine.  So, in the middle of the import I checked to see what was
> connected to the SQL Server machine.  It turned out that only the node that
> I had started the import on was actually connected to my database server.
>
> Is that the expected behavior?  Is there any way to have all nodes of a
> SolrCloud index communicate with the database during the indexing?  Would
> that speed up indexing?  Maybe this isn’t a bottleneck I should be worried
> about.
>
> Thanks,
> -Colin
>



-- 
Anshum Gupta
