Date: Sat, 9 Jan 2016 15:44:50 -0800
Subject: Re: Querying only replica's
From: Erick Erickson <erickerickson@gmail.com>
To: solr-user <solr-user@lucene.apache.org>

bq: is it best/good to get the CLUSTERSTATUS via the collection API
and explicitly send queries to a replica to ensure I don't send
queries to the leaders of my collection

In a word, _no_. SolrCloud is vastly different from the old
master/slave setup. In SolrCloud, every replica of a shard (the leader
included) indexes all the docs for that shard and serves queries. The
additional burden the leader has is actually very small. There's
absolutely no reason to _not_ use the leader to serve queries.

As far as sending updates goes, there would be a _little_ benefit to
sending the updates directly to the leader, but _far_ more benefit in
using SolrJ. If you use SolrJ (and CloudSolrClient), then the
documents are split up on the _client_ and the docs for a particular
shard are automatically sent only to the leader for that shard.
Using SolrJ you can essentially scale indexing linearly with the
number of shards you have. Sending updates over plain HTTP does not
scale linearly. Your particular app may not care, but in
high-throughput situations this can be significant.
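
To make the client-side split concrete, here is a rough sketch of the
idea in Python. It is not SolrJ and not Solr's actual router (which
assigns each shard a range of hash values); the hash used here, the
function names, and the doc ids are stand-ins chosen only to show docs
being grouped per shard before anything goes over the wire.

```python
import zlib
from collections import defaultdict


def shard_for(doc_id, num_shards):
    # Stand-in hash for illustration only. Solr's real router maps ids
    # onto per-shard hash ranges rather than using crc32 and a modulo.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def split_by_shard(docs, num_shards):
    """Group docs into one batch per shard, keyed by shard index."""
    batches = defaultdict(list)
    for doc in docs:
        batches[shard_for(doc["id"], num_shards)].append(doc)
    return dict(batches)


# Hypothetical docs; only the "id" field matters for routing here.
docs = [{"id": f"doc-{i}"} for i in range(1000)]
batches = split_by_shard(docs, 3)

for shard, batch in sorted(batches.items()):
    # Each batch would go straight to that shard's leader, so no node
    # has to forward docs on behalf of another shard.
    print(f"shard{shard + 1}: {len(batch)} docs")
```

Because every batch lands on the one leader that owns it, adding
shards adds indexing capacity instead of adding forwarding work.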

So rather than spend time and effort working out which node is the
leader, sending updates to it over HTTP, and having that leader
forward the docs to the correct shard anyway, I recommend investing
the time in using SolrJ for updates. Or just ignore the problem and
devote your efforts to something more valuable.
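
For a sense of what the hand-rolled route involves, the sketch below
picks the leader of each shard out of a CLUSTERSTATUS response. The
sample is hand-written and abridged to the fields used; the host
names, core names, and collection name are made up, and the helper
name leader_urls is hypothetical.

```python
# Abridged, hand-written sample in the shape CLUSTERSTATUS returns as
# JSON. All hosts, cores, and the collection name are hypothetical.
sample = {
    "cluster": {"collections": {"mycollection": {"shards": {
        "shard1": {"replicas": {
            "core_node1": {"core": "mycollection_shard1_replica1",
                           "base_url": "http://solr1.example.com:8983/solr",
                           "state": "active", "leader": "true"},
            "core_node2": {"core": "mycollection_shard1_replica2",
                           "base_url": "http://solr2.example.com:8983/solr",
                           "state": "active"},
        }},
        "shard2": {"replicas": {
            "core_node3": {"core": "mycollection_shard2_replica1",
                           "base_url": "http://solr3.example.com:8983/solr",
                           "state": "active"},
            "core_node4": {"core": "mycollection_shard2_replica2",
                           "base_url": "http://solr4.example.com:8983/solr",
                           "state": "active", "leader": "true"},
        }},
    }}}}
}


def leader_urls(status, collection):
    """Return {shard name: URL of the leader core} for one collection."""
    shards = status["cluster"]["collections"][collection]["shards"]
    leaders = {}
    for shard_name, shard in shards.items():
        for replica in shard["replicas"].values():
            # The leader flag is the string "true", and it is simply
            # absent on non-leader replicas.
            if replica.get("leader") == "true":
                leaders[shard_name] = f'{replica["base_url"]}/{replica["core"]}'
    return leaders


print(leader_urls(sample, "mycollection"))
```

Leadership can move to another replica at any time, so a client doing
this would also have to re-poll and handle stale answers, which is
part of the effort being weighed here.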

So in short:
1> just stick a load balancer in front of _all_ your Solr nodes for
queries. And note that there's an internal load balancer already in
Solr that routes things around anyway, although putting a load
balancer in front of your entire cluster makes it so there's not a
single point of failure.
2> Depending on your throughput needs, either
2a> use SolrJ to index
2b> don't worry about it and send updates through the load balancer as
well. There'll be an extra hop if you send updates to a replica, but
if that's significant you should be using SolrJ.
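
As a minimal illustration of point 1 for a plain-HTTP client, the
sketch below rotates queries across every node without caring which
ones are leaders. The host names, the collection name, and the
select_url helper are assumptions for the example; a real load
balancer would also handle health checks and failover.

```python
from itertools import cycle
from urllib.parse import urlencode

# Hypothetical addresses for a 9-node cluster and a made-up collection.
NODES = [f"http://solr{i}.example.com:8983/solr" for i in range(1, 10)]
COLLECTION = "mycollection"

_next_node = cycle(NODES)


def select_url(query, rows=10):
    """Build a /select URL on the next node in rotation, leader or not."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{next(_next_node)}/{COLLECTION}/select?{params}"


# Three consecutive queries land on three different nodes.
for _ in range(3):
    print(select_url("title:solr"))
```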

As for 5.5, it's not at all clear that there _will_ be a 5.5. 5.4 was
just released in early December. There's usually a several-month lag
between point releases, and there's some agitation to start the 6.0
release process, so it's up in the air.


On Sat, Jan 9, 2016 at 12:04 PM, Robert Brown <rob@intelcompute.com> wrote:
> Hi,
>
> (btw, when is 5.5 due?  I see the docs reference it, but not the download
> page)
>
> Anyway, I index and query Solr over HTTP (no SolrJ, etc.) - is it best/good
> to get the CLUSTERSTATUS via the collection API and explicitly send queries
> to a replica to ensure I don't send queries to the leaders of my collection,
> to improve performance?  Likewise with sending updates directly to a
> Leader?
>
> My leaders will receive full updates of the entire collection once a day, so
> I would assume if the leader is handling queries too, performance would be
> hit?
>
> Is the CLUSTERSTATUS API the only way to do this btw without SolrJ, etc.?  I
> wasn't sure if ZooKeeper would be able to tell me also.
>
> Do I also need to do anything to ensure the leaders are never sent queries
> from the replicas?
>
> Does this all sound sane?
>
> One of my collections is 3 shards, with 2 replicas each (9 total nodes),
> 70m docs in total.
>
> Thanks,
> Rob
>