lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: querying on shards
Date Wed, 21 Mar 2012 18:34:16 GMT
I'd _really_ recommend that you do not do this unless and
until it's provably necessary. As Shawn says, the load on the
shards that return nothing will probably be very low. And
this is the kind of thing that one spends endless hours
debugging. Somehow, sometime, I flat guarantee you'll
be trying to figure out why your results aren't what you expect
and
1> you'll find out your algorithm for distributing the queries
     is not querying the right shard
2> your indexing process didn't put the docs on the shard you
     thought.
3> you changed your indexing distribution and now some docs
     are on shards you're not querying.
4> you fixed the problem in <3> in all but one place in the code...

and maybe all of the above at once... <G>...

This really smells like premature optimization.

Best
Erick
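
To make failure modes 1> through 3> concrete, here is a minimal sketch
of client-side routing. Everything in it is an assumption for
illustration: the shard names, the integer document ids, and the simple
modulo rule are hypothetical and are not how Solr itself places
documents. The point is only that indexing and querying must share one
routing function, and that changing the shard list silently moves
documents.

```python
# Hypothetical shard names; a real deployment would use its own.
SHARDS = ["shard1", "shard2", "shard3", "shard4"]


def shard_for(doc_id, shards=SHARDS):
    """Pick a shard for an integer doc id (assumed modulo routing).

    Both the indexer and the query client must call this same function,
    otherwise queries go to shards that do not hold the documents.
    """
    return shards[doc_id % len(shards)]


def moved_docs(doc_ids, old_shards, new_shards):
    """Return the ids whose shard changes when the shard list changes."""
    return [
        d for d in doc_ids
        if shard_for(d, old_shards) != shard_for(d, new_shards)
    ]


if __name__ == "__main__":
    moved = moved_docs(range(100), SHARDS, SHARDS + ["shard5"])
    print(f"{len(moved)} of 100 docs would be on a different shard")
```

Under this assumed rule, adding a fifth shard reroutes most existing
ids, so a client that queries only the "expected" shards would miss
documents until everything is reindexed.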

On Tue, Mar 20, 2012 at 10:37 AM, Shawn Heisey <solr@elyograg.org> wrote:
> On 3/19/2012 11:55 PM, Ankita Patil wrote:
>>
>> Hi,
>>
>> I wanted to know whether it is feasible to query on all the shards even if
>> the query yields data only from a few shards and not all. Or is it better to
>> mention those shards explicitly from which we get the data and only query
>> on them.
>>
>> for example :
>> I have 4 shards. Now I have a query which yields data only from 2 shards.
>> So should I select those 2 shards only and query on them or it is ok to
>> query on all the shards? Will that affect the performance in any way?
>
>
> I use a sharded index, but I am not a seasoned Java/Solr/Lucene developer.
>  My clients do not use the shards parameter themselves - they talk to a
> load balancer, which in turn talks to a special core that has the shards in
> its request handler config and has no index of its own.  I call it a broker,
> because that is what our previous search product (EasyAsk) called it.
>
> As I understand things, the performance of your slowest shard, whether that
> is because of index size on that shard or the underlying hardware, will be a
> large factor in the performance of the entire index.  A distributed query
> sends an identical query to all the shards it is configured for.  It gathers
> all those results in parallel and builds a final result to send to the
> client.
>
> You MIGHT get better performance by not including the other shards.  If the
> "no results" shard query returns super-fast, it probably won't really make
> any difference.  If it takes a long time to get the answer that there are no
> results, then removing them would make things go faster.  That requires
> intelligence on the client to know where the data is.  If the client does
> not know where the data is, it is safer to simply include all the shards.
>
> Thanks,
> Shawn
>
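
The trade-off Shawn describes can be sketched roughly as follows. The
host names, the /select path, and the latency numbers are hypothetical;
only the shards request parameter comes from the thread. The latency
function is a deliberately simplified model (slowest shard plus a merge
cost), not a measurement of any real Solr setup.

```python
from urllib.parse import urlencode

# Hypothetical shard addresses in host:port/path form.
ALL_SHARDS = [
    "host1:8983/solr",
    "host2:8983/solr",
    "host3:8983/solr",
    "host4:8983/solr",
]


def distributed_query_url(broker, query, shards):
    """Build a query URL that lists the shards to search explicitly."""
    params = urlencode({"q": query, "shards": ",".join(shards)})
    return f"http://{broker}/select?{params}"


def estimated_latency_ms(shard_latencies_ms, merge_overhead_ms=0.0):
    """Assumed model: shards answer in parallel, so the slowest dominates."""
    return max(shard_latencies_ms) + merge_overhead_ms


if __name__ == "__main__":
    # Query every shard versus only the two believed to hold the data.
    print(distributed_query_url("broker:8983/solr", "title:foo", ALL_SHARDS))
    print(distributed_query_url("broker:8983/solr", "title:foo", ALL_SHARDS[:2]))

    # If the "no results" shards answer quickly, dropping them changes nothing.
    print(estimated_latency_ms([5, 40, 2, 3]))  # all four shards
    print(estimated_latency_ms([5, 40]))        # only the two with data
```

In this model, leaving out the two empty shards only helps when one of
them is slower than the slowest shard that actually holds data.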
