lucene-solr-dev mailing list archives

From "Patrick Hunt (JIRA)" <>
Subject [jira] Commented: (SOLR-1277) Implement a Solr specific naming service (using Zookeeper)
Date Wed, 02 Dec 2009 19:17:20 GMT


Patrick Hunt commented on SOLR-1277:

bq. Patrick, how low is it feasible to set the timeout? Could it be set low enough that it
could be the only input to a failover decision in the case of a very high query load? That
is, say a cluster with 3 query slaves is handling 600 queries per second, which means each
is getting 200qps, or one every 5ms on average. If a slave were to fail, queries will start
backing up pretty quickly unless a decision is made to drop the failed node within 500ms or
so. Clearly, whatever node is distributing the queries to the slaves can mark the failed node
down (say, in the case of a HW load balancer), but could we rely on ZK to handle this for

See for background

Typically you will have a server tickTime of 2 seconds, so the minimum session timeout that the
server currently allows is 4 seconds. This means that the client will send a ping every 4/3
seconds, waiting up to 4/3 seconds for a response before it considers the server down. The
server, of course, will expire the session after 4 seconds in this case.
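
To put numbers on this, here is a rough sketch of the arithmetic in Python. The function names are made up for illustration. The 2x tickTime minimum and the one-third ping interval are taken from the figures above. The backlog estimate assumes the failed slave keeps receiving its even share of the 600 qps until the failure is detected, which may not match a real load balancer.

```python
def min_session_timeout_ms(tick_time_ms: int) -> int:
    """Smallest session timeout the server accepts: two ticks (per the figures above)."""
    return 2 * tick_time_ms


def ping_interval_ms(session_timeout_ms: int) -> float:
    """The client pings every third of the session timeout."""
    return session_timeout_ms / 3


def queries_sent_to_dead_slave(total_qps: float, slaves: int, detection_ms: int) -> float:
    """Queries routed to a failed slave before it is dropped.

    Assumption: traffic is split evenly and nothing else marks the slave down.
    """
    per_slave_qps = total_qps / slaves
    return per_slave_qps * detection_ms / 1000


timeout = min_session_timeout_ms(2000)
print(timeout)                                        # 4000
print(round(ping_interval_ms(timeout)))               # 1333
print(queries_sent_to_dead_slave(600, 3, timeout))    # 800.0
print(queries_sent_to_dead_slave(600, 3, 500))        # 100.0
```

So with session expiry as the only signal, roughly 800 queries would land on the dead slave, versus about 100 with a 500ms decision.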

It should work (say 601 is fixed), but I would not encourage you to go down this road. Instead,
you can do something better (although I don't know enough about Solr; perhaps this is worse,
and it may also depend on whether you have a hw load balancer, and which one).

Rather, I would suggest that you do something similar to the lease: periodically publish some
load information from the query slaves to zk. Every 250ms your query slave could push an update
that says "I am doing X qps currently". If you don't see an update in 500ms, maybe you consider
the slave dead till it comes back (updates the znode again). If you don't have a hwLB, you
might even be able to take advantage of this information when passing queries to slaves. Worst
case scenario, you could expose this information through a dashboard, giving an operator good
insight into solr workings.
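
A minimal sketch of that bookkeeping, in Python. It uses a plain in-memory dictionary as a stand-in for one znode per slave, so `LoadRegistry` and its methods are hypothetical and not part of any ZooKeeper or Solr API. Timestamps are passed in explicitly to keep the example deterministic; a real reader would also have to allow for clock skew and zk latency.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

PUBLISH_INTERVAL_MS = 250   # how often each slave pushes an update
STALE_AFTER_MS = 500        # no update for longer than this: treat the slave as dead


@dataclass
class LoadReport:
    qps: float
    published_at_ms: int


class LoadRegistry:
    """In-memory stand-in for one znode per query slave (hypothetical)."""

    def __init__(self) -> None:
        self._reports: Dict[str, LoadReport] = {}

    def publish(self, slave: str, qps: float, now_ms: int) -> None:
        """Slave side: overwrite this slave's latest load report."""
        self._reports[slave] = LoadReport(qps, now_ms)

    def live_slaves(self, now_ms: int) -> List[str]:
        """Reader side: slaves whose last report is at most STALE_AFTER_MS old."""
        return sorted(
            name for name, report in self._reports.items()
            if now_ms - report.published_at_ms <= STALE_AFTER_MS
        )

    def least_loaded(self, now_ms: int) -> Optional[str]:
        """Pick the live slave reporting the lowest qps, or None if none are live."""
        live = self.live_slaves(now_ms)
        if not live:
            return None
        return min(live, key=lambda name: self._reports[name].qps)


registry = LoadRegistry()
for name, qps in [("slave1", 210.0), ("slave2", 180.0), ("slave3", 205.0)]:
    registry.publish(name, qps, now_ms=0)

# slave2 goes silent; the other two keep publishing every 250ms.
for t in (PUBLISH_INTERVAL_MS, 2 * PUBLISH_INTERVAL_MS):
    registry.publish("slave1", 210.0, now_ms=t)
    registry.publish("slave3", 205.0, now_ms=t)

print(registry.live_slaves(now_ms=600))    # ['slave1', 'slave3']
print(registry.least_loaded(now_ms=600))   # slave3

# slave2 comes back by updating its report again.
registry.publish("slave2", 180.0, now_ms=750)
print(registry.live_slaves(now_ms=750))    # ['slave1', 'slave2', 'slave3']
```

The same per-slave reports could feed a dashboard or a software query distributor; which consumer makes sense depends on whether a hw load balancer is already in the path.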

Each slave is doing 4 updates to zk per second in this case. You are more reliant on having
a stable framework for ZK, so keep that in mind (the cluster must be performant, with low gc
pauses in zk itself, i.e. tune the gc properly, etc.).

See my zk service latency review for what you should expect re latencies in some situations:

> Implement a Solr specific naming service (using Zookeeper)
> ----------------------------------------------------------
>                 Key: SOLR-1277
>                 URL:
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 1.5
>         Attachments: log4j-1.2.15.jar, SOLR-1277.patch, SOLR-1277.patch, SOLR-1277.patch,
>   Original Estimate: 672h
>  Remaining Estimate: 672h
> The goal is to give Solr server clusters self-healing attributes
> where if a server fails, indexing and searching don't stop and
> all of the partitions remain searchable. For configuration, the
> ability to centrally deploy a new configuration without servers
> going offline.
> We can start with basic failover and start from there?
> Features:
> * Automatic failover (i.e. when a server fails, clients stop
> trying to index to or search it)
> * Centralized configuration management (i.e. new solrconfig.xml
> or schema.xml propagates to a live Solr cluster)
> * Optionally allow shards of a partition to be moved to another
> server (i.e. if a server gets hot, move the hot segments out to
> cooler servers). Ideally we'd have a way to detect hot segments
> and move them seamlessly. With NRT this becomes somewhat more
> difficult but not impossible?

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
