lucene-solr-dev mailing list archives

From "Patrick Hunt (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-1277) Implement a Solr specific naming service (using Zookeeper)
Date Wed, 02 Dec 2009 17:26:20 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784908#action_12784908
] 

Patrick Hunt commented on SOLR-1277:
------------------------------------

bq. Any pointers on ways to deal with this?

From our experience with hbase (which is the only place we've seen this issue so far,
at least to this extent) you need to think about:

1) client timeout value tradeoffs
2) effects of session expiration due to GC pauses, and potential ways to mitigate them

For 1) there is a tradeoff. (The good thing is that not all clients need to use the same timeout,
so you can tune it based on the client type; you can even have multiple sessions for a single
client, each with its own timeout.) You can set the timeout higher, so that if your ZK client
pauses it doesn't get expired; however, this also means that if your client crashes, the session
won't be expired until the timeout elapses, so the rest of your system will not be notified of
the change (say you are doing leader election) for longer than you might like.
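
For concreteness, here is a minimal sketch (plain ZooKeeper Java client) of where the session
timeout actually gets set: it is just an argument to the ZooKeeper constructor, so each component
or session can pick its own value. The connect string and the 30000 ms value are made up for
illustration; the server may also negotiate the timeout into its configured min/max bounds.

    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    // Sketch only: the session timeout is per-ZooKeeper-handle, so different Solr
    // components (or even different sessions in one component) can use different values.
    public class ZkSessionExample {
      public static void main(String[] args) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000 /* session timeout, ms */,
            new Watcher() {
              public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                  connected.countDown();
                }
              }
            });
        connected.await();           // wait until the session is established
        // ... create ephemeral znodes, do leader election, etc. ...
        zk.close();
      }
    }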

For 2) you need to think about the potential failure cases and their effects. a) Say your
ZK client (Solr component X) fails (the host crashes): do you need to know about this in 5
seconds, or 30? b) Say the host is network partitioned due to a burp in the network that
lasts 5 seconds: is this OK, or does the rest of the Solr system need to know about it?
c) Say component X GC-pauses for 4 minutes: do you want the rest of the system to react
immediately, or consider this "ok" and just wait around for a while for X to come back?
Keep in mind that from the perspective of "the rest of your system" you can't tell the
difference between a), b), or c) (etc...); from their viewpoint X is gone and they don't
know why (unless it eventually comes back).
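
One thing that helps in practice is that the client itself can at least tell a transient
disconnect apart from a full session expiration via its connection watcher. A hedged sketch
follows; rejoinCluster() is a hypothetical placeholder, not an existing API:

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;

    // Sketch: how component X's own watcher might distinguish a transient disconnect
    // (case b/c above, the session may still recover) from a real session expiration
    // (ephemeral znodes are gone and X must rejoin from scratch).
    public class ComponentXWatcher implements Watcher {
      public void process(WatchedEvent event) {
        if (event.getType() != Event.EventType.None) {
          return; // only session/connection state changes are handled here
        }
        switch (event.getState()) {
          case SyncConnected:
            // (re)connected within the timeout: session and ephemerals are intact
            break;
          case Disconnected:
            // partition or pause in progress: don't panic yet, the session may survive
            break;
          case Expired:
            // the server gave up on us: create a new ZooKeeper handle and re-register
            rejoinCluster();
            break;
          default:
            break;
        }
      }

      private void rejoinCluster() { /* hypothetical: recreate session and ephemerals */ }
    }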

In hbase's case session expiration is expensive, as the region server's master will reallocate
the table (or some such). In your case the effects of X going down may not be very expensive.
If that's the case, then having a low(er) session timeout for X may not be a problem (just
deal with the session expiration when it does happen; X will eventually come back).

If X recovery is expensive you may want to set the timeout very high, but as I said, this makes
the system less responsive if X has a real problem. Another option we explored with hbase
is to use a "lease" recipe instead: set a very high timeout, but have X update its znode (still
ephemeral) every N seconds. If the rest of the system (whoever is interested in X's status)
doesn't see an update from X in T seconds, perhaps you log a warning ("where is X?"). If you
don't see an update from X in T*2 seconds, page the operator ("warning, maybe problems with X").
If you don't see one in T*3 seconds (perhaps this is the timeout you use, in which case the
znode is removed), consider X down, clean up, and enact recovery. These are made-up
actions/times, but you can see what I'm getting at. With a lease it's not "all or nothing":
you (Solr) have the option to take action based on the lease time, rather than only reacting
to the znode being deleted, as in the typical case. The tradeoff is that it's a bit more
complicated for you - you need to implement the lease rather than just relying on the znode
being deleted - though you would of course still set a watch on the znode to get notified when
it is removed (etc...)
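
A rough sketch of that lease recipe, under made-up assumptions: the znode path /solr/componentX,
the refresh period N, and the threshold T are all hypothetical, and the "actions" are just
println placeholders. X stores a timestamp in its (still ephemeral) znode and refreshes it every
N ms; an observer grades how stale it is against T, 2*T, and 3*T. If the znode is gone entirely
(session expired), getData throws NoNodeException, which is the usual all-or-nothing signal.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // Sketch of the lease idea; path, period, and thresholds are illustrative only.
    public class LeaseSketch {
      static final String PATH = "/solr/componentX";   // hypothetical znode path
      static final long N_MS = 5000, T_MS = 15000;     // made-up period / threshold

      // Component X side: create the ephemeral znode, then refresh it every N ms.
      static void startLease(final ZooKeeper zk) throws Exception {
        zk.create(PATH, ts(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
        exec.scheduleAtFixedRate(new Runnable() {
          public void run() {
            try {
              zk.setData(PATH, ts(), -1);   // -1 = any version
            } catch (Exception e) { /* disconnected: the next tick retries */ }
          }
        }, N_MS, N_MS, TimeUnit.MILLISECONDS);
      }

      // Observer side: poll (or combine with a watch) and grade how stale X's lease is.
      static void checkLease(ZooKeeper zk) throws Exception {
        long last = Long.parseLong(new String(zk.getData(PATH, false, null)));
        long age = System.currentTimeMillis() - last;
        if (age > 3 * T_MS)      System.out.println("X considered down: clean up, enact recovery");
        else if (age > 2 * T_MS) System.out.println("page the operator: maybe problems with X");
        else if (age > T_MS)     System.out.println("log warning: where is X?");
      }

      static byte[] ts() {
        return Long.toString(System.currentTimeMillis()).getBytes();
      }
    }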


> Implement a Solr specific naming service (using Zookeeper)
> ----------------------------------------------------------
>
>                 Key: SOLR-1277
>                 URL: https://issues.apache.org/jira/browse/SOLR-1277
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: log4j-1.2.15.jar, SOLR-1277.patch, SOLR-1277.patch, SOLR-1277.patch, zookeeper-3.2.1.jar
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> The goal is to give Solr server clusters self-healing attributes
> where if a server fails, indexing and searching don't stop and
> all of the partitions remain searchable. For configuration, the
> ability to centrally deploy a new configuration without servers
> going offline.
> We can start with basic failover and go from there?
> Features:
> * Automatic failover (i.e. when a server fails, clients stop
> trying to index to or search it)
> * Centralized configuration management (i.e. new solrconfig.xml
> or schema.xml propagates to a live Solr cluster)
> * Optionally allow shards of a partition to be moved to another
> server (i.e. if a server gets hot, move the hot segments out to
> cooler servers). Ideally we'd have a way to detect hot segments
> and move them seamlessly. With NRT this becomes somewhat more
> difficult but not impossible?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

