hbase-issues mailing list archives

From "Feng Honghua (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5487) Generic framework for Master-coordinated tasks
Date Thu, 10 Oct 2013 14:12:42 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791516#comment-13791516 ]

Feng Honghua commented on HBASE-5487:

bq.Master is the Actor. Having it go across a network to get/set the 'state' in a service
that is non-transactional wasn't our smartest move.
Regionservers currently report state via ZK. Master reads it from ZK. Would be better if RS
just reported directly to the Master.
[~stack] Yes, this is exactly what I proposed in HBASE-9726 :-)
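For illustration, a minimal sketch of what direct RS-to-master reporting could look like (the interface and names below are hypothetical, not actual HBase APIs):
{code:java}
// Hypothetical RPC surface: the RS reports transitions straight to the master
// instead of writing them to ZK. All names here are illustrative only.
public enum RegionState { OPENING, OPEN, CLOSING, CLOSED, SPLITTING, FAILED_OPEN }

public interface MasterRegionStateService {
  /**
   * Called by a regionserver over RPC when a region transition completes
   * (or fails). The master is then the only party that persists the new state.
   */
  void reportRegionTransition(String encodedRegionName, RegionState newState);
}

// Regionserver side, after finishing an open:
//   masterStub.reportRegionTransition(region.getEncodedName(), RegionState.OPEN);
{code}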

bq.I am wondering whether it makes sense to update the meta table from the various regionservers
on the region state changes or go via the master. But maybe the master doesn't need to be
a bottleneck if possible. A regionserver could first update the meta table, and then just
notify the master that a certain transition was done; the master could initiate the next transition
[~devaraj] It would be better to let the master update the meta table rather than the various
regionservers. The master being the single actor and truth-maintainer can avoid many tricky
bugs/problems. And for frequent state changes to the meta table, the regionserver serving
the (state) meta table would become the bottleneck sooner than the master which issues the update
requests, so it doesn't matter much whether the update requests come from the master or from various regionservers.
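As a rough sketch of the "master as single writer" idea, using the HBase client API (the info:state column is an assumption for illustration, not the exact meta schema):
{code:java}
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MetaStateWriter {
  // Master-side helper: persist a region's state into the meta table so the
  // master stays the single actor/truth-maintainer. info:state is assumed.
  static void persistRegionState(Connection conn, byte[] metaRowKey, String state)
      throws IOException {
    try (Table meta = conn.getTable(TableName.valueOf("hbase:meta"))) {
      Put put = new Put(metaRowKey);
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("state"), Bytes.toBytes(state));
      meta.put(put); // one row, applied atomically by the RS hosting meta
    }
  }
}
{code}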

bq.I prefer not to use ZK since it's kind of the root cause of uncertainty: has the master/region
server got/processed the event? has the znode been hijacked since the master/region server changed
its mind?
We should store the state in the meta table, which is cached in memory.
Whether to use a coprocessor is not a big concern to me. If we don't use a coprocessor, I prefer
to use the master as the proxy to do all meta table updates. Otherwise, we need to listen
to something for updates.
[~jxiang] Agree. IMO ZK alone is not the root cause of uncertainty; the current usage pattern
of ZK is. The pattern where the regionserver updates state in ZK and the master listens
to ZK and updates state in its local memory accordingly exhibits too many tricky scenarios/bugs,
because ZK watches are one-time (which can result in missed state transitions) and the notification/processing
is asynchronous (which can lead to delayed/out-of-date state in the master's memory). And when
replacing ZK with the meta table, we also need to discard this 'RS updates, master listens' pattern,
since the meta table inherently lacks a listen-notify mechanism :-)
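To make the one-time-watch pitfall concrete, a small sketch against the plain ZooKeeper client (handler names are illustrative):
{code:java}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// A ZK watch fires at most once; any change between the callback firing and
// the watch being re-registered is invisible, which is how intermediate
// transitions (e.g. OPENING) can be missed entirely.
public class TransitionWatcher implements Watcher {
  private final ZooKeeper zk;
  private final String znode;

  public TransitionWatcher(ZooKeeper zk, String znode) {
    this.zk = zk;
    this.znode = znode;
  }

  public void start() throws Exception {
    zk.getData(znode, this, null); // registers a one-time watch
  }

  @Override
  public void process(WatchedEvent event) {
    try {
      // The RS may have written OPENING and then OPEN before we get here;
      // this read only sees the latest value, so OPENING is silently skipped.
      byte[] latest = zk.getData(znode, this, null); // must re-register the watch
      handleState(latest);
    } catch (Exception e) {
      // session expiry / connection loss handling omitted
    }
  }

  private void handleState(byte[] data) {
    // update master's in-memory region state (asynchronously w.r.t. the RS write)
  }
}
{code}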

bq.I think ZK got a bad reputation not on its own merit, but on how we use it.
I can see that problems exist but IMHO the advantages outweigh the disadvantages compared to a system
table. Co-located system table, I am not so sure, but so far there's not even a high-level design for
this (for example, do all splits have to go thru master/system table now? how does it recover?).
Perhaps we should abstract an async persistence mechanism sufficiently and then decide. Whether
it would be ZK+notifications, or system table, or memory + wal, or colocated system table,
or what.
The problem is that the usage inside master of that interface would depend on perf characteristics.
Anyway, we can work out the state transitions/concurrency/recovery without tying 100% to a particular
mechanism.
[~sershe] Agree on "ZK got a bad reputation not on its own merit, but on how we use it.",
especially if you mean that the master currently relies on ZK watches/notifications to maintain/update
its in-memory region state. IMO this is almost the biggest root cause of the problems in the current
assignment design. If we just used ZK the same way we would use the meta table to store states, it
would make no big difference whether the states are stored in ZK or the meta table, right (except
that using the meta table can have much better performance when restarting a big cluster with a
large number of regions)? But using ZK's update/listen pattern does make the difference.
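For contrast, a sketch of using ZK purely as a state store, the way the meta table would be used (no watches; paths and encoding are illustrative):
{code:java}
import org.apache.zookeeper.ZooKeeper;

// Using ZK only for storage: the master reads the authoritative state on
// demand instead of depending on watch notifications, which removes the
// missed/stale-notification class of bugs. The storage choice then matters
// mostly for performance.
public class ZkStateStore {
  static byte[] readRegionState(ZooKeeper zk, String regionZnode) throws Exception {
    return zk.getData(regionZnode, false, null); // false: no watch, plain read
  }

  static void writeRegionState(ZooKeeper zk, String regionZnode, byte[] state)
      throws Exception {
    zk.setData(regionZnode, state, -1); // -1: any version, unconditional write
  }
}
{code}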

bq.btw, any input on actor model? 
Things queue up operations/notifications ("ops") for master; "AM" runs on a timer or when the queue
is non-empty, taking as inputs the cluster state (incl. ongoing internal actions it ordered before,
e.g. OPENING state for a region) plus new ops from the queue, on a single thread; it generates new
actions (not physically doing anything, e.g. talking to RS); the ops state and cluster state
are persisted; then the actions are executed on different threads (e.g. messages sent to RS-es,
etc.), and "AM" runs again, or sleeps for some time if the ops queue is empty.
That is a different model, not sure if it scales for large clusters.
[~sershe] "operations/notifications" means RS responses action progress to master? Master
is the single point to update the state "truth"(to meta table) and RS doesn't know where the
states are stored and doesn't access them directly, right? I think a communication/storage
diagram can help a lot for an overall clear understanding here:-)
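For what it's worth, a minimal single-threaded sketch of the actor loop described above (all types and names are illustrative, not a proposed implementation):
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

interface Op {}                       // an RS notification, client request, timeout, ...
interface Action { void execute(); }  // e.g. send OPEN/CLOSE to an RS

class ClusterState {
  List<Action> apply(List<Op> batch) { return new ArrayList<>(); } // pure decision step
  void persist() { /* write ops + cluster state durably (meta/WAL/...) */ }
}

public class AssignmentActor implements Runnable {
  private final BlockingQueue<Op> ops = new LinkedBlockingQueue<>();
  private final ExecutorService actionPool = Executors.newFixedThreadPool(8);
  private final ClusterState state = new ClusterState();

  public void submit(Op op) { ops.add(op); } // called by RPC handlers, timers, etc.

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        Op first = ops.poll(1, TimeUnit.SECONDS); // sleep while the queue is empty
        if (first == null) continue;
        List<Op> batch = new ArrayList<>();
        batch.add(first);
        ops.drainTo(batch);                        // take everything queued so far
        List<Action> actions = state.apply(batch); // single thread owns the state
        state.persist();                           // persist before acting
        for (Action a : actions) {
          actionPool.execute(a::execute);          // side effects run on worker threads
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }
}
{code}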

> Generic framework for Master-coordinated tasks
> ----------------------------------------------
>                 Key: HBASE-5487
>                 URL: https://issues.apache.org/jira/browse/HBASE-5487
>             Project: HBase
>          Issue Type: New Feature
>          Components: master, regionserver, Zookeeper
>    Affects Versions: 0.94.0
>            Reporter: Mubarak Seyed
>            Priority: Critical
>         Attachments: Region management in Master.pdf
> Need a framework to execute master-coordinated tasks in a fault-tolerant manner. 
> Master-coordinated tasks such as online schema change and delete-range (deleting region(s)
based on start/end key) can make use of this framework.
> The advantages of framework are
> 1. Eliminate repeated code in Master, ZooKeeper tracker and Region-server for master-coordinated tasks
> 2. Ability to abstract the common functions across Master -> ZK and RS -> ZK
> 3. Easy to plugin new master-coordinated tasks without adding code to core components
