hadoop-common-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: why not zookeeper for the namenode
Date Fri, 19 Feb 2010 15:59:11 GMT
On Fri, Feb 19, 2010 at 12:41 AM, Thomas Koch <thomas@koch.ro> wrote:
> Hi,
> yesterday I read the documentation of zookeeper and the zk contrib bookkeeper.
> From what I read, I thought that bookkeeper would be the ideal enhancement
> for the namenode, to make it distributed and therefore finally highly available.
> Then I searched to see whether work in that direction had already started and found
> that apparently a totally different approach has been chosen:
> http://issues.apache.org/jira/browse/HADOOP-4539
> Since I'm new to hadoop, I do trust in your decision. However I'd be glad, if
> somebody could satisfy my curiosity:

I didn't work on that particular design, but I'll do my best to answer
your questions below:

> - Why hasn't zookeeper(-bookkeeper) been chosen? Especially since it
>  seems to do a similar job already in hbase.

HBase does not currently use Bookkeeper. Rather, it just uses ZK for
election and some small amount of metadata tracking. It therefore is
only storing a small amount of data in ZK, whereas the Hadoop NN would
have to store many GB worth of namesystem data. I don't think anyone
has tried putting such a large amount of data in ZK yet, and being the
first to do something is never without problems :)

Additionally, when this design was made, Bookkeeper was very new. It's
still in development, as I understand it.

> - Isn't it the case that with HADOOP-4539 clients can only connect to one
>  namenode at a time, leaving the burden of all reads and writes on that one
>  node's shoulders?


> - Isn't it the case that zookeeper would be more network efficient? It requires
>  only a majority of nodes to receive a change, while HADOOP-4539 seems to
>  require all backup nodes to receive a change before it's persisted.

Potentially. However, "all backup nodes" is usually just 1. In our
experience, and the experience of most other Hadoop deployments I've
spoken with, the primary factors decreasing NN availability are *not*
system crashes, but rather lack of online upgrade capability, slow
restart time for planned restarts, etc. Adding a hot standby can help
with the planned upgrade situation, but two standbys don't give you
much reliability above one. In a datacenter, the failure correlations
are generally such that racks either fail independently, or the entire
DC has lost power. So, there aren't a lot of cases where 3 NN replicas
would buy you much over 2.
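
To make the acknowledgement comparison concrete, here is a minimal sketch of
how many nodes must receive a change under each scheme. It is only an
illustration of the arithmetic, not code from ZooKeeper or HADOOP-4539; the
function names are hypothetical, and it assumes a quorum means a strict
majority of the ensemble.

```python
def majority_acks(ensemble_size: int) -> int:
    """Smallest number of nodes that forms a strict majority of the ensemble.

    Assumption: a change counts as persisted once a strict majority has it.
    """
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def all_backup_acks(num_backups: int) -> int:
    """Write-to-all scheme: every backup node must receive the change."""
    if num_backups < 0:
        raise ValueError("num_backups must not be negative")
    return num_backups


if __name__ == "__main__":
    # Majority-based ensembles of a few typical odd sizes.
    for size in (3, 5, 7):
        print(f"ensemble of {size}: {majority_acks(size)} nodes must receive a change")

    # Write-to-all with one or two backup nodes.
    for backups in (1, 2):
        print(f"{backups} backup node(s): {all_backup_acks(backups)} must receive a change")
```

With a single backup node, write-to-all waits on one node, while a
three-node majority ensemble waits on two, so the network saving from a
majority scheme only shows up once there are several backups.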


> Thanks for any explanation,
> Thomas Koch, http://www.koch.ro
