hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Deployment architecture for Hadoop, HBase & Hive recommendations?
Date Tue, 03 Aug 2010 23:08:30 GMT
Sorry it took a day to answer, see inline.

J-D

On Mon, Aug 2, 2010 at 10:47 AM, Maxim Veksler <maxim@vekslers.org> wrote:
> Hello,
>
> We're setting up a data warehouse environment that includes Hadoop, HBase,
> Hive and our own in-house MR jobs.
> I would like with your permission to discuss the architecture we should
> choose for this.

Cooool.

>
> Today we process ~10GB of data per day.
> Trying to balance between performance & consolidation, would you consider
> the following setup reasonable?
>
>
> EC2 m1.large (amd64bit, 7.5GB RAM, 400GB HD).
> EC2 m1.small (intel x86, 1.7GB RAM, 160GB HD).

If you are planning on doing any kind of MapReduce, large instances
won't be enough. We recommend giving HBase 4-6GB of heap, which is
already really tight on an m1.large.
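For reference, the region server heap is set in HBase's environment file; a minimal sketch, assuming the stock conf/hbase-env.sh that ships with HBase (the value is in megabytes, default 1000; 4000 here is an assumption sized for a 7.5GB m1.large that also runs a DataNode and TaskTracker):

```shell
# conf/hbase-env.sh -- give each region server a 4GB heap
# (HBASE_HEAPSIZE is in MB; 4000 leaves some of the m1.large's
# 7.5GB for the DataNode, TaskTracker and OS)
export HBASE_HEAPSIZE=4000
```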

>
>
> Cluster components:
>
> 1:[NameNode], 1:[SecondaryNameNode], 1:[JobTracker], n:[DataNode]
> n:[TaskTracker], 1:[HBaseMaster], n:[HBaseRegionServer], 2*n+1:[ZooKeeper]
>
>
> Planned setup :
>
> m1.large NodeM1 "master" : [NameNode], [SecondaryNameNode], [HBaseMaster]
> m1.small NodeZ1 "zoo1" : [ZooKeeper]
> m1.small NodeZ2 "zoo2" : [ZooKeeper]
> m1.small NodeZ3 "zoo3" : [ZooKeeper]
> m1.large NodeS1 "slave1" : [DataNode], [TaskTracker], [HBaseRegionServer]
> m1.large NodeS2 "slave2" : [DataNode], [TaskTracker], [HBaseRegionServer]
> m1.large NodeS3 "slave3" : [DataNode], [TaskTracker], [HBaseRegionServer]
> m1.large NodeS4 "slave4" : [DataNode], [TaskTracker], [HBaseRegionServer]
> m1.large NodeS5 "slave5" : [DataNode], [TaskTracker], [HBaseRegionServer]
>
> I'm having second thoughts about:
>
> - Zookeepers on separate machines (why not run them on a slave1, slave3,
> slave4 for ex.) ?

HBase uses ZooKeeper mostly for cluster membership management, and
ZK requires fast (ideally dedicated) IO, while the slave nodes are
usually IO-hungry. The two don't mix well.

Also, your cluster will only be as reliable as your master machine, so
having 3 independent nodes for ZK doesn't really make sense. At your
scale, I would just run a standalone ZooKeeper server on the master
machine.
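If you go the standalone-ZK-on-the-master route, a minimal sketch of the relevant setting, assuming an HBase 0.20-era conf layout and with "master" standing in for NodeM1's real hostname:

```xml
<!-- conf/hbase-site.xml: a single ZooKeeper instance, colocated
     with the master; "master" is a placeholder hostname -->
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>master</value>
</property>
```

With HBASE_MANAGES_ZK left at its default of true in hbase-env.sh, HBase will start and stop that ZooKeeper instance for you.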

> - Do I really need the SecondaryNameNode? Can I disable it completely or
> should I get another 1 instance and perhaps run it with a zoo keeper (while
> the other 2 zoo keepers will remain small instances) ?

The SecondaryNameNode isn't a backup NameNode; it's really part of the
NameNode. See http://wiki.apache.org/hadoop/FAQ#A7
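One wrinkle worth knowing if you do move it around: in the 0.20-era scripts, the SecondaryNameNode is started on whatever hosts are listed in conf/masters, which is a misleadingly named file. A sketch, with "master" standing in for the chosen hostname:

```
# conf/masters -- hosts where start-dfs.sh launches the
# SecondaryNameNode (not the NameNode, despite the name)
master
```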

> - Is it wise to run Hadoop & HBase slaves on the same instance or should
> I separate them ?

It's the best thing to do, see
http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html

>
> Also, how much resources (RAM, I/O) should I be giving each resource? Some
> things are clear like: Make the Data directory of Hadoop on several block
> devices for efficient I/O but others are not: Is HBase CPU of RAM bound?
> Will hadoop benefit from lots of RAM?

HBase is a database: give it all the RAM you can. It's also often
IO-bound, especially on EC2 where the IO is so poor. The datanode
and tasktracker don't really require much more than the defaults. If
you run mapreduce jobs, what they are bound by depends on the jobs
themselves. Usually it's IO, and IO is poor on EC2 (I like to repeat
that because people underestimate how much slower it is).
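On the MapReduce side, the knobs that matter most for RAM are the number of task slots per node and the per-child heap; a hedged sketch, assuming 0.20-era property names and that each m1.large also hosts a region server (the values themselves are assumptions, not recommendations from this thread):

```xml
<!-- conf/mapred-site.xml: keep task memory modest so the
     region server keeps its heap; tune to your jobs -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>
```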

>
> Architecture references will be highly appreciated :)

There's not much more to say, it's pretty straightforward.

>
> Thank you for reading,
> Would love to hear your thoughts on this.
>
> Maxim.
>
