hbase-user mailing list archives

From Maxim Veksler <ma...@vekslers.org>
Subject Deployment architecture for Hadoop, HBase & Hive recommendations?
Date Mon, 02 Aug 2010 17:47:18 GMT

We're setting up a data warehouse environment that includes Hadoop, HBase,
Hive and our own in-house MR jobs.
With your permission, I would like to discuss the architecture we should
choose for this.

Today we process ~10GB of data per day.
Trying to balance performance against consolidation, would you consider
the following setup reasonable?

EC2 m1.large (amd64, 64-bit, 7.5GB RAM, 400GB HD).
EC2 m1.small (intel x86, 1.7GB RAM, 160GB HD).

Cluster components:

1:[NameNode], 1:[SecondaryNameNode], 1:[JobTracker], n:[DataNode],
n:[TaskTracker], 1:[HBaseMaster], n:[HBaseRegionServer], 2*n+1:[ZooKeeper]
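
For reference, the 2*n+1 sizing for ZooKeeper comes from its majority quorum: an ensemble of 2*n+1 servers keeps serving with up to n servers down. A minimal Python sketch of that arithmetic follows; the function names are illustrative only and not part of any ZooKeeper API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of an ensemble with the given number of servers."""
    if ensemble_size < 1:
        raise ValueError("an ensemble needs at least one server")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Number of servers that can be down while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    # Compare a few ensemble sizes; an even size adds no extra fault tolerance.
    for size in (1, 3, 4, 5):
        print(size, quorum_size(size), tolerated_failures(size))
```

Under this rule a 4-server ensemble tolerates no more failures than a 3-server one, which is why odd sizes are the usual choice.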

Planned setup :

m1.large NodeM1 "master" : [NameNode], [SecondaryNameNode], [HBaseMaster]
m1.small NodeZ1 "zoo1" : [ZooKeeper]
m1.small NodeZ2 "zoo2" : [ZooKeeper]
m1.small NodeZ3 "zoo3" : [ZooKeeper]
m1.large NodeS1 "slave1" : [DataNode], [TaskTracker], [HBaseRegionServer]
m1.large NodeS2 "slave2" : [DataNode], [TaskTracker], [HBaseRegionServer]
m1.large NodeS3 "slave3" : [DataNode], [TaskTracker], [HBaseRegionServer]
m1.large NodeS4 "slave4" : [DataNode], [TaskTracker], [HBaseRegionServer]
m1.large NodeS5 "slave5" : [DataNode], [TaskTracker], [HBaseRegionServer]
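
As a sanity check, the planned setup can be written down as data and compared against the component list. The sketch below is only an illustration: the dictionary is keyed by the host aliases from the plan, and `REQUIRED_ROLES`, `role_counts`, `instance_counts` and `missing_roles` are hypothetical helper names, not part of Hadoop or HBase.

```python
from collections import Counter

# Roles named in the "Cluster components" list.
REQUIRED_ROLES = {
    "NameNode", "SecondaryNameNode", "JobTracker", "DataNode",
    "TaskTracker", "HBaseMaster", "HBaseRegionServer", "ZooKeeper",
}

# Host alias -> (instance type, daemons), copied from the planned setup.
SLAVE_ROLES = ["DataNode", "TaskTracker", "HBaseRegionServer"]
PLANNED = {
    "master": ("m1.large", ["NameNode", "SecondaryNameNode", "HBaseMaster"]),
    "zoo1": ("m1.small", ["ZooKeeper"]),
    "zoo2": ("m1.small", ["ZooKeeper"]),
    "zoo3": ("m1.small", ["ZooKeeper"]),
    **{f"slave{i}": ("m1.large", list(SLAVE_ROLES)) for i in range(1, 6)},
}


def role_counts(layout):
    """Count how many hosts run each daemon."""
    return Counter(role for _, roles in layout.values() for role in roles)


def instance_counts(layout):
    """Count how many hosts use each instance type."""
    return Counter(instance_type for instance_type, _ in layout.values())


def missing_roles(layout):
    """Required daemons that no host in the layout runs."""
    return REQUIRED_ROLES - set(role_counts(layout))


if __name__ == "__main__":
    print(dict(role_counts(PLANNED)))
    print(dict(instance_counts(PLANNED)))
    print("unassigned:", sorted(missing_roles(PLANNED)))
```

Run against the plan as written, the check reports that the JobTracker from the component list is not placed on any node.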

I'm having second thoughts about:

- ZooKeepers on separate machines (why not run them on slave1, slave3 and
slave4, for example)?
- Do I really need the SecondaryNameNode? Can I disable it completely, or
should I get 1 more instance and perhaps run it alongside a ZooKeeper (while
the other 2 ZooKeepers remain on small instances)?
- Is it wise to run Hadoop & HBase slaves on the same instance, or should
I separate them?

Also, how many resources (RAM, I/O) should I give each component? Some
things are clear, like spreading the Hadoop data directory over several
block devices for efficient I/O, but others are not: Is HBase CPU or RAM
bound? Will Hadoop benefit from lots of RAM?

Architecture references will be highly appreciated :)

Thank you for reading,
Would love to hear your thoughts on this.

