hbase-user mailing list archives

From Jeff Hammerbacher <ham...@cloudera.com>
Subject Re: Deployment architecture for Hadoop, HBase & Hive recommendations?
Date Wed, 04 Aug 2010 04:51:50 GMT
Hey Maxim,

Very cool stuff, and J-D definitely hit the high notes. For a cluster
that's going to do real work, unless you're sold on AWS for all of your
infrastructure, I'd definitely recommend real hardware from a vendor like
Supermicro or moving to a "bare metal" cloud environment like SoftLayer.
More notes on hardware setup at
http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/
.

Best of luck,
Jeff

On Tue, Aug 3, 2010 at 4:08 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:

> Sorry took a day to answer, see inline.
>
> J-D
>
> On Mon, Aug 2, 2010 at 10:47 AM, Maxim Veksler <maxim@vekslers.org> wrote:
> > Hello,
> >
> > We're setting up a data warehouse environment that includes Hadoop, HBase,
> > Hive and our own in-house MR jobs.
> > With your permission, I would like to discuss the architecture we should
> > choose for this.
>
> Cooool.
>
> >
> > Today we process ~10GB of data per day.
> > Trying to balance between performance & consolidation, would you consider
> > the following setup reasonable?
> >
> >
> > EC2 m1.large (amd64bit, 7.5GB RAM, 400GB HD).
> > EC2 m1.small (intel x86, 1.7GB RAM, 160GB HD).
>
> If you are planning on doing any kind of MapReduce, large instances
> won't be enough. We recommend that HBase is given 4-6GB of heap, which
> is already really tight with m1.large.
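
To make the "really tight" point concrete, here is a rough memory budget for one m1.large slave. Only the 7.5GB of RAM and the 4-6GB HBase heap come from this thread; the DataNode, TaskTracker, task-slot and OS figures are illustrative assumptions, so treat this as a sketch rather than a sizing rule.

```python
# Rough per-node memory budget for an m1.large slave (7.5 GB RAM, per the thread).
# Every default below except the HBase heap range is an assumption for illustration.
NODE_RAM_GB = 7.5


def remaining_ram_gb(hbase_heap_gb, datanode_gb=1.0, tasktracker_gb=1.0,
                     task_slots=4, task_heap_gb=0.2, os_reserve_gb=1.0):
    """Return the RAM left on the node after all daemons and task JVMs are counted."""
    used = (hbase_heap_gb + datanode_gb + tasktracker_gb
            + task_slots * task_heap_gb + os_reserve_gb)
    return NODE_RAM_GB - used


if __name__ == "__main__":
    for heap in (4, 6):  # the recommended HBase heap range
        print(f"HBase heap {heap} GB -> {remaining_ram_gb(heap):.1f} GB left")
```

A negative result means the node is oversubscribed under these assumptions, which is the situation J-D is warning about once MapReduce tasks run next to a region server.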
>
> >
> >
> > Cluster components:
> >
> > 1:[NameNode], 1:[SecondaryNameNode], 1:[JobTracker], n:[DataNode],
> > n:[TaskTracker], 1:[HBaseMaster], n:[HBaseRegionServer], 2*n+1:[ZooKeeper]
> >
> >
> > Planned setup :
> >
> > m1.large NodeM1 "master" : [NameNode], [SecondaryNameNode], [HBaseMaster]
> > m1.small NodeZ1 "zoo1" : [ZooKeeper]
> > m1.small NodeZ2 "zoo2" : [ZooKeeper]
> > m1.small NodeZ3 "zoo3" : [ZooKeeper]
> > m1.large NodeS1 "slave1" : [DataNode], [TaskTracker], [HBaseRegionServer]
> > m1.large NodeS2 "slave2" : [DataNode], [TaskTracker], [HBaseRegionServer]
> > m1.large NodeS3 "slave3" : [DataNode], [TaskTracker], [HBaseRegionServer]
> > m1.large NodeS4 "slave4" : [DataNode], [TaskTracker], [HBaseRegionServer]
> > m1.large NodeS5 "slave5" : [DataNode], [TaskTracker], [HBaseRegionServer]
> >
> > I'm having second thoughts about:
> >
> > - ZooKeepers on separate machines (why not run them on slave1, slave3,
> > slave4, for example)?
>
> HBase uses zookeeper mostly to do cluster membership management, and
> ZK requires fast (hopefully dedicated) IO. The slave nodes are usually
> IO hungry. This isn't compatible.
>
> Also your cluster will be as reliable as your master machine, so
> having 3 independent nodes for ZK doesn't really make sense. At your
> level, I would just put a standalone zookeeper server on the master
> machine.
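
As a quick illustration of the 2*n+1 sizing mentioned above: ZooKeeper needs a strict majority of its ensemble to stay up, so the number of server failures it survives can be worked out as below. The function name is hypothetical and this is only a sketch of the arithmetic.

```python
def tolerated_failures(ensemble_size):
    """Number of ZooKeeper servers that can fail while a majority quorum survives."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    quorum = ensemble_size // 2 + 1  # strict majority of the ensemble
    return ensemble_size - quorum


if __name__ == "__main__":
    for size in (1, 3, 5):
        print(f"{size} server(s): survives {tolerated_failures(size)} failure(s)")
```

A 3-node ensemble survives one ZooKeeper failure, but with a single master machine hosting the NameNode and HBaseMaster the cluster still goes down with that machine, which is why a standalone server there loses little at this scale.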
>
> > - Do I really need the SecondaryNameNode? Can I disable it completely or
> > should I get 1 more instance and perhaps run it with a ZooKeeper (while
> > the other 2 ZooKeepers remain small instances)?
>
> The SecondaryNameNode isn't a backup NameNode, it's really part of the
> Namenode. See http://wiki.apache.org/hadoop/FAQ#A7
>
> > - Is it wise to run Hadoop & HBase slaves on the same instance or should
> > I separate them ?
>
> Running them on the same instance is the best thing to do, see
> http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html
>
> >
> > Also, how many resources (RAM, I/O) should I be giving each component? Some
> > things are clear, like putting the Hadoop data directory on several block
> > devices for efficient I/O, but others are not: Is HBase CPU or RAM bound?
> > Will Hadoop benefit from lots of RAM?
>
> HBase is a database, so give it all the RAM you can. Also it's often
> IO-bound, especially on EC2 because the IO is so poor. The datanode
> and tasktracker don't really require much more than the defaults. If
> you run mapreduce jobs, then what they are bound by depends on the jobs
> you are running. Usually it's IO, and IO is poor on EC2
> (I like to repeat that because people underestimate how much slower it
> is).
>
> >
> > Architecture references will be highly appreciated :)
>
> There's not much more to say, it's pretty straightforward.
>
> >
> > Thank you for reading,
> > Would love to hear your thoughts on this.
> >
> > Maxim.
> >
>
