hbase-user mailing list archives

From Maxim Veksler <ma...@vekslers.org>
Subject Re: Deployment architecture for Hadoop, HBase & Hive recommendations?
Date Mon, 09 Aug 2010 08:34:09 GMT
Thank you very much guys.
Great input.

No further questions :)

Maxim.

On Wed, Aug 4, 2010 at 7:51 AM, Jeff Hammerbacher <hammer@cloudera.com> wrote:

> Hey Maxim,
>
> Very cool stuff, and J-D definitely hit the high notes. For a cluster
> that's going to do real work, unless you're sold on AWS for all of your
> infrastructure, I'd definitely recommend real hardware from a vendor like
> Supermicro or moving to a "bare metal" cloud environment like SoftLayer.
> More notes on hardware setup at
>
> http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/
> .
>
> Best of luck,
> Jeff
>
> On Tue, Aug 3, 2010 at 4:08 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>
> > Sorry took a day to answer, see inline.
> >
> > J-D
> >
> > On Mon, Aug 2, 2010 at 10:47 AM, Maxim Veksler <maxim@vekslers.org> wrote:
> > > Hello,
> > >
> > > We're setting up a data warehouse environment that includes Hadoop, HBase,
> > > Hive and our own in-house MR jobs.
> > > I would like, with your permission, to discuss the architecture we should
> > > choose for this.
> >
> > Cooool.
> >
> > >
> > > Today we process ~10GB of data per day.
> > > Trying to balance between performance & consolidation, would you consider
> > > the following setup reasonable?
> > >
> > >
> > > EC2 m1.large (amd64bit, 7.5GB RAM, 400GB HD).
> > > EC2 m1.small (intel x86, 1.7GB RAM, 160GB HD).
> >
> > If you are planning on doing any kind of MapReduce, large instances
> > won't be enough. We recommend that HBase is given 4-6GB of heap, which
> > is already really tight with m1.large.
> >
> > >
> > >
> > > Cluster components:
> > >
> > > 1:[NameNode], 1:[SecondaryNameNode], 1:[JobTracker], n:[DataNode],
> > > n:[TaskTracker], 1:[HBaseMaster], n:[HBaseRegionServer], 2*n+1:[ZooKeeper]
> > >
> > >
> > > Planned setup :
> > >
> > > m1.large NodeM1 "master" : [NameNode], [SecondaryNameNode], [HBaseMaster]
> > > m1.small NodeZ1 "zoo1" : [ZooKeeper]
> > > m1.small NodeZ2 "zoo2" : [ZooKeeper]
> > > m1.small NodeZ3 "zoo3" : [ZooKeeper]
> > > m1.large NodeS1 "slave1" : [DataNode], [TaskTracker], [HBaseRegionServer]
> > > m1.large NodeS2 "slave2" : [DataNode], [TaskTracker], [HBaseRegionServer]
> > > m1.large NodeS3 "slave3" : [DataNode], [TaskTracker], [HBaseRegionServer]
> > > m1.large NodeS4 "slave4" : [DataNode], [TaskTracker], [HBaseRegionServer]
> > > m1.large NodeS5 "slave5" : [DataNode], [TaskTracker], [HBaseRegionServer]
> > >
> > > I'm having second thoughts about:
> > >
> > > - ZooKeepers on separate machines (why not run them on slave1, slave3
> > > and slave4, for example)?
> >
> > HBase uses zookeeper mostly to do cluster membership management, and
> > ZK requires fast (hopefully dedicated) IO. The slave nodes are usually
> > IO hungry. This isn't compatible.
> >
> > Also your cluster will only be as reliable as your master machine, so
> > having 3 independent nodes for ZK doesn't really make sense. At your
> > level, I would just put a standalone ZooKeeper server on the master
> > machine.
> >
> > > - Do I really need the SecondaryNameNode? Can I disable it completely, or
> > > should I get one more instance and perhaps run it with a ZooKeeper (while
> > > the other 2 ZooKeepers remain small instances)?
> >
> > The SecondaryNameNode isn't a backup NameNode, it's really part of the
> > NameNode (it periodically checkpoints the namespace for it).
> > See http://wiki.apache.org/hadoop/FAQ#A7
> >
> > > - Is it wise to run Hadoop & HBase slaves on the same instance, or should
> > > I separate them?
> >
> > Running them on the same instance is the best thing to do, see
> > http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html
> >
> > >
> > > Also, how many resources (RAM, I/O) should I be giving each component?
> > > Some things are clear, like: put the Hadoop data directory on several
> > > block devices for efficient I/O. But others are not: is HBase CPU or RAM
> > > bound? Will Hadoop benefit from lots of RAM?
> >
> > HBase is a database, give it all the RAM you can. Also it's often
> > IO-bound, especially on EC2 because the IO is so poor. The DataNode
> > and TaskTracker don't really require much more than the defaults. If
> > you run MapReduce jobs, then what they are bound by depends on the
> > jobs you are running. Usually it's IO, and IO is poor on EC2
> > (I like to repeat that because people underestimate how much slower it
> > is).
> >
> > >
> > > Architecture references will be highly appreciated :)
> >
> > There's not much more to say, it's pretty straightforward.
> >
> > >
> > > Thank you for reading,
> > > Would love to hear your thoughts on this.
> > >
> > > Maxim.
> > >
> >
>
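
To make J-D's "really tight with m1.large" remark concrete, here is a rough sketch of the RAM arithmetic for one slave node. Only the 7.5GB total and the 4-6GB HBase heap come from the thread. The daemon heaps, the number of task slots, the per-task heap and the OS reserve are assumed round numbers for illustration, not measured or recommended values.

```python
# Back-of-the-envelope RAM budget for one slave node running a DataNode,
# a TaskTracker and an HBase RegionServer together.
# Only the 7.5 GB total and the 4-6 GB HBase heap come from the thread;
# every other number below is an illustrative assumption.

M1_LARGE_RAM_GB = 7.5  # m1.large, as listed in the thread


def remaining_ram_gb(total_gb, hbase_heap_gb,
                     daemon_heap_gb=1.0,   # assumed heap for DataNode and for TaskTracker
                     task_slots=4,         # assumed number of concurrent task JVMs
                     task_heap_gb=0.2,     # assumed heap per task JVM
                     os_reserve_gb=1.0):   # assumed reserve for OS and page cache
    """Return the RAM left over on the node (negative means over-committed)."""
    used = (hbase_heap_gb
            + 2 * daemon_heap_gb           # DataNode + TaskTracker
            + task_slots * task_heap_gb
            + os_reserve_gb)
    return total_gb - used


# Evaluate both ends of the recommended HBase heap range.
budget = {heap: remaining_ram_gb(M1_LARGE_RAM_GB, heap) for heap in (4.0, 6.0)}

for heap, left in budget.items():
    print(f"HBase heap {heap:.0f} GB -> {left:+.1f} GB left")
```

Under these assumed numbers the node is over-committed at both ends of the 4-6GB range, which is the point of the advice: either the HBase heap or the MapReduce slots have to shrink on an instance of this size.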

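The "2*n+1:[ZooKeeper]" sizing in the original question follows from ZooKeeper's majority rule: an ensemble stays available only while more than half of its servers are up. A minimal sketch of that arithmetic (the function names are made up for this example, they are not part of any ZooKeeper API):

```python
# ZooKeeper majority-quorum arithmetic.
# Function names are hypothetical helpers for illustration only.

def zk_majority(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def zk_tolerated_failures(ensemble_size):
    """How many servers can fail while a majority is still up."""
    return ensemble_size - zk_majority(ensemble_size)


# Ensemble sizes of the form 2*n+1 tolerate exactly n failures.
table = {size: zk_tolerated_failures(size) for size in (1, 3, 5)}

for size, failures in table.items():
    print(f"{size} server(s): majority {zk_majority(size)}, tolerates {failures} failure(s)")
```

Three ZooKeeper nodes therefore survive the loss of one ZooKeeper server, but as J-D notes, that does not help while the rest of the cluster still depends on a single master machine.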