hadoop-common-user mailing list archives

From Allen Wittenauer <awittena...@linkedin.com>
Subject Re: Distributed Clusters
Date Thu, 08 Apr 2010 17:51:05 GMT

On Apr 7, 2010, at 10:50 PM, James Seigel wrote:

> I am new to this group, and relatively new to hadoop. 

Welcome to the community, James. :)

> I am looking at building a large cluster.  I was wondering if anyone has any best practices
> for a cluster in the hundreds of nodes?

Take a look at the 'Hadoop 24/7' presentation (on the Hadoop wiki preso page) I did for ApacheCon
EU last year.  It covers a lot of the "now that I have a grid, what do I do?" situations.

> As well, has anyone had experience with a cluster spanning multiple data centers?  Is
> this a bad practice? moderately bad practice?  insane?

Right now, it generally falls into the insane category unless you have REALLY REALLY REALLY
low latency and high bandwidth.  The heartbeats between nodes, issues with block placement,
etc., make it highly likely that you will saturate the link and/or split the cluster into multiple pieces.

> Is it better to build the 1000 node cluster in a single data center?  Do you back one
> of these things up to a second data center or a different 1000 node cluster?

We're currently going with a 'multiple grids in one data center' strategy.  Our 'Source of
Truth' data is from another source, meaning we could (theoretically) rebuild the grid from
that source if we were to get decimated by dinosaurs.  [That source of truth has a much better
backup/DR strategy.]

> Sorry, I am asking crazy questions...I am just wanting to learn the meta issues and opportunities
> with making clusters.

These are pretty normal questions.  We should probably create a FAQ or something on the wiki.
