hadoop-common-user mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: Hadoop cluster hardware details for big data
Date Wed, 06 Jul 2011 11:31:17 GMT
On 06/07/11 11:43, Karthik Kumar wrote:
> Hi,
> Has anyone here used hadoop to process more than 3TB of data? If so we
> would like to know how many machines you used in your cluster and
> about the hardware configuration. The objective is to know how to
> handle huge data in Hadoop cluster.

This is too vague a question. What do you mean by "process"? Scan through
some logs looking for values? You could do that on a single machine if
you weren't in a rush and had enough disks; you'd just be very IO
bound. To be honest, though, HDFS needs a minimum number of machines to
become fault tolerant. Do complex matrix operations that use lots of RAM
and CPU? You'll need more machines.

If your cluster has a blocksize of 512MB then a 3TB file fits into
(3*1024*1024)/512 blocks: 6144. With one task per block, you can't keep
more than 6144 machines busy on that file anyway - that's your
theoretical maximum, even if your name is Facebook or Yahoo!
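
To make that arithmetic easy to repeat for other block sizes, here is a
minimal sketch in Python. The helper name block_count is made up for
illustration, and it assumes binary units (1TB = 1024*1024MB), as in the
calculation above.

```python
MB = 1024 ** 2
TB = 1024 ** 4


def block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of HDFS blocks a file occupies (the last block may be partial)."""
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    # Integer ceiling division, so no floating point rounding is involved.
    return -(-file_size_bytes // block_size_bytes)


if __name__ == "__main__":
    # Same 3TB file at a few block sizes; 512MB is the case worked above.
    for block_mb in (64, 128, 512):
        blocks = block_count(3 * TB, block_mb * MB)
        print(f"{block_mb}MB blocks -> {blocks} blocks")
```

Smaller blocks raise the ceiling on parallelism, at the cost of more
blocks for the namenode to track.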

What you are looking for is something between 10 and 6144, the exact
number driven by
  - how much compute you need to do, and how fast you want it done
    (controls # of CPUs, RAM)
  - how much total HDD storage you anticipate needing
  - whether you want to do leading-edge GPU work (good performance on
    some tasks, but limited work per machine)
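
As a rough illustration of the storage factor only, the sketch below
estimates a lower bound on node count from disk capacity. Everything in
it is an assumption to be replaced with your own numbers: the function
name, 8TB of usable disk per node, and 25% headroom for temporary and
intermediate data. The replication factor of 3 is the HDFS default.
Compute or GPU requirements may well push the real number higher.

```python
import math


def nodes_for_storage(data_tb: float,
                      replication: int = 3,
                      usable_tb_per_node: float = 8.0,
                      headroom: float = 0.25) -> int:
    """Lower bound on node count driven purely by HDD capacity.

    data_tb            -- logical data size before replication
    replication        -- HDFS replication factor (3 is the default)
    usable_tb_per_node -- hypothetical usable disk per machine
    headroom           -- fraction of raw disk kept free for scratch space
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    raw_tb_needed = data_tb * replication / (1 - headroom)
    return max(1, math.ceil(raw_tb_needed / usable_tb_per_node))


if __name__ == "__main__":
    for size_tb in (3.0, 30.0, 300.0):
        print(f"{size_tb}TB of data -> at least {nodes_for_storage(size_tb)} nodes")
```

For 3TB the storage bound is tiny, which is why the compute side and the
fault-tolerance minimum usually decide the answer at that scale.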

You can use benchmarking tools like gridmix3 to get some more data on 
the characteristics of your workload, which you can then take to your 
server supplier to say "this is what we need, what can you offer?" 
Otherwise everyone is just guessing.

Remember also that you can add more racks later, but you will need to
plan ahead on datacentre space, power and, very importantly, how you are
going to expand the networking. Life is simplest if everything fits into
one rack, but if you plan to expand you need a roadmap for how to
connect that rack to new ones, which means adding a fast interconnect
between the different top-of-rack switches. You also need to worry about
how to get data in and out fast.
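
To put a number on "in and out", here is a back-of-the-envelope sketch
of how long it takes to move a dataset over a single link. The function
name and the efficiency parameter are made up for illustration; it
assumes the link is the only bottleneck and ignores disk speed and
replication traffic.

```python
TB = 1024 ** 4


def transfer_hours(data_bytes: int,
                   link_gbit_per_s: float,
                   efficiency: float = 1.0) -> float:
    """Hours needed to push data_bytes through one network link.

    link_gbit_per_s -- nominal link speed in gigabits per second (10^9 bits)
    efficiency      -- assumed fraction of nominal speed actually achieved
    """
    if link_gbit_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    seconds = (data_bytes * 8) / (link_gbit_per_s * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    for gbit in (1.0, 10.0):
        hours = transfer_hours(3 * TB, gbit)
        print(f"3TB over {gbit:g} Gbit/s: about {hours:.2f} hours")
```

At full line rate a 1 Gbit/s link needs over seven hours for 3TB, so the
ingest path deserves as much planning as the rack interconnect.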

