hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: hdfs > 100T?
Date Thu, 10 Apr 2008 16:18:02 GMT

Hadoop also does much better with spindles spread across many machines.
Putting 16 TB on each of two nodes is distinctly sub-optimal on many fronts.
Much better to put 0.5-2TB on 16-64 machines.  With 2x1TB SATA drives, your
cost and performance are both likely to be better than two machines with
storage trays (aggressive pricing right now on minimal machines with 16TB in
two storage trays from a major vendor is about $18K, while you should be able
to populate a 1U node with 2TB of disk for about $1,500; 16 x $1.5K = $24K <
2 x $18K = $36K).  The rack space requirements are about the same, but the
tray solution may draw slightly less power.
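The cost comparison above can be sketched as back-of-envelope arithmetic (the dollar figures are the hypothetical round numbers quoted in this message, not vendor quotes):

```python
# Back-of-envelope comparison of the two 32TB configurations above.
# Assumed figures (from this message, rounded): a 16TB tray-based
# machine at ~$18K, a 1U node with 2TB of disk at ~$1.5K.

tray_machine_cost = 18_000   # $ per tray-based machine with 16TB
tray_machines = 2            # two machines -> 32TB total
small_node_cost = 1_500      # $ per 1U node with 2TB
small_nodes = 16             # sixteen nodes -> 32TB total

tray_total = tray_machines * tray_machine_cost    # 36,000
cluster_total = small_nodes * small_node_cost     # 24,000

print(f"tray solution: ${tray_total:,}")
print(f"small nodes:   ${cluster_total:,}")
print(f"savings:       ${tray_total - cluster_total:,}")
```

And the small-node cluster buys you 16 sets of spindles and NICs instead of 2, which is where the performance edge comes from.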

On the other hand, your performance requirements are so low that you might
be just as well off getting something like a Sun Thumper that can
accommodate all of your storage in a single chassis.

We use a mixture of both kinds of solution in our system.  We have nearly a
billion files stored on tray-based machines using MogileFS.  One scaling
constraint there is simply the management and configuration of nodes, so
fewer machines is a small win.  We also have a modest number of TBs in a
more traditional hadoop cluster with small machines.

On 4/10/08 12:57 AM, "Allen Wittenauer" <aw@yahoo-inc.com> wrote:

> On 4/10/08 4:42 AM, "Todd Troxell" <ttroxell@debian.org> wrote:
>> Hello list,
>     Howdy.
>> I am interested in using HDFS for storage, and for map/reduce only
>> tangentially.  I see clusters mentioned in the docs with many many nodes and
>> 9TB of disk.
>> Is HDFS expected to scale to > 100TB?
>     We're running file systems in the 2-6PB range.
>> Does it require massive parallelism to scale to many files?  For instance, do
>> you think it would slow down drastically in a 2 node 32T config?
>     The biggest gotcha is the name node.  You need to feed it lots and lots
> of memory.  Keep in mind that Hadoop functions better with fewer large files
> than many small ones.
