Subject: Re: Read/write dependency wrt total data size on hdfs
From: Alex Loddengaard
To: core-user@hadoop.apache.org
Date: Thu, 18 Jun 2009 10:55:05 -0700
Message-ID: <623d9cf40906181055i300cbed2t5469664694e57614@mail.gmail.com>

I'm a little confused about what your question is. Are you asking why HDFS
shows consistent read/write speeds even as your cluster stores more and more
data?

If so, two HDFS bottlenecks that can change read/write performance as used
capacity grows are name node (NN) RAM and the amount of data each of your
data nodes (DNs) is storing. If you have so much metadata (lots of files,
blocks, etc.) that the NN Java process uses most of your NN machine's memory,
you'll see a big decrease in performance. This bottleneck usually only shows
up on large clusters with tons of metadata, though a small cluster with a
wimpy NN machine can hit it too. Similarly, if each of your DNs is storing
close to its capacity, reads and writes will begin to slow down, as each node
becomes responsible for streaming more and more data in and out.

Does that make sense? If you were to fill your cluster to 80-90%, I imagine
you'd see a decrease in read/write performance, depending on the tests you're
running, though I can't say I've run that test myself. I'm merely speculating.
(Two rough sketches for checking these numbers and timing a simple read/write
are appended after the quoted message below.)

Hope this clears things up.

Alex

On Thu, Jun 18, 2009 at 9:30 AM, Wasim Bari wrote:

> Hi,
>      I am storing data on an HDFS cluster (4 machines). I have noticed that
> read/write performance is not much affected by the amount of data already
> stored on HDFS (the total data size of HDFS). I have used only 20-30% of the
> cluster's capacity and have not filled it completely. Can someone explain why
> this is so? Does HDFS promise this behaviour, or am I missing something?
>
> Thanks,
>
> Wasim
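For anyone who wants to put rough numbers on the two bottlenecks described above,
here is a minimal sketch (mine, not something from the thread) that counts the
namespace objects the name node has to keep in memory. It assumes a 0.19/0.20-era
Hadoop client jar plus the site configuration on the classpath; the class name and
the choice of "/" are just illustrations. The per-datanode capacity and usage that
matter for the second bottleneck are what "hadoop dfsadmin -report" prints from the
command line.

// Rough proxy for name node metadata load: every file, directory and block
// becomes an object in the NN heap, so these counts track how much NN memory
// the namespace needs.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up fs.default.name from the site config
    FileSystem fs = FileSystem.get(conf);

    // Summarize the whole namespace; use a subtree instead of "/" on a huge cluster.
    ContentSummary summary = fs.getContentSummary(new Path("/"));
    System.out.println("Directories: " + summary.getDirectoryCount());
    System.out.println("Files:       " + summary.getFileCount());
    System.out.println("Bytes:       " + summary.getLength());

    fs.close();
  }
}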
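And since the reply speculates about what a performance test would show as the
cluster fills up, here is an equally rough single-client sketch that times one
streaming write and one streaming read through the FileSystem API. The path and
sizes are made up for illustration; the TestDFSIO benchmark that ships with Hadoop
is the more usual way to measure this across a whole cluster.

// Very rough single-client throughput check: write one file, read it back,
// and report MB/s for each direction.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SimpleDfsIoCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path testFile = new Path("/tmp/dfs-io-check");   // hypothetical test path
    byte[] buffer = new byte[64 * 1024];
    long totalBytes = 256L * 1024 * 1024;            // 256 MB keeps the run short

    long start = System.currentTimeMillis();
    FSDataOutputStream out = fs.create(testFile, true);
    for (long written = 0; written < totalBytes; written += buffer.length) {
      out.write(buffer);
    }
    out.close();
    double writeSecs = (System.currentTimeMillis() - start) / 1000.0;

    start = System.currentTimeMillis();
    FSDataInputStream in = fs.open(testFile);
    while (in.read(buffer) != -1) {
      // just stream the data back to the client
    }
    in.close();
    double readSecs = (System.currentTimeMillis() - start) / 1000.0;

    double mb = totalBytes / (1024.0 * 1024.0);
    System.out.printf("write: %.1f MB/s, read: %.1f MB/s%n", mb / writeSecs, mb / readSecs);

    fs.delete(testFile, true);
    fs.close();
  }
}

Running it a few times at the current 20-30% utilization and again as the cluster
fills up would be one way to check; single-client numbers are noisy, but a
consistent drop would line up with the DN-side explanation above.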