Subject: Re: Read/write dependency wrt total data size on hdfs
From: Alex Loddengaard
To: core-user@hadoop.apache.org
Date: Thu, 18 Jun 2009 10:55:05 -0700
Message-ID: <623d9cf40906181055i300cbed2t5469664694e57614@mail.gmail.com>

I'm a little confused about what your question is. Are you asking why HDFS
shows consistent read/write speeds even as your cluster stores more and more
data?

If so, two HDFS bottlenecks that can change read/write performance as used
capacity grows are name node (NN) RAM and the amount of data each of your
data nodes (DNs) is storing. If you have so much metadata (lots of files,
blocks, etc.) that the NN Java process uses most of your NN machine's memory,
you'll see a big decrease in performance. This bottleneck usually only shows
up on large clusters with tons of metadata, though a small cluster with a
wimpy NN machine can hit it too. Similarly, if each of your DNs is storing
close to its capacity, reads and writes will begin to slow down, as each node
becomes responsible for streaming more and more data in and out.

Does that make sense? If you were to fill your cluster to 80-90%, I imagine
you'd see a decrease in read/write performance, depending on the tests you're
running, though I can't say I've run that test myself. I'm merely speculating.
(Two rough sketches for checking these numbers and timing a simple read/write
are appended after the quoted message below.)

Hope this clears things up.

Alex

On Thu, Jun 18, 2009 at 9:30 AM, Wasim Bari wrote:

> Hi,
>      I am storing data on an HDFS cluster (4 machines). I have noticed that
> read/write performance is not much affected by the amount of data already
> stored on HDFS (the total data size of HDFS). I have used only 20-30% of the
> cluster's capacity and have not filled it completely. Can someone explain why
> this is so? Does HDFS promise this behaviour, or am I missing something?
>
> Thanks,
>
> Wasim
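For anyone who wants to put rough numbers on the two bottlenecks described above,
here is a minimal sketch (mine, not something from the thread) that counts the
namespace objects the name node has to keep in memory. It assumes a 0.19/0.20-era
Hadoop client jar plus the site configuration on the classpath; the class name and
the choice of "/" are just illustrations. The per-datanode capacity and usage that
matter for the second bottleneck are what "hadoop dfsadmin -report" prints from the
command line.

// Rough proxy for name node metadata load: every file, directory and block
// becomes an object in the NN heap, so these counts track how much NN memory
// the namespace needs.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up fs.default.name from the site config
    FileSystem fs = FileSystem.get(conf);

    // Summarize the whole namespace; use a subtree instead of "/" on a huge cluster.
    ContentSummary summary = fs.getContentSummary(new Path("/"));
    System.out.println("Directories: " + summary.getDirectoryCount());
    System.out.println("Files:       " + summary.getFileCount());
    System.out.println("Bytes:       " + summary.getLength());

    fs.close();
  }
}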
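And since the reply speculates about what a performance test would show as the
cluster fills up, here is an equally rough single-client sketch that times one
streaming write and one streaming read through the FileSystem API. The path and
sizes are made up for illustration; the TestDFSIO benchmark that ships with Hadoop
is the more usual way to measure this across a whole cluster.

// Very rough single-client throughput check: write one file, read it back,
// and report MB/s for each direction.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SimpleDfsIoCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path testFile = new Path("/tmp/dfs-io-check");   // hypothetical test path
    byte[] buffer = new byte[64 * 1024];
    long totalBytes = 256L * 1024 * 1024;            // 256 MB keeps the run short

    long start = System.currentTimeMillis();
    FSDataOutputStream out = fs.create(testFile, true);
    for (long written = 0; written < totalBytes; written += buffer.length) {
      out.write(buffer);
    }
    out.close();
    double writeSecs = (System.currentTimeMillis() - start) / 1000.0;

    start = System.currentTimeMillis();
    FSDataInputStream in = fs.open(testFile);
    while (in.read(buffer) != -1) {
      // just stream the data back to the client
    }
    in.close();
    double readSecs = (System.currentTimeMillis() - start) / 1000.0;

    double mb = totalBytes / (1024.0 * 1024.0);
    System.out.printf("write: %.1f MB/s, read: %.1f MB/s%n", mb / writeSecs, mb / readSecs);

    fs.delete(testFile, true);
    fs.close();
  }
}

Running it a few times at the current 20-30% utilization and again as the cluster
fills up would be one way to check; single-client numbers are noisy, but a
consistent drop would line up with the DN-side explanation above.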