Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hdfs-issues@hadoop.apache.org
Message-ID: <3848860.259311281466157309.JavaMail.jira@thor>
Date: Tue, 10 Aug 2010 14:49:17 -0400 (EDT)
From: "Hong Tang (JIRA)" <jira@apache.org>
To: hdfs-issues@hadoop.apache.org
Subject: [jira] Commented: (HDFS-1338) Improve TestDFSIO
In-Reply-To: <8732165.255241281458477153.JavaMail.jira@thor>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HDFS-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896997#action_12896997 ] 

Hong Tang commented on HDFS-1338:
---------------------------------

I think the goal of TestDFSIO is to benchmark peak HDFS throughput under a typical MR usage pattern. This means:
- Files should be replicated.
- Files should be spread across nodes relatively evenly. (Run one map per node on the cluster, and have the maps write out data evenly.)
- Locality information should be exposed to the MR framework correctly. (Should just use FileInputFormat instead of writing a side file.)
- The dataset should not fit in the OS buffer cache. (Configure the benchmark such that total amount of data > total RAM.)
- Throughput should be aggregated as a time series, and we should ignore the ramp-up and cool-down phases of the execution. (The output of each map should be a time series of counters of bytes read so far. The report can then calculate the max and average over the middle third of the time series.)
- We should minimize variation from MR scheduling. (Run one wave of maps, and increase the block size so that each map runs for at least 20 to 30 seconds.)
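
To make the time-series point concrete, here is a rough sketch of the reporting step in Python. It is not TestDFSIO code. The function names, the fixed sampling interval, and the sample counter values are all assumptions made for illustration. Each map is assumed to emit its cumulative "bytes read so far" counter at the same fixed interval.

```python
from statistics import mean


def interval_rates(cumulative_bytes, interval_s):
    """Turn cumulative byte counters into per-interval throughput (bytes/s)."""
    pairs = zip(cumulative_bytes, cumulative_bytes[1:])
    return [(after - before) / interval_s for before, after in pairs]


def aggregate(per_map_rates):
    """Sum throughput across maps, interval by interval.

    zip() stops at the shortest series, so trailing samples from
    longer-running maps are dropped (an assumption of this sketch).
    """
    return [sum(rates) for rates in zip(*per_map_rates)]


def middle_third(series):
    """Drop the first and last third to skip ramp-up and cool-down."""
    n = len(series)
    return series[n // 3 : n - n // 3]


def steady_state_summary(per_map_counters, interval_s):
    """Return (max, average) cluster throughput over the middle third.

    Requires at least two samples per map; otherwise there is no interval.
    """
    rates = [interval_rates(c, interval_s) for c in per_map_counters]
    window = middle_third(aggregate(rates))
    return max(window), mean(window)


# Hypothetical counters from two maps, sampled once per second.
map_a = [0, 10, 40, 70, 100, 130, 140]
map_b = [0, 20, 50, 80, 110, 140, 150]
peak, average = steady_state_summary([map_a, map_b], interval_s=1.0)
print(peak, average)
```

In this made-up data the first and last intervals are slower than the rest, and the middle-third window leaves them out of both the max and the average.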

> Improve TestDFSIO
> -----------------
>
>                 Key: HDFS-1338
>                 URL: https://issues.apache.org/jira/browse/HDFS-1338
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Arun C Murthy
>
> Currently the read test in TestDFSIO benchmark just opens a large side file and measures the read performance. The MR scheduler has no opportunity to do *any* optimization for the TestDFSIO MR application. The side-effect of this is that it is *very* hard to do any meaningful analysis of the results of the benchmark i.e. to check if node-local or rack-local or off-switch read performance improved/degraded.
