hadoop-hdfs-issues mailing list archives

From "Hong Tang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-1338) Improve TestDFSIO
Date Tue, 10 Aug 2010 18:49:17 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896997#action_12896997 ]

Hong Tang commented on HDFS-1338:
---------------------------------

I think the goal of TestDFSIO is to benchmark peak HDFS throughput under a typical MR usage
pattern. This means:
- Files should be replicated.
- Files should be spread relatively evenly across nodes. (Run one map per node on the cluster,
and have each map write out the same amount of data.)
- Locality information should be exposed to the MR framework correctly. (Should just use FileInputFormat
instead of writing a side file.)
- The dataset should not fit in the OS buffer cache. (Configure the benchmark such that
total amount of data > total RAM.)
- Throughput should be aggregated as a time series, and we should ignore the ramp-up and cool-down
phases of the execution. (The output of each map should be a time series of counters of bytes
read so far. The report may then calculate the max and average over the middle third of the time series.)
- We should minimize the variation introduced by MR scheduling. (Run one wave of maps, and increase the
block size so that each map runs for at least 20 to 30 seconds.)
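
To make the time-series point concrete, here is a minimal sketch (in Python, for brevity) of how a report step might compute the max and average aggregate throughput over the middle third of a run. It is an illustration only, not TestDFSIO code: the function names, the (timestamp, cumulative bytes) sample format, and the fixed `step` interval are all assumptions.

```python
from bisect import bisect_right


def bytes_at(samples, t):
    """Cumulative bytes one map had read by time t.

    `samples` is a hypothetical per-map time series: a list of
    (timestamp_seconds, cumulative_bytes) pairs sorted by timestamp.
    The counter is treated as a step function (last sample at or before t).
    """
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    return samples[i - 1][1] if i > 0 else 0


def mid_third_throughput(per_map_samples, step=1.0):
    """Return (max, average) aggregate bytes/sec over the middle third.

    The first and last thirds of the run are skipped so that ramp-up
    and cool-down do not affect the reported numbers.
    """
    start = min(s[0][0] for s in per_map_samples)
    end = max(s[-1][0] for s in per_map_samples)
    third = (end - start) / 3.0
    lo, hi = start + third, end - third

    rates = []
    t = lo
    while t + step <= hi + 1e-9:
        # Sum the per-map counters to get cluster-wide bytes read.
        before = sum(bytes_at(s, t) for s in per_map_samples)
        after = sum(bytes_at(s, t + step) for s in per_map_samples)
        rates.append((after - before) / step)
        t += step

    if not rates:
        raise ValueError("run too short for the chosen step")
    return max(rates), sum(rates) / len(rates)
```

For a single map that reads slowly during its first and last three seconds, the middle third reports the steady-state rate, whereas total bytes divided by total time would understate it.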

> Improve TestDFSIO
> -----------------
>
>                 Key: HDFS-1338
>                 URL: https://issues.apache.org/jira/browse/HDFS-1338
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Arun C Murthy
>
> Currently the read test in TestDFSIO benchmark just opens a large side file and measures
> the read performance. The MR scheduler has no opportunity to do *any* optimization for the
> TestDFSIO MR application. The side-effect of this is that it is *very* hard to do any meaningful
> analysis of the results of the benchmark i.e. to check if node-local or rack-local or off-switch
> read performance improved/degraded.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

