hadoop-hdfs-issues mailing list archives

From "Konstantin Shvachko (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-1338) Improve TestDFSIO
Date Tue, 10 Aug 2010 22:31:17 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897070#action_12897070 ]

Konstantin Shvachko commented on HDFS-1338:
-------------------------------------------

The DFSIO benchmark is designed to measure HDFS data transfer performance only.
TestDFSIO is not intended to benchmark typical MR usage patterns.
It intentionally avoids any overhead or optimizations induced by the MR framework.
The MR scheduler should not be able to do any optimization for TestDFSIO.
It's a simple and straightforward benchmark, and I'd prefer to keep it that way.

It seems you are talking about a different benchmark, one that would measure the MR framework
optimizations. This makes sense, and it is very sad that we still don't have any benchmarks
dedicated to this area, unless I am missing something. I think the DFSIO framework can be used for
this new benchmark.

What are the main objectives for the new benchmark? As Arun proposed, it should be able to
distinguish between node-local, rack-local and off-switch data transfers. Anything else?
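
To make the three categories concrete, here is a minimal sketch of how a benchmark could label each read and report throughput per category. It is not TestDFSIO code: the `Location` type, the function names, and the host and rack strings are all hypothetical, and it assumes the reader is always served by the closest available replica.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Sequence, Tuple


@dataclass(frozen=True)
class Location:
    """Hypothetical placement record: a host name and the rack it sits in."""
    host: str
    rack: str


def classify_read(reader: Location, replicas: Sequence[Location]) -> str:
    """Return the best locality class available to the reader.

    Assumption: the client is served by the closest replica, so a replica
    on the same host wins over one in the same rack, which wins over any other.
    """
    if any(r.host == reader.host for r in replicas):
        return "node-local"
    if any(r.rack == reader.rack for r in replicas):
        return "rack-local"
    return "off-switch"


def throughput_by_locality(
    reads: Iterable[Tuple[Location, Sequence[Location], float, float]]
) -> Dict[str, float]:
    """Aggregate MB/s per locality class.

    Each read is (reader, replicas, megabytes, seconds). Throughput per class
    is total megabytes divided by total seconds for that class.
    """
    totals: Dict[str, Tuple[float, float]] = {}
    for reader, replicas, megabytes, seconds in reads:
        kind = classify_read(reader, replicas)
        total_mb, total_secs = totals.get(kind, (0.0, 0.0))
        totals[kind] = (total_mb + megabytes, total_secs + seconds)
    return {kind: mb / secs for kind, (mb, secs) in totals.items() if secs > 0}
```

Reporting the three classes separately is what would let one check whether node-local, rack-local or off-switch reads improved or degraded between runs.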

In my view Hong's bullet points are well-formulated practices for running DFSIO on a cluster
so that the results are meaningful.
I'd add one thing: turn off logging.

TestDFSIO is a part of mapreduce now, so this jira should rather be filed there. We can keep
the discussion here, and create an MR jira later to commit the code once a patch is ready.

> Improve TestDFSIO
> -----------------
>
>                 Key: HDFS-1338
>                 URL: https://issues.apache.org/jira/browse/HDFS-1338
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Arun C Murthy
>
> Currently the read test in TestDFSIO benchmark just opens a large side file and measures
> the read performance. The MR scheduler has no opportunity to do *any* optimization for the
> TestDFSIO MR application. The side-effect of this is that it is *very* hard to do any meaningful
> analysis of the results of the benchmark i.e. to check if node-local or rack-local or off-switch
> read performance improved/degraded.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

