hadoop-hdfs-dev mailing list archives

From Robert Evans <ev...@yahoo-inc.com>
Subject Re: ioreply
Date Mon, 30 Apr 2012 16:05:43 GMT
It definitely sounds interesting, kind of like gridmix.  I think that there are three big issues:

  1. Where are you going to store all of the data, or are you just going to generate random
data?  If it is random data then you can do this almost entirely from an anonymised version
of the audit logs (you would need something to record the lengths of the writes/reads, probably
on the DataNodes themselves).
  2. How are you going to deal with multiple machines and network saturation?  A typical Hadoop
cluster is going to have accesses from many many different machines and in aggregate is likely
to saturate the network connection from any single box.  You will need a way to replay this
from many different machines, probably all machines in the cluster, preferably in at least
a slightly coordinated way.
  3. I assume that for most clusters HDFS is primarily accessed from within that cluster,
by MapReduce jobs.  Yes, there is a lot of HBase too, and there are probably similar problems
with that.  The JobTracker/ResourceManager tries very hard to put jobs close to the data:
the same rack most of the time, and the same node some of the time.  Because HDFS is not deterministic
in how it assigns blocks, we are likely to see very different performance characteristics with
respect to the locality of accesses when replaying a log than we saw on the original.  I
don't think that this is super critical, but if this is not addressed and we optimize for
these benchmarks we are likely going to optimize more for remote accesses than a typical cluster would see.
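
A minimal sketch of points 1 and 2, not an existing Hadoop tool: it anonymises one audit log line and picks which replay machine should handle it, so that every operation from one original client goes to the same replayer. The key=value line layout (ugi, ip, cmd, src, dst), the salt, and the names anonymise and replay_host are assumptions made for illustration. The audit line carries no read/write lengths, so those would still have to come from somewhere else, as noted in point 1.

```python
import hashlib
import re

# Assumption: each audit line is "<date> <time> ... key=value key=value ...".
FIELD_RE = re.compile(r"(\w+)=(\S+)")


def _digest(salt, value):
    """Salted, truncated SHA-256 so names cannot be read back from the trace."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]


def _anon_path(salt, path):
    """Hash each path component separately so directory depth is preserved."""
    if path in (None, "null"):
        return None
    return "/" + "/".join(_digest(salt, part) for part in path.split("/") if part)


def anonymise(line, salt):
    """Keep the timestamp and operation; replace user, client, and paths with hashes."""
    fields = dict(FIELD_RE.findall(line))
    return {
        "time": " ".join(line.split()[:2]),  # first two tokens: date and time
        "user": _digest(salt, fields.get("ugi", "")),
        "client": _digest(salt, fields.get("ip", "")),
        "cmd": fields.get("cmd"),
        "src": _anon_path(salt, fields.get("src")),
        "dst": _anon_path(salt, fields.get("dst")),
    }


def replay_host(record, n_hosts):
    """Map an anonymised client id onto one of n_hosts replay machines."""
    return int(record["client"], 16) % n_hosts
```

Each replay machine would then keep only the records whose replay_host value matches its own index and issue them in timestamp order. Keeping the replayers' clocks aligned is a separate problem that this sketch does not address.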

I think it is a great idea; it is just going to be a lot of work to get it right.

--Bobby Evans

On 4/28/12 8:15 PM, "Colin McCabe" <cmccabe@alumni.cmu.edu> wrote:

Here is an interesting idea: recording traces of the filesystem
operations applications do, and allowing these traces to be replayed

> ioreplay is mainly intended for replaying of recorded (using strace) IO traces, which
> is useful for standalone benchmarking. It provides many features to ensure validity of
> such measurements.


Sounds like something we should consider doing for HDFS performance testing...

