hadoop-hdfs-dev mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: [jira] [Created] (HDFS-2129) Simplify BlockReader to not inherit from FSInputChecker
Date Wed, 20 Jul 2011 22:22:35 GMT
On Wed, Jul 20, 2011 at 3:13 PM, Keren Ouaknine <kereno@gmail.com> wrote:

> Hello,
>
> Thank you for the script! I ran it and got the total execution time for
> cat-ing:
>
> major  minor  fs_in  fs_out  wall    user  sys   ctx_invol  ctx_vol
> 2      17049  0      0       1.23    1.60  0.11  18         1113
> 2      17170  0      0       1.22    1.61  0.10  22         1023
> 2      17326  0      0       1.22    1.61  0.10  33         1049
> 2      17222  0      0       1.22    1.61  0.11  23         1020
> 2      18047  0      0       1.22    1.62  0.09  18         1033
> 2      18259  0      0       *1.27*  1.61  0.11  23         1068
> 2      17555  0      0       1.22    1.62  0.09  35         1018
> 2      17633  0      0       1.22    1.61  0.10  21         1036
> 2      17459  0      0       1.22    1.61  0.10  32         1059
> 2      18040  0      0       1.22    1.61  0.10  32         1043
>
>
> Using reps_per_run(50) and num_trials(10), the script cats the file 50
> times per trial. Why not just 2 or so, given that from the second iteration
> on the file is in the buffer cache?
>

I usually want to gather a lot of datapoints in order to have good
confidence on a subsequent p-test. The timers from /usr/bin/time aren't that
fine-grained, so having more repetitions is useful.
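
As a rough illustration of why the number of data points matters, the sketch below summarizes one column of a results file and compares two files with Welch's t statistic. The function names `column_stats` and `welch_t` are hypothetical, and Welch's t-test is only an assumed stand-in for whichever significance test is actually applied; turning the statistic into a p-value still needs a t-distribution table or a statistics package.

```sh
#!/bin/sh
# Hypothetical helpers (not part of the benchmark script itself).

# Print "n mean sample_variance" for one named column of a results file
# that has a header row followed by one row per trial.
column_stats() {
    awk -v col="$2" '
        NR == 1 {
            # Locate the requested column by its header name.
            for (i = 1; i <= NF; i++) { if ($i == col) c = i }
            next
        }
        c && NF >= c { n++; s += $c; ss += $c * $c }
        END {
            if (n < 2) exit 1          # need at least two trials
            mean = s / n
            var = (ss - n * mean * mean) / (n - 1)
            if (var < 0) var = 0       # guard against rounding error
            printf "%d %.4f %.6f\n", n, mean, var
        }' "$1"
}

# Print Welch's t statistic for the same column in two results files.
welch_t() {
    a=$(column_stats "$1" "$3") || return 1
    b=$(column_stats "$2" "$3") || return 1
    echo "$a $b" | awk '{
        se = sqrt($3 / $1 + $6 / $4)   # standard error of the difference
        if (se == 0) exit 1
        printf "%.3f\n", ($2 - $5) / se
    }'
}
```

For example, `welch_t before.txt after.txt user` would compare user CPU time between two runs (both file names are placeholders). With more trials per file the standard error shrinks, so smaller differences become distinguishable from timer noise.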


> Also, I looked at the results and found an outlier (1.27). I would assume
> the execution time is longer due to load on the machine at the time?
>

Probably, or just a timer granularity issue.

Also, note that this time includes JVM startup time. So, it makes more sense
to use this to cat large files - from your results, it looks like you're
catting a fairly small one. I usually use at least 128MB or 256MB, so with
REPS_PER_RUN=50, it's many GB.
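
To make the sizing concrete, here is a small sketch that creates a local test file of a given size and computes how much data one trial reads. The helper names and the paths in the comments are assumptions for illustration only.

```sh
#!/bin/sh
# Hypothetical helpers for preparing a benchmark input file.

# Total data read by one trial, in MB: file size (MB) times repetitions.
total_mb_per_trial() {
    echo $(( $1 * $2 ))
}

# Create a local file of the given size in MB, filled with zeros.
# Usage: make_local_file PATH SIZE_MB
make_local_file() {
    dd if=/dev/zero of="$1" bs=1048576 count="$2" 2>/dev/null
}

# Example (not executed here; paths are placeholders):
#   make_local_file /tmp/128M-file 128
#   ./bin/hadoop fs -put /tmp/128M-file hdfs://localhost/128M-file
#   total_mb_per_trial 128 50    # MB read per trial with REPS_PER_RUN=50
```

The larger the total per trial, the smaller the share of the measured time that goes to JVM startup rather than to reading data.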

> I would like to get further information such as the CPU time and network
> bandwidth consumed per node for a command. Do you know if Cloudera adds
> hook points to CDH3 to measure these? Are there any other benchmarking
> scripts?
>
>
For multi-node benchmarks, we usually use the same tools as the rest of the
community, such as terasort and gridmix. For micro-benchmarking specific
patches, I usually devise a one-off benchmark to exercise the code path in
question. I've occasionally found it useful to do multinode tests while
running datanodes under a profiler, or with -Xprof, as well. But, to
directly answer your question, CDH3 doesn't have any special hooks beyond
what Apache Hadoop has.
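
For reference, a terasort run typically has three stages: generate, sort, validate. The wrapper below is only a sketch: `run_terasort`, the `EXAMPLES_JAR` variable, and its default value are assumptions, since the name and location of the examples jar differ between Hadoop releases and distributions.

```sh
#!/bin/sh
# Assumed placeholders; adjust both for the installation at hand.
HADOOP=${HADOOP:-./bin/hadoop}
EXAMPLES_JAR=${EXAMPLES_JAR:-hadoop-examples.jar}

# Generate ROWS rows (100 bytes each), sort them, and validate the output.
# Usage: run_terasort ROWS BASE_DIR
run_terasort() {
    rows=$1
    base=$2
    $HADOOP jar "$EXAMPLES_JAR" teragen "$rows" "$base/in" &&
    $HADOOP jar "$EXAMPLES_JAR" terasort "$base/in" "$base/out" &&
    $HADOOP jar "$EXAMPLES_JAR" teravalidate "$base/out" "$base/report"
}
```

Setting `HADOOP=echo` before calling `run_terasort` prints the three commands instead of running them, which is a cheap way to check the arguments first.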

-Todd


>
> On Mon, Jul 18, 2011 at 7:55 AM, Todd Lipcon <todd@cloudera.com> wrote:
>
>> For benchmarking CPU, I start a pseudo-distributed HDFS cluster, put a
>> smallish file on the local datanode (such that it fits in buffer cache),
>> and then use the following script with various parameters to look at the
>> CPU usage to cat the file. For example:
>>
>> $ REPS_PER_RUN=50 NUM_TRIALS=10 ./read-benchmark.sh \
>>     hdfs://localhost/128M-file /tmp/benchmark-results.txt
>>
>> Script:
>>
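>> #!/bin/sh -x
>> # Times repeated "hadoop fs -cat" runs; appends one row per trial to OUTPUT.
>> set -e
>> BINDIR=$(dirname "$0")
>>
>> INPUT=$1     # HDFS path of the file to cat
>> OUTPUT=$2    # local file that collects the timing rows
>> NUM_TRIALS=${NUM_TRIALS:-10}
>> HADOOP=${HADOOP:-./bin/hadoop}
>> HADOOP_FLAGS=${HADOOP_FLAGS:--Dio.file.buffer.size=$((64*1024))}
>> REPS_PER_RUN=${REPS_PER_RUN:-1}
>>
>> HEADER="major\tminor\tfs_in\tfs_out\twall\tuser\tsys\tctx_invol\tctx_vol\n"
>> # GNU time format: page faults, fs I/O, wall/user/sys seconds, ctx switches
>> TIME_FORMAT="%F\t%R\t%I\t%O\t%e\t%U\t%S\t%c\t%w"
>>
>> # Write the header only if the output file does not exist yet.
>> test -f "$OUTPUT" || printf "$HEADER" > "$OUTPUT"
>>
>> for x in $(seq 1 "$NUM_TRIALS") ; do
>>     # One timed JVM invocation cats INPUT REPS_PER_RUN times.
>>     /usr/bin/time --append -o "$OUTPUT" -f "$TIME_FORMAT" \
>>         $HADOOP fs $HADOOP_FLAGS -cat \
>>         $(for rep in $(seq 1 "$REPS_PER_RUN") ; do echo "$INPUT" ; done) \
>>         > /dev/null
>> done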
>>
>>
>> On Wed, Jul 6, 2011 at 1:16 AM, Keren Ouaknine <kereno@gmail.com> wrote:
>>
>> > Hello,
>> >
>> > I am working on the optimization of task scheduling for Hadoop and would
>> > like to benchmark with *Apache Hadoop's standard benchmarks*. So far, I
>> > have used my own scripts to measure and monitor. Where can I find the
>> > benchmarks you are referring to, please?
>> >
>> > Thanks,
>> > Keren
>> >
>> > On Wed, Jul 6, 2011 at 7:32 AM, Todd Lipcon (JIRA) <jira@apache.org>
>> > wrote:
>> >
>> > > Simplify BlockReader to not inherit from FSInputChecker
>> > > -------------------------------------------------------
>> > >
>> > >                 Key: HDFS-2129
>> > >                 URL: https://issues.apache.org/jira/browse/HDFS-2129
>> > >             Project: Hadoop HDFS
>> > >          Issue Type: Sub-task
>> > >          Components: hdfs client
>> > >            Reporter: Todd Lipcon
>> > >            Assignee: Todd Lipcon
>> > >
>> > >
>> > > BlockReader is currently quite complicated since it has to conform to the
>> > > FSInputChecker inheritance structure. It would be much simpler to implement
>> > > it standalone. Benchmarking indicates it's slightly faster, as well.
>> > >
>> > > --
>> > > This message is automatically generated by JIRA.
>> > > For more information on JIRA, see: http://www.atlassian.com/software/jira
>> > >
>> > >
>> > >
>> >
>> >
>> > --
>> > Keren Ouaknine
>> > Cell: +972 54 2565404
>> > Web: www.kereno.com
>> >
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>
>
> --
> Keren Ouaknine
> Cell: +972 54 2565404
> Web: www.kereno.com
>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
