hadoop-mapreduce-user mailing list archives

From: Virajith Jalaparti <virajit...@gmail.com>
Subject: Re: Intermediate data size of Sort example
Date: Wed, 29 Jun 2011 14:48:31 GMT
Great, that makes a lot of sense now! Thanks a lot, Harsh!

A related question: what does REDUCE_SHUFFLE_BYTES represent? Is it the size
of the sorted output of the shuffle phase?
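
Just to check my own back-of-the-envelope reading of the numbers (so please
correct me if I am misinterpreting the counters): each reducer fetches
roughly 25GB of map output, and every additional on-disk merge pass over the
spilled data re-reads it from local disk, so about 78GB / 25GB ≈ 3 local-disk
read passes per reducer, and 4 reducers x ~78GB ≈ 310GB of intermediate reads
in total, while only the ~25GB actually fetched from the map side shows up as
shuffle bytes.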

Thanks,
Virajith

On Wed, Jun 29, 2011 at 2:10 PM, Harsh J <harsh@cloudera.com> wrote:

> Virajith,
>
> The FILE_BYTES_READ counter also counts all the reads of spilled records
> done while sorting/merging the various task outputs between the map and
> reduce phases.
>
> On Wed, Jun 29, 2011 at 6:30 PM, Virajith Jalaparti
> <virajith.j@gmail.com> wrote:
> > I would like to clarify my earlier question: I found that each reducer
> > reports FILE_BYTES_READ as around 78GB, HDFS_BYTES_WRITTEN as 25GB, and
> > REDUCE_SHUFFLE_BYTES as 25GB. So, why is FILE_BYTES_READ 78GB and not
> > just 25GB?
> >
> > Thanks,
> > Virajith
> >
> > On Wed, Jun 29, 2011 at 10:29 AM, Virajith Jalaparti <virajith.j@gmail.com>
> > wrote:
> >>
> >> Hi,
> >>
> >> I was running the Sort example in Hadoop 0.20.2
> >> (hadoop-0.20.2-examples.jar) over an input data size of 100GB (generated
> >> using randomwriter) with 800 mappers (I was using a 128MB HDFS block size)
> >> and 4 reducers, on a 3-machine cluster with 2 slave nodes. While the input
> >> and output were 100GB, I found that the intermediate data sent to each
> >> reducer was around 78GB, making the total intermediate data around 310GB.
> >> I don't really understand why there is an increase in data size, given
> >> that the Sort example just uses the identity mapper and identity reducer.
> >> Could someone please help me out with this?
> >>
> >> Thanks!!
> >
> >
>
>
>
> --
> Harsh J
>
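
PS: In case it is useful to anyone else digging through these numbers, below
is a minimal sketch of pulling the job-wide counter totals programmatically.
It assumes the old org.apache.hadoop.mapred API that ships with 0.20.2, and
the class name and job id are only placeholders:

    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.RunningJob;

    public class CounterDump {
      public static void main(String[] args) throws Exception {
        // Connect to the JobTracker configured in the local JobConf.
        JobClient client = new JobClient(new JobConf());
        // args[0] is a job id such as job_201106290001_0001 (placeholder).
        RunningJob job = client.getJob(JobID.forName(args[0]));
        Counters counters = job.getCounters();
        // Job-wide totals; per-task values are on the task detail pages.
        long fileRead = counters.findCounter("FileSystemCounters",
            "FILE_BYTES_READ").getCounter();
        long hdfsWritten = counters.findCounter("FileSystemCounters",
            "HDFS_BYTES_WRITTEN").getCounter();
        long shuffled = counters.findCounter(
            "org.apache.hadoop.mapred.Task$Counter",
            "REDUCE_SHUFFLE_BYTES").getCounter();
        System.out.println("FILE_BYTES_READ      = " + fileRead);
        System.out.println("HDFS_BYTES_WRITTEN   = " + hdfsWritten);
        System.out.println("REDUCE_SHUFFLE_BYTES = " + shuffled);
      }
    }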
