hadoop-mapreduce-user mailing list archives

From Peter Ruch <rutschifen...@gmail.com>
Subject Filtering by value in Reducer
Date Mon, 11 May 2015 17:08:50 GMT
Hi,

I am currently playing around with Hadoop and have some problems when
trying to filter in the Reducer.

I extended the WordCount v1.0 example from the 2.7 MapReduce Tutorial with
some additional functionality
and added the option to filter on the value of each key,
e.g. to output only the key-value pairs where [[ value > threshold ]].

Filtering Code in Reducer
#####################################

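// Sum all counts received for this key.
int sum = 0;
for (IntWritable val : values) {
    sum += val.get();
}
// Emit the pair only if the total exceeds the threshold.
if (sum > threshold) {
    result.set(sum);
    context.write(key, result);
}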

#####################################

For a threshold smaller than any value, the above code works as expected and
the output contains all key-value pairs.
If I increase the threshold to 1, some pairs are missing from the output
although their values are larger than the threshold.

I tried to work out the error myself, but I could not get it to work as
intended. I use the exact Tutorial setup with Oracle JDK 8
on a CentOS 7 machine.

As far as I understand, the respective Iterable<...> in the Reducer already
contains all the observed values for a specific key.
Why is it possible that I am missing some of these key-value pairs then? It
only fails in very few cases. The input file is pretty large (250 MB),
so I also tried increasing the memory for the map and reduce steps,
but it did not help (I tried a lot of different things without success).
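
One possible cause worth checking, assuming the job driver is unchanged from
the tutorial: WordCount v1.0 registers the same class as combiner and as
reducer (job.setCombinerClass(IntSumReducer.class)). A combiner only sees the
values produced by a single map task, so a filter placed in that class would
also be applied to partial sums before the reduce step. The sketch below is a
plain-Java simulation of that effect, not Hadoop code; the class and method
names (CombinerFilterDemo, filteredSum, mapSplit, run) and the two tiny input
splits are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerFilterDemo {

    // Same logic as the reducer in this message: sum, then emit only if sum > threshold.
    public static Map<String, Integer> filteredSum(Map<String, List<Integer>> grouped,
                                                   int threshold) {
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            if (sum > threshold) {
                out.put(e.getKey(), sum);
            }
        }
        return out;
    }

    // Simulated map phase for one input split: emit (word, 1), grouped by word.
    public static Map<String, List<Integer>> mapSplit(String split) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String w : split.trim().split("\\s+")) {
            if (w.isEmpty()) {
                continue;
            }
            grouped.computeIfAbsent(w, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Simulated job: optionally run the filtering logic as a per-split combiner,
    // then run it once more as the reducer over the merged values.
    public static Map<String, Integer> run(List<String> splits, int threshold,
                                           boolean useCombiner) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String split : splits) {
            Map<String, List<Integer>> mapped = mapSplit(split);
            if (useCombiner) {
                // The combiner only sees this split's values, so it filters partial sums.
                for (Map.Entry<String, Integer> e : filteredSum(mapped, threshold).entrySet()) {
                    shuffled.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                            .add(e.getValue());
                }
            } else {
                for (Map.Entry<String, List<Integer>> e : mapped.entrySet()) {
                    shuffled.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                            .addAll(e.getValue());
                }
            }
        }
        return filteredSum(shuffled, threshold);
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("a a b", "b c");
        System.out.println(run(splits, 1, true));   // prints {a=2}
        System.out.println(run(splits, 1, false));  // prints {a=2, b=2}
    }
}
```

In this simulation "b" occurs once in each split, so its partial sums are
dropped by the combiner although its total is 2. With a threshold of 0 both
variants return the same result, which would match the behaviour described
above. If this is the cause, removing the setCombinerClass call or using a
separate, non-filtering combiner class should be enough to test it.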

Maybe someone has already run into a similar problem or is more experienced
than I am.


Thank you,

Peter
