hadoop-hdfs-user mailing list archives

From Joe Mudd <Joe.M...@sas.com>
Subject Hadoop C++ Pipes Combiner.close() may be called before Combiner.reduce()
Date Wed, 23 Apr 2014 11:50:02 GMT
When using the Hadoop 2.3.0 distribution of Hadoop Pipes from C++, I found that if a Combiner
is specified, the Combiner's close() method is called before all of its reduce() calls have
been made.  This call pattern differs from the normal Reducer call pattern (init()...reduce()*...close()).

Shouldn't the Combiner call sequence be the same as the Reducer call sequence?

After reviewing HadoopPipes.cc, the change in the call pattern appears to be caused by two things:
the Combiner instance is wrapped by the CombineRunner writer, and TaskContextImpl::closeAll()
closes the writer after the reducer.  As a result, the Combiner's close() is called before
CombineRunner::splitAll(), invoked via CombineRunner::close(), has had a chance to call reduce()
on all of its collected key/values.

I believe the fix would be to delegate ownership of the Combiner to the CombineRunner instance.
The CombineRunner could ensure the Combiner has combined all data by calling the Combiner's
close() method from CombineRunner::close(), after the splitAll().  To complete the Combiner
cleanup, the CombineRunner destructor would also need to delete the Combiner instance.

Should this be submitted as a bug?

