hadoop-mapreduce-user mailing list archives

From Jason <urg...@gmail.com>
Subject Re: Is there any way I could keep both the Mapper and Reducer output in hdfs?
Date Tue, 03 May 2011 15:25:34 GMT
It is actually trivial to do using MultipleOutputs. In your mapper, you just need to emit your
key-value pairs to both the MultipleOutputs (MO) collector and the standard output context/collector.

Two things you should know about MO:
1. The early implementation has a serious (couple of orders of magnitude) performance bug.
2. Output files are not created for empty output data.
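A minimal sketch of what this looks like with the new-API MultipleOutputs
(org.apache.hadoop.mapreduce.lib.output). The class name, the key/value types,
and the named output "mapside" are illustrative, not from the original thread:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Hypothetical mapper that writes each record to the normal map-output path
// (consumed by the reducer) AND to a named output persisted in HDFS.
public class DualOutputMapper extends Mapper<LongWritable, Text, Text, Text> {

    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Text outKey = new Text(/* derive key from the record */);
        Text outValue = new Text(/* derive value from the record */);

        // 1) Standard path: this is what the reducer receives.
        context.write(outKey, outValue);

        // 2) Side path: this copy survives in HDFS under the job's output
        //    directory, in files prefixed with the named output "mapside".
        mos.write("mapside", outKey, outValue);
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        mos.close();  // flush and close the side files
    }
}
```

In the driver you would also have to register the named output before submitting,
e.g. `MultipleOutputs.addNamedOutput(job, "mapside", TextOutputFormat.class,
Text.class, Text.class);` — and per point 2 above, no "mapside" files appear for
map tasks that emit nothing.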

On May 2, 2011, at 11:09 PM, Stanley Xu <wenhao.xu@gmail.com> wrote:

> Dear all,
> We have a task that runs a map-reduce job multiple times to do some machine learning
> calculation. We first use a mapper to update the data iteratively, and then use the reducer
> to process the mapper's output and update a global matrix. After that, we need to re-use the
> output of the previous mapper (as a data source) and reducer (as a set of parameters) to
> re-run the map-reduce job for another round of learning.
> I am wondering whether there is any setting or API I could use to let Hadoop keep both the
> output of the mapper and the output of the reducer? Right now it looks like, if a job
> contains a reducer, it will delete the intermediate results generated by the mapper.
> Thanks.
> Stanley Xu
