hadoop-mapreduce-user mailing list archives

From Aaron Kimball <aa...@cloudera.com>
Subject Re: InputSplits in Mapper
Date Sun, 06 Jun 2010 23:53:49 GMT
Quite a reasonable workaround, to my mind, though having each mapper
recalculate the list of InputSplits may be costly.

Instead, consider this:

On your client, configure the Job instance.
Call new FooInputFormat().getSplits(theJob), save the resulting
List<InputSplit> in serialized form to a file, and inject that file into
the distributed cache. That way you only have to calculate the splits
twice (once for your file, once for the job itself).
Then start the job normally.
Have all mappers read the file from the distributed cache.
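The serialize-and-read-back step above can be sketched without a cluster.
The sketch below mimics Hadoop's Writable contract (write/readFields) for a
hypothetical FooSplit class; in a real job you would write your actual
InputSplit objects to an HDFS file, register it with
DistributedCache.addCacheFile on the client, and deserialize it in the
mapper's setup(). FooSplit and its fields are illustrative stand-ins, not
part of any Hadoop API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitCacheSketch {

  // Hypothetical stand-in for a custom split; real custom splits
  // implement org.apache.hadoop.io.Writable with the same two methods.
  static class FooSplit {
    String path;
    long start, length;

    FooSplit() {}
    FooSplit(String path, long start, long length) {
      this.path = path; this.start = start; this.length = length;
    }

    void write(DataOutput out) throws IOException {
      out.writeUTF(path);
      out.writeLong(start);
      out.writeLong(length);
    }

    void readFields(DataInput in) throws IOException {
      path = in.readUTF();
      start = in.readLong();
      length = in.readLong();
    }
  }

  // Client side: serialize the split list once. These bytes are what
  // you would write to the file placed in the distributed cache.
  static byte[] serialize(List<FooSplit> splits) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(buf);
    out.writeInt(splits.size());
    for (FooSplit s : splits) {
      s.write(out);
    }
    out.close();
    return buf.toByteArray();
  }

  // Mapper side: read the cached file back into a List<FooSplit>.
  static List<FooSplit> deserialize(byte[] data) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
    int n = in.readInt();
    List<FooSplit> splits = new ArrayList<FooSplit>(n);
    for (int i = 0; i < n; i++) {
      FooSplit s = new FooSplit();
      s.readFields(in);
      splits.add(s);
    }
    return splits;
  }

  public static void main(String[] args) throws IOException {
    List<FooSplit> splits = Arrays.asList(
        new FooSplit("/data/part-0", 0L, 64L),
        new FooSplit("/data/part-0", 64L, 64L));
    List<FooSplit> roundTrip = deserialize(serialize(splits));
    System.out.println(roundTrip.size() + " splits, first starts at "
        + roundTrip.get(0).start);
  }
}
```

The point of the round trip is that every mapper sees exactly the split
list the client computed, so the calculation happens once on the client
rather than once per map task.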

- Aaron

On Sun, Jun 6, 2010 at 4:01 AM, Torsten Curdt <tcurdt@vafer.org> wrote:

> > No, there isn't an api for that.
> Bummer.
> > The data is actually available in HDFS, but
> > it is considered an internal format and in particular has changed
> > substantially between 0.20 and 0.21/trunk.
> Nah ... I was after an API for this.
> Since I control the splits from a custom input format, I could
> probably just create a new instance of the InputFormat and call
>  List<InputSplit> getSplits(JobContext jobContext)
> on it. I would think it should give the same result as if called from
> the mapreduce framework.
> It's a workaround, though.
