hadoop-mapreduce-dev mailing list archives

From Chase Bradford <chase.bradf...@gmail.com>
Subject Total Order Partitioning setup code
Date Sat, 17 Jul 2010 20:44:01 GMT
A few weeks ago I needed total order sorting, and I couldn't
find any easy-to-use, general-purpose code for setting up the
partition file.

Attached is the code I settled on.  Would it be worthwhile to add it to
hadoop-mapred?

My goals were:
1) Extract all needed setup from an existing configuration
2) Run quickly
3) Be easy to use

It accepts a single Configuration, which should describe a ready-to-run job.
It then replaces the input format and reducer with custom classes.
The sampling input format uses the original input format's record
reader, but makes the input appear much smaller to the mapper
(1/1000th of its size by default).
Those inputs go through the original mapper, and that mapper's output
is sent to a single partition.
The original reducer is replaced with a simple sampling reducer that
emits only (mapred.reduce.tasks - 1) records, which serve as the split
points in the partition file.
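
As a rough illustration of the sampling reducer's job, here is a minimal
sketch in plain Java with no Hadoop dependencies. It is not the attached
code. The class and method names (SplitPointSketch, pickSplitPoints,
partitionFor) are hypothetical, and it assumes the sampled keys arrive
sorted, as they would at a single reducer, and that the chosen split
points are distinct.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not the attached code: choose split points from a
// sorted sample and use them to route keys to partitions.
class SplitPointSketch {

    /** Picks (numReduceTasks - 1) evenly spaced keys from a sorted, non-empty sample. */
    static <K extends Comparable<K>> List<K> pickSplitPoints(List<K> sortedSample,
                                                             int numReduceTasks) {
        if (sortedSample.isEmpty()) {
            throw new IllegalArgumentException("sample must not be empty");
        }
        List<K> splitPoints = new ArrayList<>();
        int n = sortedSample.size();
        for (int i = 1; i < numReduceTasks; i++) {
            // Index of the i-th of numReduceTasks equal-sized slices.
            int idx = (int) ((long) i * n / numReduceTasks);
            splitPoints.add(sortedSample.get(Math.min(idx, n - 1)));
        }
        return splitPoints;
    }

    /**
     * Returns the partition for a key, assuming distinct split points:
     * keys below the first split point go to partition 0, and a key equal
     * to a split point goes to the partition that starts at it.
     */
    static <K extends Comparable<K>> int partitionFor(K key, List<K> splitPoints) {
        int pos = Collections.binarySearch(splitPoints, key);
        // A negative result encodes the insertion point as -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }
}
```

With N reduce tasks this yields N - 1 split points, so every key falls
into exactly one of N ordered ranges, and concatenating the reducer
outputs in partition order gives a totally ordered result.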

It doesn't read the entire input set, so it should run quickly.  Also, the
mapper shouldn't have any side effects, like inserting into HBase,
since the prep job runs the original mapper.
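
The message does not say how the sampling input format makes the input
appear smaller. One plausible way, shown here purely as an assumption,
is to report a reduced length for each input split so that the original
record reader stops early. The names SampledSplitSketch, sampledLength
and DEFAULT_SAMPLE_DIVISOR are hypothetical; only the 1/1000th default
comes from the description above.

```java
// Hypothetical sketch, not the attached code: compute the length a
// sampling input format might report for each split.
class SampledSplitSketch {

    // Mirrors the "1/1000th by default" figure from the description.
    static final long DEFAULT_SAMPLE_DIVISOR = 1000L;

    /**
     * Returns the reduced length for a split. A non-empty split always
     * keeps at least one byte so that every split can contribute records.
     */
    static long sampledLength(long originalLength, long divisor) {
        if (divisor <= 0) {
            throw new IllegalArgumentException("divisor must be positive");
        }
        if (originalLength <= 0) {
            return 0L;
        }
        return Math.max(1L, originalLength / divisor);
    }
}
```

Under this assumption the prep job still launches one map task per
split, but each task reads only a small prefix of its split.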

I took this approach because I have many jobs where the mapper changes
the key type from what the InputFormat provides.  It also made the
sampler independent of the job's input type.

I don't have any example programs, but if there's interest, I can provide some.

Thanks,
Chase Bradford

--

“If in physics there's something you don't understand, you can always
hide behind the uncharted depths of nature. But if your program
doesn't work, there is no obstinate nature. If it doesn't work, you've
messed up.”

- Edsger Dijkstra
