incubator-crunch-user mailing list archives

From Dave Beech <>
Subject Question about mapreduce job planner
Date Tue, 15 Jan 2013 11:41:42 GMT
Hi all,

I've written a Crunch pipeline and have a question about the resulting
mapreduce jobs. Please see the steps below:

1) Load text data A and convert to Avro -> A'
2) Load text data B and convert to Avro -> B'
3) Union A' and B' -> C
4) Filter C -> D

5) Write D to HDFS

6a) Use a DoFn to extract strings from D -> E
6b) Aggregate E (count strings) -> F
6c) Convert F to HBase Puts -> G
6d) Write G to HBase
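For reference, a rough sketch of the pipeline code (class names like MyRecord, ParseFn, MyFilterFn, ExtractStringsFn and ToPutsFn are placeholders for my actual classes, and imports from org.apache.crunch are omitted):

```java
Pipeline pipeline = new MRPipeline(Driver.class, new Configuration());

// 1) / 2) load the text inputs and convert each to Avro
PCollection<MyRecord> a = pipeline.readTextFile("/input/a")
    .parallelDo(new ParseFn(), Avros.records(MyRecord.class));
PCollection<MyRecord> b = pipeline.readTextFile("/input/b")
    .parallelDo(new ParseFn(), Avros.records(MyRecord.class));

// 3) union, 4) filter
PCollection<MyRecord> d = a.union(b).filter(new MyFilterFn());

// 5) write D out to HDFS as Avro
d.write(To.avroFile("/output/d"));

// 6a-6d) extract strings, count them, convert to Puts, write to HBase
PTable<String, Long> counts = d
    .parallelDo(new ExtractStringsFn(), Avros.strings())
    .count();
counts.parallelDo(new ToPutsFn(), Writables.writables(Put.class))
    .write(new HBaseTarget("my_table"));

pipeline.done();
```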

Running my code generates two mapreduce jobs which run in parallel:
job A) runs steps 1, 2, 3, 4, 5
job B) runs steps 1, 2, 3, 4, 6abcd

Without knowing much about the planning algorithm, what I expected to see
was more like:
job A) runs steps 1, 2, 3, 4, 5
job B) runs after A, reads back the data written in step 5, and does steps
6a-6d
That way the jobs would run sequentially rather than in parallel, but would
avoid reading the full raw input data and running the conversion/filtering
logic twice.

Is there a way I should order my pipeline calls, or a hint I can give to the
mapreduce compiler, to get the jobs planned this way? Does the scale factor
have any influence on this?
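(By "scale factor" I mean the DoFn.scaleFactor() override, which as I understand it tells the planner roughly how large a function's output is relative to its input. E.g., on a hypothetical extraction DoFn:

```java
public class ExtractStringsFn extends DoFn<MyRecord, String> {
  @Override
  public void process(MyRecord input, Emitter<String> emitter) {
    emitter.emit(input.getName());   // hypothetical accessor
  }

  @Override
  public float scaleFactor() {
    // hint to the planner: output expected to be much smaller than input
    return 0.1f;
  }
}
```

I'm not sure whether this affects how the plan is split into jobs, or only other planning decisions.)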

