incubator-crunch-dev mailing list archives

From "Josh Wills (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-50) Map-only parallelDos with multiple outputs should be fused
Date Tue, 28 Aug 2012 01:08:08 GMT


Josh Wills commented on CRUNCH-50:

Love it-- we'll integrate the test when we integrate CRUNCH-34. Thanks Ryan!
> Map-only parallelDos with multiple outputs should be fused
> ----------------------------------------------------------
>                 Key: CRUNCH-50
>                 URL:
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Ryan Brush
>         Attachments: MULTIOUTPUT_FUSE_TEST.patch
> I'm not sure if this is a bug or just an optimization yet to be implemented, but sibling
parallelDo operations that have separate outputs are not being fused into a single Map operation.
 (The original FlumeJava paper suggests they should be, and it would be a nice optimization
to have here as well.)  Instead, each parallelDo results in a separate Map-only job that (redundantly)
scans the input source of data.
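To make the cost concrete, here is a minimal sketch of the difference between running sibling operations as separate passes and fusing them into one pass. It is a toy model, not the Crunch API or its planner: `SiblingFusionSketch`, `CountingSource`, `runUnfused`, and `runFused` are hypothetical names introduced only for illustration, and plain `Function` objects stand in for the parallelDo functions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Toy model only: this is not the Crunch API or its planner. */
final class SiblingFusionSketch {

    /** Hypothetical input source that counts how many times it is fully scanned. */
    static final class CountingSource {
        private final List<String> records;
        private int scans = 0;

        CountingSource(List<String> records) {
            this.records = records;
        }

        List<String> scan() {
            scans++;
            return records;
        }

        int scanCount() {
            return scans;
        }
    }

    /** Unfused: each sibling function gets its own pass, so the source is scanned once per function. */
    static List<List<String>> runUnfused(CountingSource source, List<Function<String, String>> fns) {
        List<List<String>> outputs = new ArrayList<>();
        for (Function<String, String> fn : fns) {
            List<String> out = new ArrayList<>();
            for (String record : source.scan()) {
                out.add(fn.apply(record));
            }
            outputs.add(out);
        }
        return outputs;
    }

    /** Fused: a single pass feeds every sibling function, each writing to its own output. */
    static List<List<String>> runFused(CountingSource source, List<Function<String, String>> fns) {
        List<List<String>> outputs = new ArrayList<>();
        for (int i = 0; i < fns.size(); i++) {
            outputs.add(new ArrayList<>());
        }
        for (String record : source.scan()) {
            for (int i = 0; i < fns.size(); i++) {
                outputs.get(i).add(fns.get(i).apply(record));
            }
        }
        return outputs;
    }

    public static void main(String[] args) {
        Function<String, String> upper = s -> s.toUpperCase();
        Function<String, String> lower = s -> s.toLowerCase();
        List<Function<String, String>> fns = Arrays.asList(upper, lower);
        List<String> records = Arrays.asList("Ab", "cD");

        CountingSource first = new CountingSource(records);
        CountingSource second = new CountingSource(records);
        runUnfused(first, fns);
        runFused(second, fns);

        System.out.println("unfused scans: " + first.scanCount());
        System.out.println("fused scans:   " + second.scanCount());
    }
}
```

Both variants produce the same per-function outputs; only the number of scans over the input differs, which is the redundancy the report describes.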
> This can be seen in the current MultipleOutputIT integration test.  Notice that the logs below,
from running one of those tests, show multiple jobs being run over the same input.
> 8414 [Thread-38] INFO  - Running job "org.apache.crunch.MultipleOutputIT:
> 8415 [Thread-38] INFO  - Job status available
at: http://localhost:8080/
> 8417 [Thread-38] WARN  org.apache.hadoop.mapred.JobClient  - Use GenericOptionsParser
for parsing the arguments. Applications should implement Tool for the same.
> 8497 [Thread-38] WARN  org.apache.hadoop.mapred.JobClient  - No job jar file set.  User
classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> 8532 [Thread-38] INFO  - Running job "org.apache.crunch.MultipleOutputIT:
> 8532 [Thread-38] INFO  - Job status available
at: http://localhost:8080/
> I was going to take a stab at a patch for this, but noticed that some major refactoring in
this space is on deck, so it might be best to address this after CRUNCH-34.
> As an aside, it wasn't clear how to write a good integration test to expose this functionality.
 Would simply counting the stage results and ensuring we have the expected number for a simple
job be the best way?
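The counting idea in the question above can be sketched as follows. This is only an illustration of the shape of such a check: `JobCountSketch`, `MapOnlyOp`, `planUnfused`, and `planFused` are hypothetical names, and the grouping rule (one job per distinct input source) is an assumption about what a fused plan would look like, not a description of the Crunch planner. In a real integration test the count would instead come from whatever the pipeline reports after running.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy planner for illustration; names are hypothetical and unrelated to Crunch internals. */
final class JobCountSketch {

    /** A map-only operation: reads one source and writes one target. */
    static final class MapOnlyOp {
        final String source;
        final String target;

        MapOnlyOp(String source, String target) {
            this.source = source;
            this.target = target;
        }
    }

    /** One job per operation, which is the behaviour described in the report. */
    static int planUnfused(List<MapOnlyOp> ops) {
        return ops.size();
    }

    /** Fused plan: operations that share a source become one job with several targets. */
    static Map<String, List<String>> planFused(List<MapOnlyOp> ops) {
        Map<String, List<String>> jobs = new LinkedHashMap<>();
        for (MapOnlyOp op : ops) {
            jobs.computeIfAbsent(op.source, k -> new ArrayList<>()).add(op.target);
        }
        return jobs;
    }

    public static void main(String[] args) {
        List<MapOnlyOp> ops = Arrays.asList(
                new MapOnlyOp("input.txt", "out1"),
                new MapOnlyOp("input.txt", "out2"));

        int expectedJobs = 1; // two sibling outputs over one input should need a single job
        int actualJobs = planFused(ops).size();
        if (actualJobs != expectedJobs) {
            throw new AssertionError("expected " + expectedJobs + " job(s) but planned " + actualJobs);
        }
        System.out.println("unfused jobs: " + planUnfused(ops) + ", fused jobs: " + actualJobs);
    }
}
```

A check of this kind fails as soon as a sibling operation is planned as its own job, so it would catch a regression in fusion without inspecting log output.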
