incubator-crunch-dev mailing list archives

From "Josh Wills (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-50) Map-only parallelDos with multiple outputs should be fused
Date Tue, 28 Aug 2012 01:08:08 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442864#comment-13442864 ]

Josh Wills commented on CRUNCH-50:
----------------------------------

Love it-- we'll integrate the test when we integrate CRUNCH-34. Thanks Ryan!
                
> Map-only parallelDos with multiple outputs should be fused
> ----------------------------------------------------------
>
>                 Key: CRUNCH-50
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-50
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Ryan Brush
>         Attachments: MULTIOUTPUT_FUSE_TEST.patch
>
>
> I'm not sure if this is a bug or just an optimization yet to be implemented, but sibling
> parallelDo operations that have separate outputs are not being fused into a single Map
> operation. (The original FlumeJava paper suggests they should be, and it would be a nice
> optimization to have here as well.) Instead, each parallelDo results in a separate Map-only
> job that (redundantly) scans the input source of data.
> This can be seen in the current MultipleOutputIT integration test. Notice that the logs
> below, from running one of those tests, show the same input being scanned in multiple jobs.
> 8414 [Thread-38] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  - Running job "org.apache.crunch.MultipleOutputIT: Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/letters.txt)+even+asText+Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/even)"
> 8415 [Thread-38] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  - Job status available at: http://localhost:8080/
> 8417 [Thread-38] WARN  org.apache.hadoop.mapred.JobClient  - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 8497 [Thread-38] WARN  org.apache.hadoop.mapred.JobClient  - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> 8532 [Thread-38] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  - Running job "org.apache.crunch.MultipleOutputIT: Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/letters.txt)+odd+asText+Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/odd)"
> 8532 [Thread-38] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  - Job status available at: http://localhost:8080/
> I was going to take a stab at a patch for this, but noticed some major refactoring in
> this space is on deck as part of CRUNCH-34... so it might be best to address this after
> CRUNCH-34 lands.
> As an aside, it wasn't clear how to write a good integration test to expose this
> functionality. Would simply counting the stage results and ensuring we have the expected
> number for a simple job be the best way?
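
For illustration only, here is a minimal plain-Java sketch of the behaviour described above and of the "count the jobs" style of check the reporter asks about. It does not use the Crunch API and is not the planner's implementation: CountingSource, runUnfused, runFused, and the even/odd predicates are all hypothetical names invented for this example. Each full scan of the source stands in for one Map-only job, so the check reduces to comparing scan counts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class SiblingFusionSketch {

    /** Hypothetical stand-in for an input source; counts how often it is fully scanned. */
    static final class CountingSource {
        private final List<String> records;
        private int scans = 0;

        CountingSource(List<String> records) {
            this.records = records;
        }

        List<String> scan() {
            scans++;
            return records;
        }

        int scanCount() {
            return scans;
        }
    }

    /** Unfused plan: every sibling operation makes its own pass ("job") over the source. */
    static List<List<String>> runUnfused(CountingSource source, List<Predicate<String>> siblings) {
        List<List<String>> outputs = new ArrayList<>();
        for (Predicate<String> sibling : siblings) {
            List<String> out = new ArrayList<>();
            for (String record : source.scan()) {
                if (sibling.test(record)) {
                    out.add(record);
                }
            }
            outputs.add(out);
        }
        return outputs;
    }

    /** Fused plan: a single pass; each record is offered to every sibling's output. */
    static List<List<String>> runFused(CountingSource source, List<Predicate<String>> siblings) {
        List<List<String>> outputs = new ArrayList<>();
        for (int i = 0; i < siblings.size(); i++) {
            outputs.add(new ArrayList<>());
        }
        for (String record : source.scan()) {
            for (int i = 0; i < siblings.size(); i++) {
                if (siblings.get(i).test(record)) {
                    outputs.get(i).add(record);
                }
            }
        }
        return outputs;
    }

    /** Two sibling operations with separate outputs (hypothetical "even" and "odd" filters). */
    static List<Predicate<String>> evenAndOdd() {
        List<Predicate<String>> siblings = new ArrayList<>();
        siblings.add(s -> s.length() % 2 == 0); // "even": even-length records
        siblings.add(s -> s.length() % 2 == 1); // "odd": odd-length records
        return siblings;
    }

    public static void main(String[] args) {
        List<String> letters = Arrays.asList("a", "bb", "c", "dd");

        CountingSource unfusedSource = new CountingSource(letters);
        runUnfused(unfusedSource, evenAndOdd());

        CountingSource fusedSource = new CountingSource(letters);
        runFused(fusedSource, evenAndOdd());

        System.out.println("unfused scans: " + unfusedSource.scanCount());
        System.out.println("fused scans:   " + fusedSource.scanCount());
    }
}
```

Both plans produce the same two outputs; only the number of passes over the input differs. An integration test along the lines suggested in the issue could assert on the analogous count for a real pipeline, assuming the pipeline result exposes one entry per executed job.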

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
