Date: Fri, 24 Aug 2012 14:44:42 +1100 (NCT)
From: "Ryan Brush (JIRA)"
To: crunch-dev@incubator.apache.org
Message-ID: <611193050.9210.1345779882464.JavaMail.jiratomcat@arcas>
In-Reply-To: <1177484042.9201.1345779762655.JavaMail.jiratomcat@arcas>
Subject: [jira] [Updated] (CRUNCH-50) Map-only parallelDos with multiple outputs should be fused

[ https://issues.apache.org/jira/browse/CRUNCH-50?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Brush updated CRUNCH-50:
-----------------------------

Description:
I'm not sure if this is a bug or just an optimization yet to be implemented, but sibling parallelDo operations that have separate outputs are not being fused into a single Map operation.
(The original FlumeJava paper suggests they should be, and it would be a nice optimization to have here as well.) Instead, each parallelDo results in a separate Map-only job that (redundantly) scans the input source of data.
This can be seen in the current MultipleOutputIT integration test. Notice the logs below from running one of those tests scans the same input in multiple jobs.
8414 [Thread-38] INFO org.apache.crunch.impl.mr.exec.CrunchJob - Running job "org.apache.crunch.MultipleOutputIT: Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/letters.txt)+even+asText+Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/even)"
8415 [Thread-38] INFO org.apache.crunch.impl.mr.exec.CrunchJob - Job status available at: http://localhost:8080/
8417 [Thread-38] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
8497 [Thread-38] WARN org.apache.hadoop.mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
8532 [Thread-38] INFO org.apache.crunch.impl.mr.exec.CrunchJob - Running job "org.apache.crunch.MultipleOutputIT: Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/letters.txt)+odd+asText+Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/odd)"
8532 [Thread-38] INFO org.apache.crunch.impl.mr.exec.CrunchJob - Job status available at: http://localhost:8080/
I was going to take a stab at a patch for this, but noticed some major refactoring in this space is on deck as part of CRUNCH-34...so it might be best to address this after CRUNCH-34 lands.
As an aside, it wasn't clear how to write a good integration test to expose this functionality. Would simply counting the stage results and ensuring we have the expected number for a simple job be the best way?

was:
I'm not sure if this is a bug or just an optimization yet to be implemented, but sibling parallelDo operations that have separate outputs are not being fused into a single Map operation. (The original FlumeJava paper suggest they should be, and it would be a nice optimization to have here as well.) Instead, each parallelDo results in a separate Map-only job that (redundantly) scans the input source of data.
This can be seen in the current MultipleOutputIT integration test. Notice the logs below from running one of those tests scans the same input in multiple jobs.
8414 [Thread-38] INFO org.apache.crunch.impl.mr.exec.CrunchJob - Running job "org.apache.crunch.MultipleOutputIT: Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/letters.txt)+even+asText+Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/even)"
8415 [Thread-38] INFO org.apache.crunch.impl.mr.exec.CrunchJob - Job status available at: http://localhost:8080/
8417 [Thread-38] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
8497 [Thread-38] WARN org.apache.hadoop.mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
8532 [Thread-38] INFO org.apache.crunch.impl.mr.exec.CrunchJob - Running job "org.apache.crunch.MultipleOutputIT: Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/letters.txt)+odd+asText+Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/odd)"
8532 [Thread-38] INFO org.apache.crunch.impl.mr.exec.CrunchJob - Job status available at: http://localhost:8080/
I was going to take a stab at a patch for this, but noticed some major refactoring in this space is on deck as part of CRUNCH-34...so it might be best to address this after CRUNCH-34 lands.
As an aside, it wasn't clear how to write a good integration test to expose this functionality. Would simply counting the stage results and ensuring we have the expected number for a simple job be the best way?

> Map-only parallelDos with multiple outputs should be fused
> ----------------------------------------------------------
>
> Key: CRUNCH-50
> URL: https://issues.apache.org/jira/browse/CRUNCH-50
> Project: Crunch
> Issue Type: Improvement
> Reporter: Ryan Brush
>
> I'm not sure if this is a bug or just an optimization yet to be implemented, but sibling parallelDo operations that have separate outputs are not being fused into a single Map operation. (The original FlumeJava paper suggests they should be, and it would be a nice optimization to have here as well.) Instead, each parallelDo results in a separate Map-only job that (redundantly) scans the input source of data.
> This can be seen in the current MultipleOutputIT integration test. Notice the logs below from running one of those tests scans the same input in multiple jobs.
> 8414 [Thread-38] INFO org.apache.crunch.impl.mr.exec.CrunchJob - Running job "org.apache.crunch.MultipleOutputIT: Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/letters.txt)+even+asText+Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/even)"
> 8415 [Thread-38] INFO org.apache.crunch.impl.mr.exec.CrunchJob - Job status available at: http://localhost:8080/
> 8417 [Thread-38] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 8497 [Thread-38] WARN org.apache.hadoop.mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> 8532 [Thread-38] INFO org.apache.crunch.impl.mr.exec.CrunchJob - Running job "org.apache.crunch.MultipleOutputIT: Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/letters.txt)+odd+asText+Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/odd)"
> 8532 [Thread-38] INFO org.apache.crunch.impl.mr.exec.CrunchJob - Job status available at: http://localhost:8080/
> I was going to take a stab at a patch for this, but noticed some major refactoring in this space is on deck as part of CRUNCH-34...so it might be best to address this after CRUNCH-34 lands.
> As an aside, it wasn't clear how to write a good integration test to expose this functionality. Would simply counting the stage results and ensuring we have the expected number for a simple job be the best way?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira