crunch-dev mailing list archives

From "Ron (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CRUNCH-305) Multiuse between parellelDos which sharing the same input
Date Tue, 26 Nov 2013 08:15:35 GMT
Ron created CRUNCH-305:
--------------------------

             Summary: Multiuse between parellelDos which sharing the same input
                 Key: CRUNCH-305
                 URL: https://issues.apache.org/jira/browse/CRUNCH-305
             Project: Crunch
          Issue Type: Wish
            Reporter: Ron


  When I started using Crunch, many of my jobs followed this pattern: I have five different parallelDo
functions, and all of them work on the same input. Currently, I read the input once using
"pipeline.readTextFile()" and then apply each parallelDo function to the resulting PCollection. However,
I find that Crunch breaks my plan into five separate MapReduce jobs, each of which reads the input
on its own, so the input is read five times. Based on the FlumeJava paper, from which Crunch
originates, I suggest an optimization in which the input is read only once and the five
parallelDo functions are then applied to it. Since the input is large and the I/O cost is
high, this optimization could help a lot for Crunch jobs that follow a pattern similar
to mine.
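
To make the request concrete, here is a minimal plain-Java sketch of the difference between the two plans. It is not Crunch code and does not reflect how the Crunch planner works; CountingInput, separatePasses, sharedPass and fiveFunctions are hypothetical names used only for illustration, and an in-memory list stands in for the large input file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class SharedScanSketch {

    /** Hypothetical stand-in for the input file; counts how often it is scanned. */
    static final class CountingInput {
        private final List<String> lines;
        int scans = 0;

        CountingInput(List<String> lines) {
            this.lines = lines;
        }

        List<String> scan() {
            scans++;            // each call models one full read of the input
            return lines;
        }
    }

    /** Five arbitrary per-record functions, standing in for the five parallelDo functions. */
    static List<Function<String, String>> fiveFunctions() {
        return Arrays.asList(
                s -> s.toUpperCase(),
                s -> s.trim(),
                s -> s + "!",
                s -> s.replace(' ', '_'),
                s -> String.valueOf(s.length()));
    }

    /** Plan as reported: one full scan of the input per function. */
    static List<List<String>> separatePasses(CountingInput in,
                                             List<Function<String, String>> fns) {
        List<List<String>> outputs = new ArrayList<>();
        for (Function<String, String> fn : fns) {
            List<String> out = new ArrayList<>();
            for (String line : in.scan()) {
                out.add(fn.apply(line));
            }
            outputs.add(out);
        }
        return outputs;
    }

    /** Plan as requested: one shared scan, each record handed to every function. */
    static List<List<String>> sharedPass(CountingInput in,
                                         List<Function<String, String>> fns) {
        List<List<String>> outputs = new ArrayList<>();
        for (int i = 0; i < fns.size(); i++) {
            outputs.add(new ArrayList<>());
        }
        for (String line : in.scan()) {
            for (int i = 0; i < fns.size(); i++) {
                outputs.get(i).add(fns.get(i).apply(line));
            }
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("first line", "second line");

        CountingInput a = new CountingInput(data);
        List<List<String>> separate = separatePasses(a, fiveFunctions());

        CountingInput b = new CountingInput(data);
        List<List<String>> shared = sharedPass(b, fiveFunctions());

        System.out.println("scans with separate passes: " + a.scans);
        System.out.println("scans with shared pass:     " + b.scans);
        System.out.println("same outputs: " + separate.equals(shared));
    }
}
```

Both methods produce the same five outputs; the only difference is how many times the input is scanned, which is the cost this issue is about.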



--
This message was sent by Atlassian JIRA
(v6.1#6144)
