crunch-dev mailing list archives

From "Ron (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-305) Multiuse between parellelDos which sharing the same input
Date Tue, 26 Nov 2013 08:40:35 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832407#comment-13832407
] 

Ron commented on CRUNCH-305:
----------------------------

I carefully read the Crunch future-work page at http://crunch.apache.org/future-work.html
and found that this is already listed there as future work: combining related groupByKey
operations into a single MR job, as FlumeJava does.

> Multiuse between parellelDos which sharing the same input
> ---------------------------------------------------------
>
>                 Key: CRUNCH-305
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-305
>             Project: Crunch
>          Issue Type: Wish
>            Reporter: Ron
>
>   When I started using Crunch, many of my jobs followed this pattern: I have five different
parallelDo functions, and all of them work on the same input. Currently, I read the input first
with "pipeline.readTextFile()" and then apply each parallelDo function to the resulting PCollection.
However, I find that Crunch breaks my plan into five separate MR jobs, each of which reads
the input and runs its own MR, so the input is read five times. Based on the FlumeJava
paper, from which Crunch originates, I suggest an optimization in which
the input is read only once and the five parallelDo functions are then applied to it. Since
the input is large and the I/O cost is high, this optimization could help a lot in Crunch
jobs that follow patterns similar to mine.
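
To make the requested behavior concrete, here is a minimal plain-Java sketch of the difference between scanning the input once per function and scanning it once for all functions. It does not use Crunch or reflect how the Crunch planner works; the class name SinglePassFanOut, the methods scanPerFunction and singleScan, and the recordsRead counter are all hypothetical names chosen for illustration, and an in-memory list stands in for the large input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class SinglePassFanOut {

    // Number of input records read so far; a stand-in for I/O cost (assumption).
    public static long recordsRead = 0;

    // One scan per function: roughly the situation described above, where each
    // function ends up in its own job and the input is re-read every time.
    public static Map<String, List<String>> scanPerFunction(
            List<String> input, Map<String, Function<String, String>> fns) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, Function<String, String>> e : fns.entrySet()) {
            List<String> results = new ArrayList<>();
            for (String record : input) {
                recordsRead++;
                results.add(e.getValue().apply(record));
            }
            out.put(e.getKey(), results);
        }
        return out;
    }

    // One shared scan: each record is read once and handed to every function.
    public static Map<String, List<String>> singleScan(
            List<String> input, Map<String, Function<String, String>> fns) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (String name : fns.keySet()) {
            out.put(name, new ArrayList<>());
        }
        for (String record : input) {
            recordsRead++;
            for (Map.Entry<String, Function<String, String>> e : fns.entrySet()) {
                out.get(e.getKey()).add(e.getValue().apply(record));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a b", "c d e");
        Map<String, Function<String, String>> fns = new LinkedHashMap<>();
        fns.put("upper", String::toUpperCase);
        fns.put("length", s -> String.valueOf(s.length()));
        fns.put("firstWord", s -> s.split(" ")[0]);

        recordsRead = 0;
        scanPerFunction(input, fns);
        System.out.println("scan per function: " + recordsRead + " records read");

        recordsRead = 0;
        singleScan(input, fns);
        System.out.println("single scan: " + recordsRead + " records read");
    }
}
```

With three functions and two records, the per-function approach reads 6 records and the shared scan reads 2, while both produce the same per-function outputs. The saving grows with the number of functions applied to the same input.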



--
This message was sent by Atlassian JIRA
(v6.1#6144)
