incubator-crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-144) Ability to re-use PCollections after a write without having to recompute them
Date Fri, 25 Jan 2013 14:33:13 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13562730#comment-13562730 ]

Gabriel Reid commented on CRUNCH-144:
-------------------------------------

Looks fine to me, and I wouldn't say it's as ugly as you made it out to be (and sorry for
not taking a look earlier).

One small (cosmetic) change that I would make: on line 166 of MRPipeline there's an instanceof
check against Source followed by a cast to SourceTarget. It works out all the same, but I think
it would be more readable if the instanceof check were against SourceTarget.
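
To illustrate the suggestion, here is a minimal sketch of the two forms. It is not the MRPipeline code: the interfaces below are simplified, hypothetical stand-ins for Source, Target and SourceTarget, and the class and method names (FileSourceTarget, WriteOnlyTarget, InstanceofSketch, label) are invented for the example.

```java
// Hypothetical, simplified stand-ins for the types named in the comment;
// the real Crunch interfaces are generic and carry more methods.
interface Source { }

interface Target {
    String label();
}

interface SourceTarget extends Source, Target { }

// A target that can also be read back as a source.
final class FileSourceTarget implements SourceTarget {
    private final String path;

    FileSourceTarget(String path) {
        this.path = path;
    }

    @Override
    public String label() {
        return path;
    }
}

// A target that can only be written to.
final class WriteOnlyTarget implements Target {
    @Override
    public String label() {
        return "write-only";
    }
}

final class InstanceofSketch {
    // Form described in the comment: the check names one type, the cast another.
    static String checkSourceThenCast(Target target) {
        if (target instanceof Source) {
            SourceTarget st = (SourceTarget) target;
            return "readable:" + st.label();
        }
        return "not readable";
    }

    // Suggested form: the check and the cast name the same type.
    static String checkSourceTarget(Target target) {
        if (target instanceof SourceTarget) {
            SourceTarget st = (SourceTarget) target;
            return "readable:" + st.label();
        }
        return "not readable";
    }

    public static void main(String[] args) {
        System.out.println(checkSourceThenCast(new FileSourceTarget("/tmp/out")));
        System.out.println(checkSourceTarget(new FileSourceTarget("/tmp/out")));
        System.out.println(checkSourceTarget(new WriteOnlyTarget()));
    }
}
```

In this sketch both forms return the same result for the inputs shown; the second simply makes the check match the cast that follows it.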


                
> Ability to re-use PCollections after a write without having to recompute them
> -----------------------------------------------------------------------------
>
>                 Key: CRUNCH-144
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-144
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.4.0
>            Reporter: Dave Beech
>            Assignee: Josh Wills
>         Attachments: CRUNCH-144b.patch, CRUNCH-144.patch
>
>
> I have a pipeline that consists of several stages to process and filter a dataset. I would like to persist this dataset to HDFS and then perform further computation on it.
> Example:
> 1) Load text data A and convert to Avro -> A'
> 2) Load text data B and convert to Avro -> B'
> 3) Union A' and B' -> C
> 4) Filter C -> D
> 5) Write D to HDFS
> 6a) Use DoFn to extract strings from D -> E
> 6b) Aggregate E (count strings) -> F
> 6c) Convert F to HBase puts -> G
> 6d) Write G to HBase
> Running this pipeline code generates two MapReduce jobs which run in parallel:
> job A) runs steps 1, 2, 3, 4, 5
> job B) runs steps 1, 2, 3, 4, 6abcd
> If a "pipeline.run()" call is included after step 5, the same two jobs are run, but sequentially.
> What I would like is to be able to hold on to the PCollection reference to "D", so that steps 6* can be run without going back to the start and redoing all the work needed to generate it.
> -- 
> Ref to original discussion on crunch-user: http://mail-archives.apache.org/mod_mbox/incubator-crunch-user/201301.mbox/%3CCAH29n6MORejkxD%2ByRycRw40vxf4GruJ8m46AMjx_RGd6DvDUQA%40mail.gmail.com%3E
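
The difference between the reported behaviour and the requested one can be modelled in a few lines of plain Java. This is a toy sketch only: it does not use the Crunch API or reflect how its planner works, and every name in it (UpstreamStage, Persisted, ReuseSketch) is hypothetical. It counts how many times the work of steps 1 to 4 is executed when two downstream consumers read "D" directly, versus when they read a copy that was stored after the first computation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Toy stand-in for steps 1-4; counts how often the upstream work is executed.
final class UpstreamStage implements Supplier<List<String>> {
    int runs = 0;

    @Override
    public List<String> get() {
        runs++;
        List<String> c = new ArrayList<>(Arrays.asList("x", "", "y")); // A'
        c.addAll(Arrays.asList("y", "z", ""));                         // union with B' -> C
        c.removeIf(String::isEmpty);                                   // filter C -> D
        return c;
    }
}

// Computes the upstream result once and hands the stored copy to later readers.
final class Persisted<T> implements Supplier<T> {
    private final Supplier<T> upstream;
    private T value;
    private boolean computed = false;

    Persisted(Supplier<T> upstream) {
        this.upstream = upstream;
    }

    @Override
    public T get() {
        if (!computed) {
            value = upstream.get();
            computed = true;
        }
        return value;
    }
}

final class ReuseSketch {
    // Two consumers of D: a stand-in for step 5 (the write) and one for
    // steps 6a-6b (here reduced to counting distinct strings).
    static long runBothBranches(Supplier<List<String>> d) {
        int written = d.get().size();                        // "write D"
        long distinct = d.get().stream().distinct().count(); // "aggregate E"
        return written >= distinct ? distinct : -1;
    }

    public static void main(String[] args) {
        UpstreamStage direct = new UpstreamStage();
        runBothBranches(direct);
        System.out.println("upstream runs without reuse: " + direct.runs);

        UpstreamStage stored = new UpstreamStage();
        runBothBranches(new Persisted<>(stored));
        System.out.println("upstream runs with reuse: " + stored.runs);
    }
}
```

In this model, reading "D" directly repeats the upstream work once per consumer, while wrapping it in Persisted runs it a single time, which is the behaviour the issue asks for.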


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
