crunch-dev mailing list archives

From "Josh Wills (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-144) Ability to re-use PCollections after a write without having to recompute them
Date Thu, 17 Jan 2013 14:56:13 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556238#comment-13556238 ]

Josh Wills commented on CRUNCH-144:
-----------------------------------

An update on this: it caused some issues with the AvroPipelineIT test. The optimizer doesn't
realize that it cannot read the Avro objects back from the text file that it writes out during
the first pipeline run. I need to add stricter rules for indicating when some text output can
be consumed, but it looks sort of ugly right now.
                
> Ability to re-use PCollections after a write without having to recompute them
> -----------------------------------------------------------------------------
>
>                 Key: CRUNCH-144
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-144
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.4.0
>            Reporter: Dave Beech
>            Assignee: Josh Wills
>         Attachments: CRUNCH-144.patch
>
>
> I have a pipeline that consists of several stages to process and filter a dataset. I
> would like to persist this dataset to HDFS and then perform further computation on it.
> Example:
> 1. ) Load text data A and convert to avro -> A'
> 2. ) Load text data B and convert to avro -> B'
> 3. ) Union A' and B' -> C
> 4. ) Filter C -> D
> 5. ) Write D to HDFS
> 6a. ) Use DoFn to extract strings from D -> E
> 6b. ) Aggregate E ( count strings ) -> F
> 6c. ) Convert F to HBase puts -> G
> 6d. ) Write G to HBase
> Running this pipeline code generates two mapreduce jobs which run in parallel:
> job A) runs steps 1, 2, 3, 4, 5
> job B) runs steps 1, 2, 3, 4, 6abcd
> If a "pipeline.run()" call is included after step 5, the same two jobs are run but sequentially.
>
> What I would like is to be able to hold on to the PCollection reference to "D", so that
> steps 6* can be run without going back to the start and re-doing all the work needed to generate
> it.
> -- 
> Ref to original discussion on crunch-user: http://mail-archives.apache.org/mod_mbox/incubator-crunch-user/201301.mbox/%3CCAH29n6MORejkxD%2ByRycRw40vxf4GruJ8m46AMjx_RGd6DvDUQA%40mail.gmail.com%3E
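
The recomputation described in the issue can be illustrated with a toy model in plain Java.
This is a sketch only: it is not Crunch code and not the CRUNCH-144 patch. The class
RecomputeSketch and the methods buildD, writeD, countStrings and materialized are hypothetical
names invented for illustration, and the sample strings are made up. buildD stands in for
steps 1-4, writeD for step 5, and countStrings for steps 6a-6b.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Supplier;

// Toy model only: none of these types or methods are Crunch classes.
final class RecomputeSketch {

    // Counts how often the expensive upstream work (steps 1-4) executes.
    static int upstreamRuns = 0;

    // Steps 1-4: load A and B, union them, then filter the union to get D.
    static List<String> buildD() {
        upstreamRuns++;
        List<String> c = new ArrayList<>(List.of("apple", "bad", "pear")); // A'
        c.addAll(List.of("apple", "bad", "fig"));                          // union with B' -> C
        c.removeIf(s -> s.equals("bad"));                                  // filter C -> D
        return c;
    }

    // Step 5: stand-in for writing D to HDFS; returns the number of records "written".
    static int writeD(Supplier<List<String>> d) {
        return d.get().size();
    }

    // Steps 6a-6b: count how often each string occurs in D.
    static Map<String, Integer> countStrings(Supplier<List<String>> d) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String s : d.get()) {
            counts.merge(s, 1, Integer::sum);
        }
        return counts;
    }

    // Wraps a supplier so that its value is computed once and reused afterwards.
    static <T> Supplier<T> materialized(Supplier<T> source) {
        List<T> cache = new ArrayList<>(1);
        return () -> {
            if (cache.isEmpty()) {
                cache.add(source.get());
            }
            return cache.get(0);
        };
    }

    public static void main(String[] args) {
        // Each consumer pulls D from scratch, like job A and job B in the issue.
        upstreamRuns = 0;
        Supplier<List<String>> lazyD = RecomputeSketch::buildD;
        writeD(lazyD);
        countStrings(lazyD);
        System.out.println("upstream runs without reuse: " + upstreamRuns);

        // Holding on to a stored D lets the second consumer skip steps 1-4.
        upstreamRuns = 0;
        Supplier<List<String>> storedD = materialized(RecomputeSketch::buildD);
        writeD(storedD);
        System.out.println("counts: " + countStrings(storedD));
        System.out.println("upstream runs with reuse: " + upstreamRuns);
    }
}
```

In this model the upstream work runs twice when both consumers start from the sources and once
when the second consumer reads the stored result, which is the behaviour the issue asks for.
How a real planner decides that written output can be read back is a separate question that
this sketch does not address.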


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
