crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Josh Wills (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-510) PCollection.materialize with Spark should use collect()
Date Wed, 08 Apr 2015 20:09:12 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485924#comment-14485924
] 

Josh Wills commented on CRUNCH-510:
-----------------------------------

Should have said: the relevant bit of code in that Spark Dataflow file is around line 162.

> PCollection.materialize with Spark should use collect()
> -------------------------------------------------------
>
>                 Key: CRUNCH-510
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-510
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Micah Whitacre
>            Assignee: Josh Wills
>
> When troubleshooting some other code noticed that when using the SparkPipeline and the
code forces a materialize() to be called...
> {code}
>       delta = Aggregate.max(scores.parallelDo(new MapFn<Pair<String, PageRankData>,
Float>() {
>         @Override
>         public Float map(Pair<String, PageRankData> input) {
>           PageRankData prd = input.second();
>           return Math.abs(prd.score - prd.lastScore);
>         }
>       }, ptf.floats())).getValue();
> {code}
> That the underlying code actually results in writing out the value to HDFS:
> {noformat}
> 15/04/08 13:59:33 INFO DAGScheduler: Job 1 finished: saveAsNewAPIHadoopFile at SparkRuntime.java:332,
took 0.223622 s
> {noformat}
> Since Spark has the method collect() on RDDs, that should accomplish a similar bit of
functionality, I wonder if we could switch to use that and cut down on the need to persist
it to HDFS.  I think this is currently happening because of sharing logic between MRPipeline
and SparkPipeline and have no context about how we could possibly break it apart easily.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message