crunch-dev mailing list archives

From "Micah Whitacre (JIRA)" <>
Subject [jira] [Created] (CRUNCH-510) PCollection.materialize with Spark should use collect()
Date Wed, 08 Apr 2015 19:12:13 GMT
Micah Whitacre created CRUNCH-510:

             Summary: PCollection.materialize with Spark should use collect()
                 Key: CRUNCH-510
             Project: Crunch
          Issue Type: Improvement
          Components: Core
            Reporter: Micah Whitacre
            Assignee: Josh Wills

While troubleshooting some other code, I noticed that when using the SparkPipeline, if the code
forces a materialize() to be called...

      delta = Aggregate.max(scores.parallelDo(new MapFn<Pair<String, PageRankData>, Float>() {
        public Float map(Pair<String, PageRankData> input) {
          PageRankData prd = input.second();
          return Math.abs(prd.score - prd.lastScore);
        }
      }, ptf.floats())).getValue();

...the underlying code actually writes the value out to HDFS:

15/04/08 13:59:33 INFO DAGScheduler: Job 1 finished: saveAsNewAPIHadoopFile at,
took 0.223622 s

Since Spark's RDDs have a collect() method that accomplishes similar functionality, I wonder
if we could switch to using that and cut down on the need to persist the value to HDFS. I
think this currently happens because of logic shared between MRPipeline and SparkPipeline,
and I have no context on how easily we could break it apart.
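For illustration only, here is a plain-Java analogy of the two strategies (this is not Crunch or Spark code; the class and method names are hypothetical): the current behavior resembles writing the values out to a file and reading them back, while a collect()-based materialize would resemble handing back the in-memory list directly.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MaterializeSketch {
    // Analogy for today's behavior: round-trip the values through the
    // filesystem, the way materialize() currently round-trips through HDFS.
    static List<Float> viaFile(List<Float> values) throws IOException {
        Path tmp = Files.createTempFile("materialize", ".txt");
        Files.write(tmp, values.stream().map(String::valueOf).collect(Collectors.toList()));
        List<Float> readBack = Files.readAllLines(tmp).stream()
                .map(Float::valueOf).collect(Collectors.toList());
        Files.delete(tmp);
        return readBack;
    }

    // Analogy for the proposal: return the values directly, the way
    // RDD.collect() brings results straight back to the driver.
    static List<Float> viaCollect(List<Float> values) {
        return values;
    }

    public static void main(String[] args) throws IOException {
        List<Float> deltas = Arrays.asList(0.5f, 0.1f, 0.9f);
        // Both paths yield the same result; the file round-trip is pure overhead.
        System.out.println(viaFile(deltas).equals(viaCollect(deltas)));
    }
}
```

Both paths produce the same values; the difference is the extra write-then-read cycle, which is the overhead the switch to collect() would eliminate.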

This message was sent by Atlassian JIRA
