Mailing-List: contact dev-help@crunch.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@crunch.apache.org
Date: Wed, 8 Apr 2015 19:12:13 +0000 (UTC)
From: "Micah Whitacre (JIRA)" <jira@apache.org>
To: crunch-dev@incubator.apache.org
Message-ID: <JIRA.12819428.1428520277000.30426.1428520333388@Atlassian.JIRA>
In-Reply-To: <JIRA.12819428.1428520277000@Atlassian.JIRA>
References: <JIRA.12819428.1428520277000@Atlassian.JIRA>
 <JIRA.12819428.1428520277742@arcas>
Subject: [jira] [Created] (CRUNCH-510) PCollection.materialize with Spark
 should use collect()
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

Micah Whitacre created CRUNCH-510:
-------------------------------------

             Summary: PCollection.materialize with Spark should use collect()
                 Key: CRUNCH-510
                 URL: https://issues.apache.org/jira/browse/CRUNCH-510
             Project: Crunch
          Issue Type: Improvement
          Components: Core
            Reporter: Micah Whitacre
            Assignee: Josh Wills


When troubleshooting some other code noticed that when using the SparkPipeline and the code forces a materialize() to be called...

{code}
      delta = Aggregate.max(scores.parallelDo(new MapFn<Pair<String, PageRankData>, Float>() {
        @Override
        public Float map(Pair<String, PageRankData> input) {
          PageRankData prd = input.second();
          return Math.abs(prd.score - prd.lastScore);
        }
      }, ptf.floats())).getValue();
{code}

That the underlying code actually results in writing out the value to HDFS:

{noformat}
15/04/08 13:59:33 INFO DAGScheduler: Job 1 finished: saveAsNewAPIHadoopFile at SparkRuntime.java:332, took 0.223622 s
{noformat}

Since Spark has the method collect() on RDDs, that should accomplish a similar bit of functionality, I wonder if we could switch to use that and cut down on the need to persist it to HDFS.  I think this is currently happening because of sharing logic between MRPipeline and SparkPipeline and have no context about how we could possibly break it apart easily.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)