Mailing-List: contact dev-help@crunch.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@crunch.apache.org
Date: Wed, 23 Sep 2015 22:44:04 +0000 (UTC)
From: "Josh Wills (JIRA)" <jira@apache.org>
To: crunch-dev@incubator.apache.org
Message-ID: <JIRA.12861419.1441234463000.55848.1443048244682@Atlassian.JIRA>
In-Reply-To: <JIRA.12861419.1441234463000@Atlassian.JIRA>
References: <JIRA.12861419.1441234463000@Atlassian.JIRA>
 <JIRA.12861419.1441234463983@arcas>
Subject: [jira] [Updated] (CRUNCH-557) Fix file distribution from HDFS in
 Crunch-on-Spark
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/CRUNCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Wills updated CRUNCH-557:
------------------------------
    Attachment: CRUNCH-557b.patch

Hey [~smungre], try this one out and let me know if it does the trick for the BloomFilter test.

> Fix file distribution from HDFS in Crunch-on-Spark
> --------------------------------------------------
>
>                 Key: CRUNCH-557
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-557
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Josh Wills
>         Attachments: CRUNCH-557.patch, CRUNCH-557a.patch, CRUNCH-557b.patch
>
>
> From the user list:
> I was trying to determine the effect of changing the JoinStrategy on a Spark pipeline. My pipeline works fine with DefaultJoinStrategy, but I could not get it working with MapSideJoinStrategy or BloomFilterJoinStrategy. For MapSideJoinStrategy I get an exception [1] on the driver itself, and for BloomFilterJoinStrategy I get exceptions [2] in one of the stages. I have not made any configuration changes, but I did run tests with datasets of different sizes to ensure that my PCollection is small enough to fit in memory. I am running Spark in yarn-client mode with Crunch 0.11.0-cdh5.4.2.
> [1] https://gist.github.com/anonymous/15d6c691b743ad392d42
> [2] https://gist.github.com/anonymous/b02a82401a30a69f1cff
> The bug is in the SparkRuntime.distributeFiles method, which needs to include a scheme for the URI it's handing to Spark.
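
To illustrate the kind of fix described above, here is a minimal sketch of giving a scheme-less path an explicit scheme before passing it on. This is not the attached patch and not the actual SparkRuntime code: the class name UriSchemeSketch, the method withScheme, the host "namenode:8020", and the file path are all hypothetical, and the sketch uses only java.net.URI rather than the Hadoop or Spark APIs. It assumes the input path is absolute.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration only; not the CRUNCH-557 patch.
public class UriSchemeSketch {

    /**
     * Returns the given path as a URI with an explicit scheme.
     * If the path already has a scheme it is returned unchanged;
     * otherwise the scheme and authority of defaultFs are applied.
     * Assumes an absolute path (leading '/'); java.net.URI rejects a
     * relative path combined with an authority.
     */
    public static URI withScheme(String path, URI defaultFs) throws URISyntaxException {
        URI uri = new URI(path);
        if (uri.getScheme() != null) {
            return uri; // already qualified, e.g. hdfs://... or file:/...
        }
        // Rebuild the URI with the default file system's scheme and authority.
        return new URI(defaultFs.getScheme(), defaultFs.getAuthority(),
                uri.getPath(), null, null);
    }

    public static void main(String[] args) throws URISyntaxException {
        // Assumed default file system; a real job would read it from its configuration.
        URI defaultFs = new URI("hdfs://namenode:8020");
        System.out.println(withScheme("/tmp/crunch/bloom.ser", defaultFs));
        System.out.println(withScheme("file:/tmp/local.ser", defaultFs));
    }
}
```

The point of the sketch is only that a bare path such as /tmp/crunch/bloom.ser leaves the file system ambiguous to whoever receives it, while a fully qualified URI does not.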

