crunch-dev mailing list archives

From "Micah Whitacre (JIRA)" <>
Subject [jira] [Comment Edited] (CRUNCH-557) Fix file distribution from HDFS in Crunch-on-Spark
Date Thu, 03 Sep 2015 02:52:45 GMT


Micah Whitacre edited comment on CRUNCH-557 at 9/3/15 2:52 AM:

Sure, I'll try and write up some tests for this too.

Looks like we have tests for Mapside Joins already.  Based on the fix that Josh proposed, I'm
guessing those tests passed despite the bug because the FS scheme was the same, so it didn't
matter.  I'm going to set up some tests using a MiniDFSCluster.  I think we could possibly
reuse the instance that SparkHFileTargetIT already gets as a by-product of the
HBaseTestingUtility.

was (Author: mkwhitacre):
Sure, I'll try and write up some tests for this too.

> Fix file distribution from HDFS in Crunch-on-Spark
> --------------------------------------------------
>                 Key: CRUNCH-557
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Josh Wills
>         Attachments: CRUNCH-557.patch
> From the user list:
> I was trying to determine the effect of changing JoinStrategy on a Spark pipeline. I noticed
> that my pipeline works fine with DefaultJoinStrategy, however I could not get it working
> with MapSideJoinStrategy and BloomFilterJoinStrategy. For MapSideJoinStrategy I get an exception[1]
> on the driver itself and for BloomFilterJoinStrategy I get exceptions[2] in one of the stages.
> I have not tried any configuration changes, but I did run tests with datasets of different
> sizes to ensure that my PCollection is small enough to fit in memory. I am running Spark in
> yarn-client mode with Crunch 0.11.0-cdh5.4.2.
> [1]
> [2]
> The bug is in the SparkRuntime.distributeFiles method, which needs to include a scheme
> for the URI it's handing to Spark.
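
To illustrate what "include a scheme" means here, below is a minimal sketch using only
java.net.URI. It is not the content of CRUNCH-557.patch. The qualify helper, the class name,
and the hdfs://namenode:8020 address are all hypothetical, chosen just for the example. A bare
path such as /tmp/crunch/side-input carries no scheme, so the consumer has to guess which
filesystem it lives on; qualifying it against the default filesystem removes the guess.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class QualifyUri {

    /**
     * Hypothetical helper: returns a URI string that always carries a scheme.
     * A path that already has a scheme is returned unchanged; otherwise the
     * scheme and authority of defaultFs are applied to it.
     */
    static String qualify(String path, URI defaultFs) {
        URI uri = URI.create(path);
        if (uri.getScheme() != null) {
            return uri.toString();
        }
        try {
            // Rebuild the URI with an explicit scheme and authority.
            // Throws if the path is relative, which we surface as a bad argument.
            return new URI(defaultFs.getScheme(), defaultFs.getAuthority(),
                           uri.getPath(), null, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot qualify path: " + path, e);
        }
    }

    public static void main(String[] args) {
        // Assumed default filesystem; a real job would read this from its configuration.
        URI defaultFs = URI.create("hdfs://namenode:8020");

        // Scheme-less path: gets the default filesystem's scheme and authority.
        System.out.println(qualify("/tmp/crunch/side-input", defaultFs));

        // Already-qualified path: left as it is.
        System.out.println(qualify("file:///tmp/local", defaultFs));
    }
}
```

In code that already depends on Hadoop, FileSystem.makeQualified(Path) is the usual way to get
the same effect; whether the actual patch uses it is not stated in this message.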

