crunch-user mailing list archives

From Josh Wills
Subject Re: HFileOutputFormatForCrunch with spark pipeline
Date Thu, 13 Aug 2015 03:10:40 GMT
Hey Surbhi,

I think it's just a bug: Crunch-on-Spark should be handling the
partitioner stuff correctly w/o requiring you to write your own. I think
the problem is that we set the location of the partition file (the one that
the code is mad that it can't find in your gist) inside of the
GroupingOptions class, and we're not updating the Configuration object that
the Spark job is going to use w/the location of that file in the same way
we do on MapReduce. I'll file a bug for it and see if I can't come up w/a
fix and unit test tomorrow.
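
To make the suspected mechanism concrete, here is a minimal sketch in plain
Java. It is not Crunch's code: GroupingSettings, buildJobConf,
resolvePartitionFile, the key "example.partition.file", and the path
"/tmp/partitions" are all hypothetical stand-ins. The only thing it shows is
that a setting stored on the per-grouping options is invisible to the
partitioner unless the runtime copies it into the job's configuration.

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionFileConfSketch {
    // Hypothetical key; stands in for whatever key the partitioner reads.
    static final String PARTITION_FILE_KEY = "example.partition.file";

    /** Hypothetical stand-in for per-grouping options carrying extra config entries. */
    static final class GroupingSettings {
        final Map<String, String> extraConf = new HashMap<>();
    }

    /** Stand-in for a partitioner that looks up its partition file in the job configuration. */
    static String resolvePartitionFile(Map<String, String> jobConf) {
        String path = jobConf.get(PARTITION_FILE_KEY);
        if (path == null) {
            throw new IllegalStateException("partition file not set in job configuration");
        }
        return path;
    }

    /**
     * Builds the configuration a job would run with. When applyGroupingConf is
     * false, the grouping settings are never copied over, which models the
     * suspected Spark-side omission.
     */
    static Map<String, String> buildJobConf(Map<String, String> base,
                                            GroupingSettings settings,
                                            boolean applyGroupingConf) {
        Map<String, String> conf = new HashMap<>(base);
        if (applyGroupingConf) {
            conf.putAll(settings.extraConf);
        }
        return conf;
    }

    public static void main(String[] args) {
        GroupingSettings settings = new GroupingSettings();
        settings.extraConf.put(PARTITION_FILE_KEY, "/tmp/partitions"); // placeholder path

        // Settings copied into the job configuration: the lookup succeeds.
        Map<String, String> copied = buildJobConf(new HashMap<>(), settings, true);
        System.out.println("with copy: " + resolvePartitionFile(copied));

        // Settings not copied: the lookup fails, much like a missing partition file.
        Map<String, String> notCopied = buildJobConf(new HashMap<>(), settings, false);
        try {
            resolvePartitionFile(notCopied);
        } catch (IllegalStateException e) {
            System.out.println("without copy: " + e.getMessage());
        }
    }
}
```

If the diagnosis above is right, the eventual fix would amount to the
"copy" branch: applying the grouping-level settings to the Configuration
the Spark job uses.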


On Wed, Aug 12, 2015 at 10:45 AM, Surbhi Mungre wrote:

> I am converting an MRPipeline to a SparkPipeline with these[1] instructions.
> My SparkPipeline fails with this[2] exception. In my pipeline I am trying
> to write to HBase using HFiles. IIUC, the M/R job that creates HFiles uses a
> custom partitioner. I am not sure how Crunch translates this to Spark. From
> the exception stack trace it looks like Spark is using the M/R partitioner. I
> am completely new to Spark, but I think I will have to create a custom Spark
> partitioner and use it instead. When I am converting an MRPipeline to a
> SparkPipeline, if an M/R job uses a custom partitioner, will Crunch handle it?
> [1]
> [2]
> Thanks,
> Surbhi

Director of Data Science
Cloudera
Twitter: @josh_wills
