crunch-user mailing list archives

From Micah Whitacre <mkwhita...@gmail.com>
Subject Re: ClassNotFoundException: Class org.apache.crunch.impl.mr.run.CrunchMapper
Date Tue, 02 Dec 2014 22:24:29 GMT
>> The primary job, which implements Tool, is able to run, it's just the
>> jobs launched by the doFn() which fail.

You mean the jobs launched by the pipeline.run()/done() calls, right, and not
jobs launched from inside a DoFn?  The reason I'm asking is that if you are
launching jobs from within a DoFn, that might relate to some issues.

As far as Oozie and Crunch integration goes, you typically specify the driver
class when creating the MRPipeline instance.  That helps Crunch find the jar
containing the driver and push it to the DistributedCache automatically.  If
the job needs additional dependency jars to run, I believe those need to be
specified through the "-libjars" argument when launching.[1]  That setting
flows through the Configuration object that Tool/ToolRunner pass in, which is
ideally the one you use to create your Pipeline.
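
To make that concrete, here's a minimal sketch of that pattern; the class and
paths are hypothetical, not from the original thread, and it assumes
crunch-core and the Hadoop client libraries are on the compile classpath:

```java
// Hypothetical driver sketch: pass the driver class and the ToolRunner-supplied
// Configuration to MRPipeline, so Crunch can locate the containing jar and
// honor options such as -libjars.
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyCrunchDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // getConf() returns the Configuration that ToolRunner populated,
    // including anything passed via -libjars or -D options.
    Pipeline pipeline = new MRPipeline(MyCrunchDriver.class, getConf());
    // ... build the pipeline here (read, parallelDo, write) ...
    return pipeline.done().succeeded() ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyCrunchDriver(), args));
  }
}
```

Launched along the lines of
`hadoop jar my-app.jar MyCrunchDriver -libjars dep1.jar,dep2.jar <other args>`,
where ToolRunner's GenericOptionsParser picks up the -libjars list before
run() is invoked.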

I haven't checked it out in a bit, but you could look at tools like Kite,
which has a Maven plugin that can help generate the "-libjars" command line
options and would handle the DistributedCache for you.  Last I looked it had
some limitations out of the box, but it could be a pattern you could
emulate.[2]

Crunch does have a class DistCache[3] with a few convenience methods for
pushing those files into HDFS.
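
For example, a hedged sketch of what that might look like before creating the
pipeline; the jar paths are illustrative, and it assumes the static
addJarToDistributedCache/addJarDirToDistributedCache helpers on
org.apache.crunch.util.DistCache:

```java
// Fragment to run in the driver before constructing the MRPipeline;
// the Configuration should be the same one passed to the pipeline.
import java.io.File;
import org.apache.crunch.util.DistCache;
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Push a single dependency jar to the distributed cache:
DistCache.addJarToDistributedCache(conf,
    new File("/local/path/crunch-core-0.11.0-hadoop2.jar"));
// Or push every jar in a local directory at once:
DistCache.addJarDirToDistributedCache(conf, "/local/path/lib");
```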

[1] -
http://stackoverflow.com/questions/23862309/oozie-throws-java-lang-classnotfoundexception
[2] -
http://kitesdk.org/docs/current/kite-maven-plugin/package-app-mojo.html
[3] -
http://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/util/DistCache.html

On Tue, Dec 2, 2014 at 4:07 PM, Mike Barretta <mike.barretta@gmail.com>
wrote:

> FWIW, I solved this by manually adding all necessary jars into the
> DistributedCache...ugly, but effective!
>
> On Wed, Nov 26, 2014 at 12:29 PM, Mike Barretta <mike.barretta@gmail.com>
> wrote:
>
>> Thank you for the quick reply.
>>
>> I am indeed using the Oozie workflow lib directory as described here:
>> http://oozie.apache.org/docs/3.3.2/WorkflowFunctionalSpec.html#a7_Workflow_Application_Deployment.
>>
>>
>> The primary job, which implements Tool, is able to run, it's just the
>> jobs launched by the doFn() which fail.  Is there a step where I might need
>> to tell the Crunch pipeline about the jars loaded by Oozie?
>>
>> On Fri, Nov 21, 2014 at 5:27 PM, Micah Whitacre <mkwhitacre@gmail.com>
>> wrote:
>>
>>> Support for a lib folder inside a jar is not guaranteed to work on all
>>> versions of Hadoop.[1]
>>>
>>> We typically go with the "uber" jar approach, using the maven-shade-plugin
>>> to unpack the Crunch dependencies and others into the assembly jar.
>>> Another approach, since you are using Oozie, is to include the jar in the
>>> workflow lib directory, which should put it on the classpath.  The last
>>> approach is to use the DistributedCache manually yourself, which will
>>> distribute the jar out to the cluster.
>>>
>>> [1] -
>>> http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
>>>
>>> On Fri, Nov 21, 2014 at 4:15 PM, Mike Barretta <mike.barretta@gmail.com>
>>> wrote:
>>>
>>>> All,
>>>>
>>>> I'm running an MRPipeline from crunch-core 0.11.0-hadoop2 on a CDH5.1
>>>> cluster via oozie.  While the main job runs okay, the doFn() it calls fails
>>>> due to the CNFE.  The jar containing my classes does indeed contain
>>>> lib/crunch-core-0.11.0-hadoop2.jar.
>>>>
>>>> Does the crunch jar need to be added to the hadoop lib on all nodes?
>>>> It seems like that would/should be unnecessary.
>>>>
>>>> Thanks,
>>>> Mike
>>>>
>>>
>>>
>>
>
