From: John Armstrong
Organization: CCRi
Date: Thu, 26 May 2011 10:45:28 -0400
Reply-To: mapreduce-user@hadoop.apache.org
Subject: Problems adding JARs to distributed classpath in Hadoop 0.20.2

Hi, everybody.
I'm running into some difficulties getting needed libraries to map/reduce tasks using the distributed cache.

Some background: I'm using Hadoop 0.20.2, which from what I can tell is a hard requirement from the client, so more recent versions are not really viable options. The code I've inherited is a Java program that sets up and runs the MR job. It currently does some nontrivial pre- and post-processing, so it would take a large refactoring before I could just run bare MR jobs rather than starting them through Java.

Further complicating matters: in practice the Java jobs are launched by Oozie, which of course does so by wrapping each one in an MR shell. The upshot is that I don't have any control over which "local" filesystem the Java job runs from, though if local files are absolutely needed I can make my Java wrappers copy things from HDFS to the Java job's local filesystem.

So here's the problem. The mappers and/or reducers need class Needed, which is contained in needed-1.0.jar, which is in HDFS:

    hdfs://.../libdir/distributed/needed-1.0.jar

The Java program executes:

    DistributedCache.addFileToClassPath(new Path("hdfs://.../libdir/distributed/needed-1.0.jar"), job.getConfiguration());

Inspecting the Job object, I find the file has been added to the cache files as expected:

    job.conf.overlay[...] = mapred.cache.files -> hdfs://.../libdir/distributed/needed-1.0.jar
    job.conf.properties[...] = mapred.cache.files -> hdfs://.../libdir/distributed/needed-1.0.jar

And the class seems to show up in the internal ClassLoader:

    job.conf.classLoader.classes[...] = "class my.class.package.Needed"

though this may just be inherited from the ClassLoader of the Java process itself (which also uses Needed).
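One way to tell those two cases apart would be to ask the JVM where the class was actually loaded from. The following is only a minimal sketch using the standard library; `ClassOrigin` and `describe` are hypothetical names made up for illustration and are not part of Hadoop.

```java
import java.security.CodeSource;

public class ClassOrigin {

    /**
     * Reports where the named class was loaded from when resolved through
     * the given loader, or states that the loader could not find it.
     * (Hypothetical helper, not a Hadoop API.)
     */
    public static String describe(String className, ClassLoader loader) {
        try {
            // initialize=false: resolve the class without running its static initializers
            Class<?> cls = Class.forName(className, false, loader);
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            // JDK core classes typically have no code source, hence the fallback text
            String location = (src == null || src.getLocation() == null)
                    ? "bootstrap or unknown location"
                    : src.getLocation().toString();
            return className + " loaded from " + location
                    + " by " + cls.getClassLoader();
        } catch (ClassNotFoundException e) {
            return className + " NOT FOUND via " + loader;
        }
    }

    public static void main(String[] args) {
        // The context class loader is an assumption about which loader matters here
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        System.out.println(describe("my.class.package.Needed", loader));
    }
}
```

Run in the launching Java process, this would show whether Needed comes from the client's own classpath. Calling `describe` from inside a task and logging the result would show whether the task JVM can see the cached jar at all.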
And yet as soon as I get into the mapreduce job itself I start getting:

    2011-05-25 17:22:56,080 INFO JobClient - Task Id : attempt_201105251330_0037_r_000043_0, Status : FAILED
    java.lang.RuntimeException: java.lang.ClassNotFoundException: my.class.package.Needed

Up until this point we've run things by keeping a directory on each node containing all the libraries we'd need, and including that directory in the Hadoop classpath. We have no such control in the deployment scenario, so our program has to hand the needed libraries to the map and reduce nodes via the distributed cache classpath.

Thanks in advance for any insight or assistance you can offer.