Date: Mon, 10 Mar 2014 11:33:16 +0800
Subject: What's the best practice for managing Hadoop dependencies?
From: Fengyun RAO
To: user@hadoop.apache.org

First of all, I should say that I am using CDH5 beta and manage my project with Maven. I have googled and read a lot, e.g.

https://issues.apache.org/jira/browse/MAPREDUCE-1700
http://www.datasalt.com/2011/05/handling-dependencies-and-configuration-in-java-hadoop-projects-efficiently/

I believe the problem is quite common: when we write an MR job, we need lots of dependencies, which may be missing from HADOOP_CLASSPATH or conflict with what is already on it.

There are several options, e.g.

1. Add all libraries to my own JAR, and set HADOOP_USER_CLASSPATH_FIRST=true
   This is what I do. It makes the jar very big, and it still doesn't work.
   E.g. I already packaged guava-16.0.jar in my jar, but it still uses guava-11.0.2.jar from HADOOP_CLASSPATH.
   Below is my build configuration.

            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>xxx.xxx.xxx.Runner</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

2. Work out which libraries are not present in HADOOP_CLASSPATH, and put them into DistributedCache
   I think it's hard to tell which ones are missing, and if there is still a conflict, which dependency takes precedence?

*What's the best practice, especially using Maven?*
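For reference, one approach often suggested for this kind of Guava conflict is to relocate the conflicting packages inside the job jar with maven-shade-plugin, so that the result no longer depends on classpath order. The fragment below is only a sketch under assumptions: it swaps maven-assembly-plugin for maven-shade-plugin, the prefix xxx.shaded is a hypothetical placeholder, and it has not been checked against CDH5 beta.

```xml
<!-- Sketch only: replaces the maven-assembly-plugin block in the <plugins> section. -->
<!-- No <version> is given here; pin one that suits your build. -->
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <relocations>
                    <!-- Rewrite Guava's packages (and references to them in the job's
                         own classes) so they cannot collide with the guava-11.0.2.jar
                         on the cluster classpath. "xxx.shaded" is a hypothetical prefix. -->
                    <relocation>
                        <pattern>com.google.common</pattern>
                        <shadedPattern>xxx.shaded.com.google.common</shadedPattern>
                    </relocation>
                </relocations>
                <transformers>
                    <!-- Keeps the same Main-Class manifest entry as the assembly setup. -->
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                        <mainClass>xxx.xxx.xxx.Runner</mainClass>
                    </transformer>
                </transformers>
            </configuration>
        </execution>
    </executions>
</plugin>
```

With relocation, the job's classes refer to xxx.shaded.com.google.common, so whichever Guava version the cluster puts first on the classpath should no longer matter for the job's own code. The jar stays large, since the dependencies are still bundled.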