hadoop-general mailing list archives

From Michael Moores <mmoo...@real.com>
Subject Re: Specifying the InputFormat class that exists in a JAR on the hdfs
Date Thu, 14 Oct 2010 17:17:45 GMT
I moved back from Hadoop 0.21.0 to 0.20.2 and things look better.

But I'm a little confused about how things are working:

My InputFormat class attempts to connect to Cassandra on localhost.
I have the JobTracker/NameNode running on one server, and a TaskTracker/DataNode running on
each of 8 other machines (slaves).
I also have Cassandra running on those Hadoop slaves.

I execute the Hadoop job on the JobTracker machine, and I get a "connection refused" exception
when it attempts to connect to Cassandra.
I expected the InputFormat to run on the 8 TaskTracker machines, but it looks like it's
just running locally.
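
A likely explanation, offered as a hedged note rather than something confirmed in this thread: in Hadoop 0.20.2 the job client calls InputFormat.getSplits() while the job is being submitted, so that part of the InputFormat runs in the JVM started by "hadoop jar" on the submitting machine. Only the record readers run inside tasks on the TaskTrackers. If nothing is listening for Cassandra clients on the submitting machine, a connection to localhost made during split calculation would be refused there. The sketch below is a plain-Java probe, not part of Hadoop or Cassandra, for checking this from the machine that launches the job. The class name CassandraReachability, the method canConnect, and the default port 9160 (assumed here to be Cassandra's Thrift RPC port) are illustrative assumptions.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: checks whether a TCP endpoint accepts connections
// from the machine this JVM runs on (for example, the job-submitting host).
class CassandraReachability {

    // Returns true if a TCP connection to host:port succeeds within the timeout.
    // A refused connection, an unresolvable host, or a timeout all return false.
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed defaults: localhost and port 9160; override on the command line.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9160;
        boolean reachable = canConnect(host, port, 2000);
        System.out.println(host + ":" + port
                + (reachable ? " accepts connections" : " is not reachable from this machine"));
    }
}
```

Running this on the JobTracker machine and then on one of the slaves should show whether "localhost" is a usable contact address from each of them. If it is only reachable on the slaves, pointing the job at the hostname of one of the Cassandra nodes instead of localhost is one way to make split calculation work from the submitting machine.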

On Oct 13, 2010, at 4:47 PM, Shrijeet Paliwal wrote:

> Also, you don't necessarily need to use the DistributedCache API from your
> application. You can pass the -libjars option on the command line to supply
> additional jars to mappers and reducers.
> 
> Take a look:
> http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Usage
> (look for the -libjars option)
> 
> On Wed, Oct 13, 2010 at 4:41 PM, Shrijeet Paliwal
> <shrijeet@rocketfuel.com> wrote:
> 
>> Do that only on the machine which is launching the job.
>> 
>> 
>> On Wed, Oct 13, 2010 at 4:38 PM, Michael Moores <mmoores@real.com> wrote:
>> 
>>> Add it to HADOOP_CLASSPATH on all machines running the task?
>>> I can try that, but I'd like users to be able to execute jobs using jars
>>> from their own hdfs directory.
>>> 
>>> 
>>> On Oct 13, 2010, at 4:21 PM, Shrijeet Paliwal wrote:
>>> 
>>>> How about adding it to HADOOP_CLASSPATH, if it's not there already?
>>>> 
>>>> On Wed, Oct 13, 2010 at 4:15 PM, Michael Moores <mmoores@real.com> wrote:
>>>> 
>>>>> FYI, I also tried the archive version:
>>>>> 
>>>>> calling DistributedCache.addArchiveToClassPath(path, configuration);
>>>>> 
>>>>> On Oct 13, 2010, at 4:12 PM, Michael Moores wrote:
>>>>> 
>>>>>> I have specified my InputFormat to be the Cassandra ColumnFamilyInputFormat,
>>>>>> and also added the Cassandra JAR to my classpath via a call to
>>>>>> DistributedCache.addFileToClassPath().
>>>>>> The JAR exists on the HDFS.
>>>>>> When I run my jar I get java.lang.NoClassDefFoundError:
>>>>>> org/apache/cassandra/hadoop/ColumnFamilyInputFormat at the line that
>>>>>> makes the job.setInputFormatClass() call.
>>>>>> 
>>>>>> I execute the job with "hadoop jar <myjar>".
>>>>>> 
>>>>>> Will I need to put my Cassandra JAR on each machine and add it to the
>>>>>> JVM startup options?
>>>>>> 
>>>>>> Here is a code snippet:
>>>>>> 
>>>>>> public class MyStats extends Configured implements Tool {
>>>>>> ...
>>>>>> public static void main(String[] args) throws Exception {
>>>>>>      Configuration configuration = new Configuration();
>>>>>>      // JAR already uploaded to HDFS (see the file listing below)
>>>>>>      Path path = new Path("/user/hadoop/profilestats/cassandra-0.7.0-beta2.jar");
>>>>>>      log.info("main: adding jars...");
>>>>>>      DistributedCache.addFileToClassPath(path, configuration);
>>>>>> 
>>>>>>      // Let ToolRunner handle generic command-line options,
>>>>>>      // and propagate the job's status as the exit code
>>>>>>      System.exit(ToolRunner.run(configuration, new MyStats(), args));
>>>>>>  }
>>>>>> 
>>>>>> public int run(String[] args) throws Exception {
>>>>>>    Job job = new Job(getConf(), "myjob");
>>>>>>    // The NoClassDefFoundError is thrown on the next line
>>>>>>    job.setInputFormatClass(org.apache.cassandra.hadoop.ColumnFamilyInputFormat.class);
>>>>>>    ..
>>>>>>    // Tool.run must return an int: 0 on success, non-zero on failure
>>>>>>    return job.waitForCompletion(true) ? 0 : 1;
>>>>>> }
>>>>>> 
>>>>>> 
>>>>>> FILE LISTING from HDFS:
>>>>>> 
>>>>>> [hadoop@kv-app02 ~]$ hadoop dfs -lsr
>>>>>> 10/10/13 14:57:47 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
>>>>>> 10/10/13 14:57:48 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
>>>>>> drwxr-xr-x   - hadoop supergroup          0 2010-10-13 14:34 /user/hadoop/profilestats
>>>>>> -rw-r--r--   3 hadoop supergroup    1841467 2010-10-13 14:34 /user/hadoop/profilestats/cassandra-0.7.0-beta2.jar
>>>>> 
>>>>> 
>>> 
>>> 
>> 

