hadoop-common-user mailing list archives

From "Earney, Billy C." <ear...@umsystem.edu>
Subject hadoop client
Date Tue, 11 Sep 2007 20:11:01 GMT
Greetings!

I've been reading through the documentation, and there is one piece of
information I'm not finding (or I've missed it). Let's say you have a
cluster of machines, one being the namenode and the rest serving as
datanodes. Does a client process (a process trying to
insert/delete/read files) need to be running on the namenode or the
datanodes, or can it run on another machine?
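
My guess is that the client machine just needs the Hadoop distribution
installed plus a hadoop-site.xml that points fs.default.name at the
namenode. Something like the following, perhaps (the host and port
namenode.example.com:9000 are made up; please correct me if this is
wrong):

  <!-- hadoop-site.xml on the client machine -->
  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>namenode.example.com:9000</value>
    </property>
  </configuration>

and then, from that machine:

  bin/hadoop dfs -ls /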

If a client process can run on another machine, can someone confirm
the guess above, or give an example of the configuration needed to do
such a thing? I've also seen that some work has been done on WebDAV
with Hadoop, and was wondering whether a machine that is not part of
the cluster could access HDFS with something like WebDAV (or a
similar tool)?

Thanks!

-----Original Message-----
From: Tom White [mailto:tom.e.white@gmail.com] 
Sent: Tuesday, September 11, 2007 2:16 PM
To: hadoop-user@lucene.apache.org
Subject: Re: Accessing S3 with Hadoop?

> I just updated the page to add a Notes section explaining the issue
> and referencing the JIRA issue # you mentioned earlier.

Great - thanks.

> > Are you able to do 'bin/hadoop-ec2 launch-cluster' then (on your
> > workstation)
> >
> > . bin/hadoop-ec2-env.sh
> > ssh $SSH_OPTS "root@$MASTER_HOST" "sed -i -e
> > \"s/$MASTER_HOST/\$(hostname)/g\"
> > /usr/local/hadoop-$HADOOP_VERSION/conf/hadoop-site.xml"
> >
> > and then check to see if the master host has been set correctly (to
> > the internal IP) in the master host's hadoop-site.xml.
>
> Well, no, since my $MASTER_HOST is now just the external DNS name of
> the first instance started in the reservation, but this is performed
> as part of my launch-hadoop-cluster script. In any case, that value is
> not set to the internal IP, but rather to the hostname portion of the
> internal DNS name.

This is a bit of a mystery to me - I'll try to reproduce it on my
workstation.
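
In the meantime, here is roughly the check I have in mind (untested,
and assuming the variables set in bin/hadoop-ec2-env.sh are in scope):

  . bin/hadoop-ec2-env.sh
  # print the master address entries from the master's hadoop-site.xml
  ssh $SSH_OPTS "root@$MASTER_HOST" \
    "grep -A 1 'fs.default.name\|mapred.job.tracker' \
    /usr/local/hadoop-$HADOOP_VERSION/conf/hadoop-site.xml"

That should show whether $MASTER_HOST was rewritten to the internal
hostname by the sed command above.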

>
> Currently, my MR jobs are failing because the reducers can't copy the
> map output and I'm thinking it might be because there is some kind of
> external address getting in there somehow. I see connections to
> external IPs in netstat -tan (72.* addresses). Any ideas about that?
> In the hadoop-site.xml files on the slaves, the address is the
> external DNS name of the master (ec2-*), but that resolves to the
> internal 10/8 address like it should.
>
> > Also, what version of the EC2 tools are you using?
>
> black:~/code/hadoop-0.14.0/src/contrib/ec2> ec2-version
> 1.2-11797 2007-03-01
> black:~/code/hadoop-0.14.0/src/contrib/ec2>

I'm using the same version so that's not it.
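
On the reducer failures: one thing worth ruling out is whether the
master's external name really does resolve to the internal 10/8
address from inside the cluster. A rough, untested check:

  . bin/hadoop-ec2-env.sh
  # resolve the master's public name from inside the cluster
  ssh $SSH_OPTS "root@$MASTER_HOST" "host $MASTER_HOST"

If that returns a public 72.* address, the reducers would indeed be
copying map output over external IPs.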

> > Instances are terminated on the basis of their AMI ID since 0.14.0.
> > See https://issues.apache.org/jira/browse/HADOOP-1504.
>
> I felt this was unsafe as it was, since it looked up the image by
> name and then mapped that back to the AMI ID. I just hacked it so you
> have to put the AMI ID in hadoop-ec2-env.sh directly. Also, the
> script as it is right now doesn't grep for 'running', so it may
> potentially shut down instances that are starting up in another
> cluster. I may just be paranoid, however ;)

Checking for 'running' is a good idea. I've relied on the version
number so folks can easily select the version of Hadoop they want on
the cluster. Perhaps the best solution would be to allow an optional
parameter to the terminate script that specifies the AMI ID, if you
need extra certainty (the script already prompts with a list of
instances to terminate). A sketch of the 'running' check follows.
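
Something along these lines would probably do (a sketch from memory,
untested; the $AMI_IMAGE variable name and the output columns should
be double-checked against your setup):

  . bin/hadoop-ec2-env.sh
  # terminate only instances of our AMI that are actually 'running'
  ec2-describe-instances | grep INSTANCE | grep "$AMI_IMAGE" | \
    grep running | awk '{print $2}' | \
    xargs ec2-terminate-instances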

Tom
