hadoop-mapreduce-user mailing list archives

From Ognen Duzlevski <og...@nengoiksvelzud.com>
Subject Re: Configuring hadoop 2.2.0
Date Wed, 29 Jan 2014 12:37:40 GMT

Thanks for your reply. What happens is this: I have about 70 files, each
about 20GB in size, in an Amazon S3 bucket. I copied them from the bucket in
a for loop, file by file, using the -distcp command from a single node.

When I look at the distribution of space consumed on the HDFS cluster now,
the node I ran the command on has 70% of its space taken up while the rest
of the nodes are at 10% local space usage. All of the nodes started out
with the same local space of 1.6TB mounted in the same exact partition
/extra (ephemeral space on an Amazon instance put into a RAID0 array).

Hence, the distribution of space is not balanced.

However, I did discover the start-balancer.sh script and ran it with
-threshold 5. It has been running since yesterday. Maybe the 5% balancing
threshold is too much?
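
To put rough numbers on that question, here is a minimal Python sketch of what a 5% threshold means for the figures above: a node counts as over-utilized when its usage exceeds the cluster average by more than the threshold. The node count (10) and the 1 MB/s per-datanode balancing bandwidth are assumptions for illustration, not values from this thread, and the drain estimate ignores everything except that bandwidth cap.

```python
# Rough sketch of how a 5% balancer threshold classifies nodes.
# NODE_COUNT and BANDWIDTH_BYTES_PER_S are assumptions, not values from this thread.

CAPACITY_TB = 1.6                     # per-node capacity, as described above
NODE_COUNT = 10                       # hypothetical cluster size
THRESHOLD = 5.0                       # percentage points, as passed to -threshold
BANDWIDTH_BYTES_PER_S = 1024 * 1024   # assumed per-datanode balancing cap (1 MB/s)


def cluster_utilization(node_pcts):
    """Average utilization; a plain mean is valid because all nodes have equal capacity."""
    return sum(node_pcts) / len(node_pcts)


def classify(node_pcts, threshold):
    """Label each node relative to the cluster average plus or minus the threshold."""
    avg = cluster_utilization(node_pcts)
    return [
        "over" if p > avg + threshold
        else "under" if p < avg - threshold
        else "ok"
        for p in node_pcts
    ]


def days_to_drain(pct, avg, threshold, capacity_tb, bandwidth):
    """Days needed to move the excess off one node at a fixed outgoing bandwidth."""
    excess_pct = max(0.0, pct - (avg + threshold))
    excess_bytes = excess_pct / 100.0 * capacity_tb * 1e12
    return excess_bytes / bandwidth / 86400.0


# One node at 70%, the rest at 10%, as observed on the cluster.
nodes = [70.0] + [10.0] * (NODE_COUNT - 1)
avg = cluster_utilization(nodes)
labels = classify(nodes, THRESHOLD)
days = days_to_drain(nodes[0], avg, THRESHOLD, CAPACITY_TB, BANDWIDTH_BYTES_PER_S)
print(f"average {avg:.1f}%, first node is {labels[0]}, ~{days:.1f} days to drain")
```

Under these assumptions the full node would need more than a week to drain, so a run that is still going after one day is not surprising. If the bandwidth cap really is the bottleneck, raising it (for example with hdfs dfsadmin -setBalancerBandwidth) would likely matter more than loosening the threshold.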


On Wed, Jan 29, 2014 at 4:08 AM, Harsh J <harsh@cloudera.com> wrote:

> I don't believe what you've been told is correct (IIUC). HDFS is an
> independent component and does not require presence of YARN (or MR) to
> function correctly.
> What do you exactly mean when you say "files are only stored on the
> node that uses the hdfs command"? Does your "hdfs dfs -ls /" show a
> local FS / result list or does it show a true HDFS directory listing?
> Your problem may simply be configuring clients right - depending on
> this.
> On Wed, Jan 29, 2014 at 12:52 AM, Ognen Duzlevski
> <ognen@nengoiksvelzud.com> wrote:
> > Hello,
> >
> > I have set up an HDFS cluster by running a name node and a bunch of data
> > nodes. I ran into a problem where the files are only stored on the node that
> > uses the hdfs command and was told that this is because I do not have a job
> > tracker and task nodes set up.
> >
> > However, the documentation for 2.2.0 does not mention any of these (at least
> > not this page:
> > http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html).
> > I browsed some of the earlier docs and they do mention job tracker nodes
> > etc.
> >
> > So, for 2.2.0 - what is the way to set this up? Do I need a separate machine
> > to be the "job tracker"? Did this job tracker node change its name to
> > something else in the current docs?
> >
> > Thanks,
> > Ognen
> --
> Harsh J
