hadoop-hdfs-user mailing list archives

From java8964 <java8...@hotmail.com>
Subject RE: Configuring hadoop 2.2.0
Date Wed, 29 Jan 2014 13:38:18 GMT
Hi, Ognen:

I noticed you asked this question before under a different subject line. You need to tell us where you see the unbalanced space: on HDFS or on the local disk.

1) HDFS is independent of MR. They are not related to each other.
2) Without MR1 or MR2 (YARN), HDFS works by itself, meaning all HDFS commands and APIs will just work.
3) But when you copy files into HDFS using distcp, you need the MR component (it doesn't matter whether it is MR1 or MR2), because distcp uses MapReduce to do the massively parallel file copying.
4) Your original problem is that when you ran the distcp command, you hadn't started the MR component in your cluster, so distcp in fact copied your files to the LOCAL file system, based on someone else's reply to your original question. I haven't tested this myself, but I am inclined to believe it.
5) If the above is true, then on the node where you ran the distcp command, you should find these files in the local file system, in the path you specified. You should check and verify that.
6) After you started yarn/the resource manager, you saw the imbalance when you ran distcp again. Where is this imbalance: in HDFS or in the local file system? List the commands and their outputs here, so we can understand your problem more clearly instead of being misled by a verbal description.
7) My suggestion is that after you start yarn/the resource managers, run some of the example MR jobs that come with Hadoop to make sure your cluster is working normally, then try your distcp command.
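
A minimal sketch of suggestion 7, assuming a stock 2.2.0 layout (the examples jar path varies by install, and `s3n://my-bucket/data` is a placeholder source with S3 credentials already configured):

```shell
# Sanity-check the MR framework by running a bundled example job first.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 10

# Once that succeeds, distcp can fan the copy out across the cluster
# instead of pulling everything through one node.
hadoop distcp s3n://my-bucket/data hdfs:///data
```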

Date: Wed, 29 Jan 2014 06:38:54 -0600
Subject: Re: Configuring hadoop 2.2.0
From: ognen@nengoiksvelzud.com
To: user@hadoop.apache.org

So, the question is: do I or don't I need to run the yarn/resource manager/node manager combination
in addition to HDFS? My impression was what you are saying - that HDFS is independent of the
MR component.

Thanks! :)

On Wed, Jan 29, 2014 at 6:37 AM, Ognen Duzlevski <ognen@nengoiksvelzud.com> wrote:


Thanks for your reply. What happens is this: I have about 70 files, each about 20GB in size,
in an Amazon S3 bucket. I pulled them from the bucket in a for loop, file by file, using the
distcp command from a single node.

When I look at the distribution of space consumed on the HDFS cluster now, the node I ran
the command on has 70% of its space taken up while the rest of the nodes are at 10% local
space usage. All of the nodes started out with the same local space of 1.6TB mounted in the
same exact partition /extra (ephemeral space on an Amazon instance put into a RAID0 array).

Hence, the distribution of space is not balanced.

However, I did discover the start-balancer.sh script and ran it with -threshold 5. It has
been running since yesterday; maybe the 5% balancing threshold is too strict?
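
A quick way to inspect the skew and re-run the balancer with a looser threshold (the commands below are standard HDFS CLI; the 10% value is just an illustrative choice):

```shell
# Show per-datanode capacity and usage to see how skewed the blocks are.
hdfs dfsadmin -report

# Re-run the balancer with a 10% threshold instead of 5%;
# a tighter threshold moves more blocks and takes much longer to converge.
start-balancer.sh -threshold 10
```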


On Wed, Jan 29, 2014 at 4:08 AM, Harsh J <harsh@cloudera.com> wrote:

I don't believe what you've been told is correct (IIUC). HDFS is an
independent component and does not require the presence of YARN (or MR) to
function correctly.

What exactly do you mean when you say "files are only stored on the
node that uses the hdfs command"? Does your "hdfs dfs -ls /" show a
local FS result list, or does it show a true HDFS directory listing?

Your problem may simply be configuring clients right - depending on
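One way to run the check described above (if the two listings match, the client is likely falling back to the local file system because fs.defaultFS is unset or wrong on that node):

```shell
# If this shows a true HDFS namespace, the client is talking to the namenode;
# if it mirrors your local root directory, the client is using file:///.
hdfs dfs -ls /

# Compare against the local filesystem explicitly.
ls /
```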
On Wed, Jan 29, 2014 at 12:52 AM, Ognen Duzlevski

<ognen@nengoiksvelzud.com> wrote:

> Hello,
>
> I have set up an HDFS cluster by running a name node and a bunch of data
> nodes. I ran into a problem where the files are only stored on the node that
> uses the hdfs command and was told that this is because I do not have a job
> tracker and task nodes set up.
>
> However, the documentation for 2.2.0 does not mention any of these (at least
> not this page:
> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html).
> I browsed some of the earlier docs and they do mention job tracker nodes
> etc.
>
> So, for 2.2.0 - what is the way to set this up? Do I need a separate machine
> to be the "job tracker"? Did this job tracker node change its name to
> something else in the current docs?
>
> Thanks,
> Ognen
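
To answer the naming question in the quote: in Hadoop 2.x the JobTracker/TaskTracker roles were replaced by YARN's ResourceManager and per-worker NodeManagers. A minimal sketch of bringing them up, assuming mapred-site.xml already sets mapreduce.framework.name to yarn:

```shell
# Start the ResourceManager and the NodeManagers listed in the slaves file.
start-yarn.sh

# Verify that the NodeManagers registered with the ResourceManager.
yarn node -list
```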


Harsh J
