hadoop-user mailing list archives

From Ognen Duzlevski <og...@nengoiksvelzud.com>
Subject Re: Configuring hadoop 2.2.0
Date Wed, 29 Jan 2014 14:05:34 GMT
Hello (and thanks for replying!) :)

On Wed, Jan 29, 2014 at 7:38 AM, java8964 <java8964@hotmail.com> wrote:

> Hi, Ognen:
> I noticed you were asking this question before under a different subject
> line. I think you need to tell us where you see the unbalanced space: is it
> on HDFS or the local disk?
> 1) HDFS is independent of MR. They are not related to each other.

OK good to know.

> 2) Without MR1 or MR2 (Yarn), HDFS should work by itself, which means all
> HDFS commands and APIs will just work.

Good to know. Does this also mean that when I put or distcp a file to
hdfs://namenode:54310/path/file - it will "decide" how to split the file
across all the datanodes so that the nodes are utilized equally in terms of

> 3) But when you try to copy files into HDFS using distcp, you need the MR
> component (it doesn't matter whether it is MR1 or MR2), as distcp uses
> MapReduce to copy files in a massively parallel way.


> 4) Your original problem is that when you ran the distcp command, you
> hadn't started the MR component in your cluster, so distcp in fact copied
> your files to the LOCAL file system, based on someone else's reply to your
> original question. I haven't tested this myself, but I am inclined to
> believe it.

Sure. But even if distcp is running in one thread, its destination is
hdfs://namenode:54310/path/file - should this not ensure equal "split" of
files across the whole HDFS cluster? Or am I delusional? :)

> 5) If the above is true, then on the node where you were running the
> distcp command you should see these files in the local file system,
> in the path you specified. You should check and verify that.

OK - so the command is this:

hadoop --config /etc/hadoop distcp s3n://<credentials>@bucket/file hdfs:// where is the HDFS Name node. I am
running this on which is one of the Data nodes and I am making
no mention of the local data node storage in this command. My expectation
is that the files obtained this way from S3 will end up distributed
somewhat evenly across all of the 16 Data nodes in this HDFS cluster. Am I
wrong to expect this?

> 6) After you start yarn/resource manager, you see the imbalance after you
> distcp the files again. Where is this imbalance? In HDFS or the local file
> system? List the commands and outputs here, so we can understand your
> problem more clearly, instead of being misled sometimes by your words.

The imbalance is as follows: the machine I run the distcp command on (one
of the Data nodes) ends up with 70+% of the space it is contributing to the
HDFS cluster occupied with these files while the rest of the data nodes in
the cluster only get 10% of their contributed space occupied. Since HDFS is
a distributed, parallel file system I would expect that the file space
occupied would be spread evenly or somewhat evenly across all the data
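
To put rough numbers on the imbalance described above, here is a minimal
sketch. It assumes that all 16 data nodes contribute the same capacity to HDFS
(an assumption, not something measured), and it rounds the observed usage to
70% on the node running distcp and 10% on each of the other 15 nodes. The
function and variable names are made up for illustration.

```python
# Rough figures from the description above: 16 data nodes, the node that
# runs distcp at about 70% used, the remaining 15 at about 10% used.
# ASSUMPTION: every node contributes the same capacity to HDFS, so a plain
# average of the percentages equals the cluster-wide utilization.
NODE_COUNT = 16
usage_percent = [70.0] + [10.0] * (NODE_COUNT - 1)


def mean_utilization(usages):
    """Average utilization across nodes, in percent."""
    return sum(usages) / len(usages)


def deviations_from_mean(usages):
    """How far each node sits from the average, in percentage points."""
    avg = mean_utilization(usages)
    return [u - avg for u in usages]


average = mean_utilization(usage_percent)
deviations = deviations_from_mean(usage_percent)

print(f"cluster average: {average:.2f}%")
print(f"node running distcp: {deviations[0]:+.2f} points from average")
print(f"each other node: {deviations[1]:+.2f} points from average")
```

Under these assumptions an evenly filled cluster would have every node near
the cluster average, so the node running distcp stands far above where I would
expect it to be.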

