From: Ognen Duzlevski <ognen@nengoiksvelzud.com>
To: user@hadoop.apache.org
Date: Wed, 29 Jan 2014 08:05:34 -0600
Subject: Re: Configuring hadoop 2.2.0

Hello (and thanks for replying!) :)

On Wed, Jan 29, 2014 at 7:38 AM, java8964 <java8964@hotmail.com> wrote:

> Hi, Ognen:
>
> I noticed you were asking this question before under a different subject
> line. I think you need to tell us where the unbalanced space is: is it on
> HDFS or on the local disk?

> 1) HDFS is independent of MR. They are not related to each other.

OK, good to know.

> 2) Without MR1 or MR2 (YARN), HDFS should work by itself, which means all
> HDFS commands and APIs will just work.

Good to know. Does this also mean that when I put or distcp a file to
hdfs://namenode:54310/path/file, it will "decide" how to split the file
across all the datanodes so that the nodes are utilized equally in terms of
space?
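
(As an aside, I assume one could see where the blocks of a given file
actually ended up using fsck, something along the lines of:

  hdfs fsck /test/file -files -blocks -locations

which, if I understand it correctly, lists each block of the file together
with the DataNodes holding its replicas. The /test/file path is just the
example from my distcp command further down.)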

> 3) But when you try to copy files into HDFS using distcp, you need the MR
> component (it doesn't matter whether it is MR1 or MR2), as distcp indeed
> uses MapReduce to do the massively parallel copying of files.

Understood.

> 4) Your original problem is that when you ran the distcp command you
> hadn't started the MR component in your cluster, so distcp in fact copied
> your files to the LOCAL file system, based on someone else's reply to your
> original question. I didn't test this myself before, but I kind of
> believe it.

Sure. But even if distcp is running in one thread, its destination is
hdfs://namenode:54310/path/file - should this not ensure an even "split" of
the files across the whole HDFS cluster? Or am I delusional? :)

> 5) If the above is true, then on the node where you were running the
> distcp command you should see these files in the local file system, in
> the path you specified. You should check and verify that.

OK - so the command is this:
  hadoop --config /etc/hadoop distcp s3n://<credentials>@bucket/file hdfs://10.10.0.198:54310/test/file

where 10.10.0.198 is the HDFS NameNode. I am running this on 10.10.0.200,
which is one of the DataNodes, and I am making no mention of the local
DataNode storage in this command. My expectation is that the files obtained
this way from S3 will end up distributed somewhat evenly across all of the
16 DataNodes in this HDFS cluster. Am I wrong to expect this?
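
(For reference, the per-DataNode usage can be checked with something like:

  hdfs dfsadmin -report

which, as far as I know, prints the configured capacity, DFS used and DFS
remaining for every DataNode, so any skew should be easy to see.)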
> 6) After you start the YARN resource manager, you see the imbalance after
> you distcp files again. Where is this imbalance, in HDFS or in the local
> file system? List the commands and outputs here, so we can understand
> your problem more clearly, instead of sometimes being misled by your
> words.

The imbalance is as follows: the machine I run the distcp command on (one
of the DataNodes) ends up with 70+% of the space it contributes to the HDFS
cluster occupied by these files, while the rest of the DataNodes in the
cluster only have about 10% of their contributed space occupied. Since HDFS
is a distributed, parallel file system, I would expect the occupied space
to be spread evenly, or at least somewhat evenly, across all the DataNodes.
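
(If the skew really is on HDFS rather than on the local disks, my
understanding is that the stock balancer can redistribute blocks after the
fact, roughly:

  hdfs balancer -threshold 10

where the threshold is the allowed difference, in percentage points,
between each DataNode's utilization and the average utilization of the
cluster.)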

Thanks!
Ognen