Subject: Re: HDFS question
From: Ognen Duzlevski <ognen@nengoiksvelzud.com>
To: user@hadoop.apache.org
Date: Tue, 28 Jan 2014 17:19:07 -0600

OK - I set up a ResourceManager node with a bunch of NodeManager slaves. The setup is as follows:

HDFS: machine X is the NameNode; it has 16 slaves (IPs: x.x.x.200-215).

Resources: machine Y is the ResourceManager; it has the same 16 slaves (IPs: x.x.x.200-215) as NodeManager slaves.

If I start the distcp from S3 on machine x.x.x.200, the filesystem still fills up only on that machine. How do I get this to work? What am I missing? :)

Thanks!
Ognen

On Tue, Jan 28, 2014 at 10:51 AM, Bryan Beaudreault <bbeaudreault@hubspot.com> wrote:

> Do you have a jobtracker? Without a jobtracker and tasktrackers, distcp is running in LocalRunner mode, i.e. as a single-threaded process on the local machine. The default behavior of the DFSClient is to write data locally first, with replicas being placed off-rack, then on-rack.
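>
> If you are on Hadoop 2.x / YARN rather than MRv1, the equivalent check is mapreduce.framework.name on the host you launch distcp from - it defaults to "local", which gives exactly the LocalRunner behavior above. A minimal sketch of the client-side config that submits to the cluster instead (the hostname below is a placeholder, not something from your setup):
>
>   <!-- mapred-site.xml on the host running distcp -->
>   <configuration>
>     <property>
>       <name>mapreduce.framework.name</name>
>       <value>yarn</value>
>     </property>
>   </configuration>
>
>   <!-- yarn-site.xml on the same host, pointing at the ResourceManager -->
>   <configuration>
>     <property>
>       <name>yarn.resourcemanager.hostname</name>
>       <value>resourcemanager.example.com</value>
>     </property>
>   </configuration>
>
> With that in place, the copy runs as map tasks on the NodeManagers, so blocks originate on many machines rather than one.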
>
> This would explain why everything seems to be going locally; it is also probably much slower than it could be.
>
> On Tue, Jan 28, 2014 at 11:42 AM, Ognen Duzlevski <ognen@nengoiksvelzud.com> wrote:
>
>> Hello,
>>
>> I am new to Hadoop and HDFS, so maybe I am not understanding things properly, but I have the following issue:
>>
>> I have set up a name node and a bunch of data nodes for HDFS. Each node contributes 1.6TB of space, so the total space shown on the HDFS web front end is about 25TB. I have set the replication factor to 3.
>>
>> I am downloading large files onto a single data node from Amazon's S3 using the distcp command, like this:
>>
>>   hadoop --config /etc/hadoop distcp s3n://AKIAIUHOFVALO67O6FJQ:DV86+JnmNiMGZH9VpdtaZZ8ZJQKyDxy6yKtDBLPp@data-pipeline/large_data/2013-12-02.json hdfs://10.10.0.198:54310/test/2013-12-03.json
>>
>> where 10.10.0.198 is the Hadoop NameNode.
>>
>> All I am getting is that the machine I am running these commands on (one of the data nodes) is getting all the files - they do not seem to be "spreading" around the HDFS cluster.
>>
>> Is this expected? Did I completely misunderstand the point of a parallel DISTRIBUTED file system? :)
>>
>> Thanks!
>> Ognen
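P.S. A cheap way to check where the data actually lands, in case it helps: fsck prints each block of a file along with the datanodes holding its replicas. Using the path from my distcp command above:

  hadoop fsck /test/2013-12-03.json -files -blocks -locations

If every block it lists has x.x.x.200 among its locations, that matches the write-locally-first behavior Bryan describes.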
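One more note on the capacity numbers, assuming I have the arithmetic right: 16 nodes x 1.6TB is about 25.6TB of raw space, which matches the ~25TB the web front end shows, but with replication set to 3 that works out to only about 25.6 / 3 = 8.5TB of effective storage.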