From: Bryan Beaudreault
Date: Tue, 28 Jan 2014 11:51:33 -0500
Subject: Re: HDFS question
To: "hbase-user@hadoop.apache.org"

Do you have a jobtracker? Without a jobtracker and tasktrackers, distcp runs in LocalRunner mode, i.e. as a single-threaded process on the local machine. The default behavior of the DFSClient is to write data locally first, with the remaining replicas placed off-rack and then on-rack.
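For reference, a minimal mapred-site.xml sketch (assuming MRv1, since a jobtracker is in play; the hostname below is a placeholder). When mapred.job.tracker is left at its default value of "local", any MapReduce job, distcp included, runs inside the client JVM via LocalJobRunner:

    <!-- mapred-site.xml: point clients at a real jobtracker so distcp
         launches distributed map tasks instead of running in-process -->
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <!-- placeholder host; 54311 is the conventional MRv1 jobtracker port -->
        <value>jobtracker.example.com:54311</value>
      </property>
    </configuration>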
This would explain why everything seems to be going locally; it is also probably much slower than it could be.

On Tue, Jan 28, 2014 at 11:42 AM, Ognen Duzlevski wrote:

> Hello,
>
> I am new to Hadoop and HDFS, so maybe I am not understanding things
> properly, but I have the following issue:
>
> I have set up a name node and a bunch of data nodes for HDFS. Each node
> contributes 1.6TB of space, so the total space shown on the HDFS web front
> end is about 25TB. I have set the replication to 3.
>
> I am downloading large files onto a single data node from Amazon's S3 using
> the distcp command - like this:
>
> hadoop --config /etc/hadoop distcp
> s3n://AKIAIUHOFVALO67O6FJQ:DV86+JnmNiMGZH9VpdtaZZ8ZJQKyDxy6yKtDBLPp@data-pipeline/large_data/2013-12-02.json
> hdfs://10.10.0.198:54310/test/2013-12-03.json
>
> where 10.10.0.198 is the Hadoop name node.
>
> All I am seeing is that the machine I am running these commands on (one
> of the data nodes) is getting all the files - they do not seem to be
> "spreading" around the HDFS cluster.
>
> Is this expected? Did I completely misunderstand the point of a parallel
> DISTRIBUTED file system? :)
>
> Thanks!
> Ognen
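A quick way to confirm where the replicas actually landed is fsck, which prints each block's datanode locations when given all three flags (using the destination path from the command above):

    hadoop --config /etc/hadoop fsck /test/2013-12-03.json -files -blocks -locations

If every block lists the submitting datanode as one of its locations, that is the local-write behavior described above rather than a replication problem.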