Subject: Re: HDFS question
From: Ognen Duzlevski <ognen@nengoiksvelzud.com>
To: user@hadoop.apache.org
Date: Tue, 28 Jan 2014 17:19:07 -0600

OK - I set up a ResourceManager node with a bunch of NodeManager slaves. The setup is as follows:

HDFS: machine X is the NameNode; it has 16 slaves (IPs: x.x.x.200-215).

Resources: machine Y is the ResourceManager; it has the same 16 slaves (IPs: x.x.x.200-215) as NodeManager slaves.

If I start the distcp from S3 on machine x.x.x.200, the filesystem still fills up only on that machine. How do I get this to work? What am I missing? :)

Thanks!
Ognen

On Tue, Jan 28, 2014 at 10:51 AM, Bryan Beaudreault <bbeaudreault@hubspot.com> wrote:

> Do you have a jobtracker? Without a jobtracker and tasktrackers, distcp is running in LocalRunner mode, i.e. as a single-threaded process on the local machine. The default behavior of the DFSClient is to write data locally first, with replicas being placed off-rack, then on-rack.
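>
> If you are on Hadoop 2.x / YARN rather than MRv1, the equivalent check is mapreduce.framework.name on the host you launch distcp from - it defaults to "local", which gives exactly the LocalRunner behavior above. A minimal sketch of the client-side config that submits to the cluster instead (the hostname below is a placeholder, not something from your setup):
>
>   <!-- mapred-site.xml on the host running distcp -->
>   <configuration>
>     <property>
>       <name>mapreduce.framework.name</name>
>       <value>yarn</value>
>     </property>
>   </configuration>
>
>   <!-- yarn-site.xml on the same host, pointing at the ResourceManager -->
>   <configuration>
>     <property>
>       <name>yarn.resourcemanager.hostname</name>
>       <value>resourcemanager.example.com</value>
>     </property>
>   </configuration>
>
> With that in place, the copy runs as map tasks on the NodeManagers, so blocks originate on many machines rather than one.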
>
> This would explain why everything seems to be going locally; it is also probably much slower than it could be.
>
> On Tue, Jan 28, 2014 at 11:42 AM, Ognen Duzlevski <ognen@nengoiksvelzud.com> wrote:
>
>> Hello,
>>
>> I am new to Hadoop and HDFS, so maybe I am not understanding things properly, but I have the following issue:
>>
>> I have set up a name node and a bunch of data nodes for HDFS. Each node contributes 1.6TB of space, so the total space shown on the HDFS web front end is about 25TB. I have set the replication factor to 3.
>>
>> I am downloading large files onto a single data node from Amazon's S3 using the distcp command, like this:
>>
>>   hadoop --config /etc/hadoop distcp s3n://AKIAIUHOFVALO67O6FJQ:DV86+JnmNiMGZH9VpdtaZZ8ZJQKyDxy6yKtDBLPp@data-pipeline/large_data/2013-12-02.json hdfs://10.10.0.198:54310/test/2013-12-03.json
>>
>> where 10.10.0.198 is the Hadoop NameNode.
>>
>> All I am getting is that the machine I am running these commands on (one of the data nodes) is getting all the files - they do not seem to be "spreading" around the HDFS cluster.
>>
>> Is this expected? Did I completely misunderstand the point of a parallel DISTRIBUTED file system? :)
>>
>> Thanks!
>> Ognen
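P.S. A cheap way to check where the data actually lands, in case it helps: fsck prints each block of a file along with the datanodes holding its replicas. Using the path from my distcp command above:

  hadoop fsck /test/2013-12-03.json -files -blocks -locations

If every block it lists has x.x.x.200 among its locations, that matches the write-locally-first behavior Bryan describes.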
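One more note on the capacity numbers, assuming I have the arithmetic right: 16 nodes x 1.6TB is about 25.6TB of raw space, which matches the ~25TB the web front end shows, but with replication set to 3 that works out to only about 25.6 / 3 = 8.5TB of effective storage.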