From: Bryan Beaudreault
Date: Tue, 28 Jan 2014 11:51:33 -0500
Subject: Re: HDFS question
To: "hbase-user@hadoop.apache.org"

Do you have a jobtracker? Without a jobtracker and tasktrackers, distcp runs in LocalRunner mode, i.e. as a single-threaded process on the local machine. The default behavior of the DFSClient is to write data locally first, with the remaining replicas placed off-rack and then on-rack.
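For reference, a minimal mapred-site.xml sketch (assuming MRv1, since a jobtracker is in play; the hostname below is a placeholder). When mapred.job.tracker is left at its default value of "local", any MapReduce job, distcp included, runs inside the client JVM via LocalJobRunner:

    <!-- mapred-site.xml: point clients at a real jobtracker so distcp
         launches distributed map tasks instead of running in-process -->
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <!-- placeholder host; 54311 is the conventional MRv1 jobtracker port -->
        <value>jobtracker.example.com:54311</value>
      </property>
    </configuration>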
This would explain why everything seems to be going locally; it is also probably much slower than it could be.

On Tue, Jan 28, 2014 at 11:42 AM, Ognen Duzlevski wrote:

> Hello,
>
> I am new to Hadoop and HDFS, so maybe I am not understanding things
> properly, but I have the following issue:
>
> I have set up a name node and a bunch of data nodes for HDFS. Each node
> contributes 1.6TB of space, so the total space shown on the HDFS web front
> end is about 25TB. I have set the replication to 3.
>
> I am downloading large files onto a single data node from Amazon's S3 using
> the distcp command - like this:
>
> hadoop --config /etc/hadoop distcp
> s3n://AKIAIUHOFVALO67O6FJQ:DV86+JnmNiMGZH9VpdtaZZ8ZJQKyDxy6yKtDBLPp@data-pipeline/large_data/2013-12-02.json
> hdfs://10.10.0.198:54310/test/2013-12-03.json
>
> where 10.10.0.198 is the Hadoop name node.
>
> All I am seeing is that the machine I am running these commands on (one
> of the data nodes) is getting all the files - they do not seem to be
> "spreading" around the HDFS cluster.
>
> Is this expected? Did I completely misunderstand the point of a parallel
> DISTRIBUTED file system? :)
>
> Thanks!
> Ognen
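A quick way to confirm where the replicas actually landed is fsck, which prints each block's datanode locations when given all three flags (using the destination path from the command above):

    hadoop --config /etc/hadoop fsck /test/2013-12-03.json -files -blocks -locations

If every block lists the submitting datanode as one of its locations, that is the local-write behavior described above rather than a replication problem.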