From: Ognen Duzlevski <ognen@nengoiksvelzud.com>
To: user@hadoop.apache.org
Date: Wed, 29 Jan 2014 08:05:34 -0600
Subject: Re: Configuring hadoop 2.2.0

Hello (and thanks for replying!) :)

On Wed, Jan 29, 2014 at 7:38 AM, java8964 <java8964@hotmail.com> wrote:

> Hi, Ognen:
>
> I noticed you were asking this question before under a different subject
> line. I think you need to tell us where the unbalanced space is: is it on
> HDFS or on the local disk?

> 1) HDFS is independent of MR. They are not related to each other.

OK, good to know.

> 2) Without MR1 or MR2 (YARN), HDFS should work by itself, which means all
> HDFS commands and APIs will just work.

Good to know. Does this also mean that when I put or distcp a file to
hdfs://namenode:54310/path/file, it will "decide" how to split the file
across all the datanodes so that the nodes are utilized equally in terms of
space?
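
(As an aside, I assume one could see where the blocks of a given file
actually ended up using fsck, something along the lines of:

  hdfs fsck /test/file -files -blocks -locations

which, if I understand it correctly, lists each block of the file together
with the DataNodes holding its replicas. The /test/file path is just the
example from my distcp command further down.)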

> 3) But when you try to copy files into HDFS using distcp, you need the MR
> component (it doesn't matter whether it is MR1 or MR2), as distcp indeed
> uses MapReduce to do the massively parallel copying of files.

Understood.

> 4) Your original problem is that when you ran the distcp command you
> hadn't started the MR component in your cluster, so distcp in fact copied
> your files to the LOCAL file system, based on someone else's reply to your
> original question. I didn't test this myself before, but I kind of
> believe it.

Sure. But even if distcp is running in one thread, its destination is
hdfs://namenode:54310/path/file - should this not ensure an even "split" of
the files across the whole HDFS cluster? Or am I delusional? :)

> 5) If the above is true, then on the node where you were running the
> distcp command you should see these files in the local file system, in
> the path you specified. You should check and verify that.

OK - so the command is this:
  hadoop --config /etc/hadoop distcp s3n://<credentials>@bucket/file hdfs://10.10.0.198:54310/test/file

where 10.10.0.198 is the HDFS NameNode. I am running this on 10.10.0.200,
which is one of the DataNodes, and I am making no mention of the local
DataNode storage in this command. My expectation is that the files obtained
this way from S3 will end up distributed somewhat evenly across all of the
16 DataNodes in this HDFS cluster. Am I wrong to expect this?
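
(For reference, the per-DataNode usage can be checked with something like:

  hdfs dfsadmin -report

which, as far as I know, prints the configured capacity, DFS used and DFS
remaining for every DataNode, so any skew should be easy to see.)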
> 6) After you start the YARN resource manager, you see the imbalance after
> you distcp files again. Where is this imbalance, in HDFS or in the local
> file system? List the commands and outputs here, so we can understand
> your problem more clearly, instead of sometimes being misled by your
> words.

The imbalance is as follows: the machine I run the distcp command on (one
of the DataNodes) ends up with 70+% of the space it contributes to the HDFS
cluster occupied by these files, while the rest of the DataNodes in the
cluster only have about 10% of their contributed space occupied. Since HDFS
is a distributed, parallel file system, I would expect the occupied space
to be spread evenly, or at least somewhat evenly, across all the DataNodes.
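
(If the skew really is on HDFS rather than on the local disks, my
understanding is that the stock balancer can redistribute blocks after the
fact, roughly:

  hdfs balancer -threshold 10

where the threshold is the allowed difference, in percentage points,
between each DataNode's utilization and the average utilization of the
cluster.)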

Thanks!
Ognen