From: Sindhu Hosamane <sindhuht@gmail.com>
To: user@hadoop.apache.org
Subject: Re: How to make sure data blocks are shared between 2 datanodes
Date: Mon, 26 May 2014 20:08:44 +0200

OK, thanks for that information.
As I said, I am running 2 datanodes on the same machine, so my Hadoop home has 2 conf folders,
conf and conf2, and in turn an hdfs-site.xml in each conf folder.
I guess the dfs.replication value in the hdfs-site.xml of the conf folder should be 3.
What should I have in conf2? Should it be 1 there?
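To make the question concrete, this is the kind of setting I mean in each hdfs-site.xml (just a sketch; the value 2 is only my guess from having 2 datanodes, not something confirmed in this thread):

  <?xml version="1.0"?>
  <!-- hdfs-site.xml (sketch): dfs.replication is read by the client at
       file creation time, so both conf folders would normally carry the
       same value. With only 2 datanodes, any value above 2 would leave
       blocks under-replicated. -->
  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>
  </configuration>

(As far as I understand, this only affects files written afterwards; existing files keep the replication they were created with unless it is changed with hadoop fs -setrep.)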

Sorry if the question sounds stupid, but I am unfamiliar with this kind of setup (2 datanodes on the same machine, hence 2 conf folders).


If data is split across multiple datanodes, then processing capacity would be improved (that is my guess). Since my file is only 240 KB, it occupies only one block; it cannot occupy a second block on the other datanode.
So now, does it make sense to reduce the block size so that blocks are split between the 2 datanodes, if I want to take full advantage of multiple datanodes?
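Concretely, I was thinking of first checking where the blocks of the file actually live, and then re-uploading with a smaller block size (the path and the 128 KB figure are only examples I made up; 128 KB is small enough that a 240 KB file spans two blocks, though such tiny blocks are for experiments only):

  # list each block of the file and the datanodes holding its replicas
  hadoop fsck /user/sindhu/data.csv -files -blocks -locations

  # re-upload with a per-file block size override (dfs.block.size is the
  # Hadoop 1.x property name; the value must be a multiple of 512)
  hadoop fs -D dfs.block.size=131072 -put data.csv /user/sindhu/data-small.csv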


Best Regards,
Sindhu


On 25 May 2014, at 21:47, Peyman Mohajerian <mohajeri@gmail.com> wrote:

Block sizes are typically 64 MB or 128 MB, so in your case only a single block is involved, which means that with a single replica only a single datanode will be used. The default replication factor is three, and since you only have two datanodes, you will most likely have two copies of the data on two separate datanodes.


On Sun, May 25, 2014 at 12:40 PM, Sindhu Hosamane <sindhuht@gmail.com> wrote:

Hello Friends, 

I am running multiple datanodes on a single machine.

The output of the jps command shows:
NameNode   DataNode   DataNode   JobTracker   TaskTracker   SecondaryNameNode

which confirms that 2 datanodes are up and running. I execute Cascalog queries on this 2-datanode Hadoop cluster, and I get the results of the queries too.
I am not sure if it is really using both datanodes (because I would get results with one datanode anyway).
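(If I understand the docs right, one way to check is hadoop dfsadmin -report, which should list both datanodes along with their configured and used capacity; sketched here from the Hadoop 1.x command set:)

  # prints overall cluster capacity plus one section per live datanode
  hadoop dfsadmin -report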

(I read somewhere about HDFS storing data in datanodes, along these lines:)
1) HDFS might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold.
2) Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes.

My doubts are:
* Do I have to make any configuration changes in Hadoop to tell it to share data blocks between 2 datanodes, or does it do so automatically?
* Also, my test data is not too big; it is only 240 KB. According to point 1), I don't know if such small test data can trigger automatic movement of data from one datanode to another.
* Also, what should the dfs.replication value be when I am running 2 datanodes? (I guess it's 2.)


Any advice or help would be very much appreciated.

Best Regards,
Sindhu
