Date: Tue, 9 Mar 2010 10:25:42 -0800 (PST)
From: jiang licht
Subject: Re: How namenode/datanode and jobtracker/tasktracker negotiate connection?
To: common-user@hadoop.apache.org

--- On Tue, 3/9/10, Steve Loughran wrote:

> From: Steve Loughran
> Subject: Re: How namenode/datanode and jobtracker/tasktracker negotiate connection?
> To: common-user@hadoop.apache.org
> Date: Tuesday, March 9, 2010, 7:05 AM
>
> jiang licht wrote:
> > What are the exact packets and steps used to establish a namenode/datanode connection and jobtracker/tasktracker connection?
> >
> > I am asking this due to a weird problem related to starting datanodes and tasktrackers.
> > In my case, the namenode box has 2 ethernet interfaces combined as bond0 interface with IP address of IP_A and there is an IP alias IP_B for local loopback interface as lo:1. All slave boxes sit on the same network segment as IP_B.
> >
> > The network is configured such that no slave box can reach namenode box at IP_A but namenode box can reach slave boxes (clearly can only routed from bond0). So, slave boxes always use "hdfs://IP_B:50001" as "fs.default.name" in "core-site.xml" and use IP_B:50002" for job tracker in mapred-site.xml to reach namenode box.
> >
> > There are the following 2 cases how namenode (or jobtracker) is configured on namenode box.
> >
> > Case #1: If I set "fs.default.name" to "hdfs://IP_B:50001", no slave boxes can join the cluster as data nodes because the request to IP_B:50001 failed. "telnet IP_B 50001" on slave boxes resulted in connection refused. So, on namenode box, I fired "tcpdump -i bond0 tcp port 50001" and then from a slave box did a "telnet IP_B 5001" and watched for incoming and outgoing packets on namenode box.
> >
> > Case #2: If I set "fs.default.name" to "hdfs://IP_A:50001", slave boxes can join the cluster as data nodes. And I did the same thing to use tcpdump and telnet to watch the traffic. I compared these two cases and found some difference in the traffic. So, I want to know if there is a hand-shaking stage for namenode and datanode to establish a connection and what are the packets for this purpose so that I can figure out if packets exchanged in case #1 are correct or not, which may reveal why the connection request from data node to name node fails.
> >
> > Also in Case #2, although all slave boxes can join the cluster as datanodes, no slave box can start as a tasktracker because at the beginning of starting a tasktracker, the tasktracker box uses IP_A:50001 to request connection to namenode and as mentioned above (slaves are not allowed to reach namenode at IP_A but reverse direction is ok), this cannot be done.
> > But my confusion here is that on all slave boxes "fs.default.name" is set to use IP_B:50001, how come it ended up with contacting the namenode with IP_A:50001?
> >
> > A bit complicated. But any thoughts?
> >
>
> the NN listens on the card given by the IP address of its hostname; it does not like people connecting to it using a different hostname than the one it is on (irritating, something to fix)
> It sounds like you have DNS problems. you should have a consistent mapping from hostname<-->IP Addr across the entire cluster, but the issues you have indicate this may not be the case.
>

My case is more complicated. The network is configured such that slave boxes cannot reach the master box via its "bond0" interface IP "A" (bond0 = eth0 + eth1, the only physical network cards on the master box). So the hostname has to be mapped to B, the alias IP address on the local loopback interface, which is in the same network segment as the slave boxes. Because of this, all slaves have to use B to talk to the master box.

Then if I run namenode/jobtracker on B, slaves cannot join the cluster as datanodes because the connection to the namenode cannot be established. That is why I want to know what information needs to be exchanged between namenode and datanode to establish the connection. Steve, you mentioned that the NN requires the IP of its hostname. What about the DataNode: does the DN also require that return packets from the NN come from the IP of the "fs.default.name" specified in its core-site.xml? If this is the case, it might explain why the datanode cannot talk to the namenode, because returning packets use A, the IP of the "bond0" interface.

If I run namenode/jobtracker on A, slaves are able to join the cluster as datanodes BUT somehow tasktrackers use address A to talk to the namenode, which simply fails because it is not allowed. So I am confused: why, in this case (both "fs.default.name" and "mapred.job.tracker" set to B on slaves), do slaves use A?
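
For what it is worth, the consistent hostname<-->IP mapping Steve mentions could be checked with something like the sketch below. It is only an illustration under assumptions: `local_view` and `find_inconsistent_hosts` are names made up here (they are not part of Hadoop), the node labels and the hostname "master-host" are placeholders, and the idea is simply to run `local_view` on each box, collect the results by hand, and compare them.

```python
import socket
from collections import defaultdict


def local_view(hostnames):
    """Resolve each hostname on the machine where this runs.

    Returns {hostname: ip or None}; None means the lookup failed.
    """
    view = {}
    for name in hostnames:
        try:
            view[name] = socket.gethostbyname(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            view[name] = None
    return view


def find_inconsistent_hosts(views):
    """Compare the views collected from several nodes.

    `views` maps a node label to that node's {hostname: ip} view.
    Returns {hostname: {ip: [node labels]}} for every hostname that
    resolves to more than one distinct value across the reporting nodes.
    """
    by_host = defaultdict(lambda: defaultdict(list))
    for node, view in views.items():
        for name, ip in view.items():
            by_host[name][ip].append(node)
    return {
        name: {ip: sorted(nodes) for ip, nodes in ips.items()}
        for name, ips in by_host.items()
        if len(ips) > 1
    }


if __name__ == "__main__":
    # Hypothetical views; "IP_A" and "IP_B" stand in for the real addresses.
    views = {
        "master": {"master-host": "IP_A"},
        "slave1": {"master-host": "IP_B"},
        "slave2": {"master-host": "IP_B"},
    }
    print(find_inconsistent_hosts(views))
```

If the master resolves its own hostname to A while the slaves resolve it to B, the hostname shows up in the result, which would be one possible sign of the mismatch described above. It does not say anything about which address the namenode actually binds to.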

Thanks,
Michael