Message-ID: <486A9504.3050402@yahoo-inc.com>
Date: Tue, 01 Jul 2008 13:35:16 -0700
From: Raghu Angadi
To: core-user@hadoop.apache.org
CC: Colin Evans
Subject: Re: DataXceiver: java.io.IOException: Connection reset by peer
References:
 <6eb82e0806300430u44f01a6fl12795111013f42d4@mail.gmail.com>
 <3206A0DA-5A42-4B0E-B6ED-CC00F4E8102F@metaweb.com>
In-Reply-To: <3206A0DA-5A42-4B0E-B6ED-CC00F4E8102F@metaweb.com>

The difference is that the same client behaviour results in a different
exception in 0.16 and 0.17 because of the change to use NIO sockets. The
current code ignores "SocketException", but with NIO sockets we just get
an IOException.

I will file a JIRA to avoid these error messages. For now they can be
ignored.

Raghu.

Brian Karlak wrote:
>
>> 2008-06-30 19:27:45,760 ERROR org.apache.hadoop.dfs.DataNode:
>> 192.168.23.1:50010:DataXceiver: java.io.IOException: Connection reset by peer
>
>
> Hello All --
>
> We also see this behavior. The Hadoop infrastructure appears to handle
> these exceptions, insomuch as the jobs still complete normally, but it
> is disconcerting to see so many exceptions popping up in the logs.
>
> This behavior appears to have started as soon as we upgraded to 0.17.0.
> It is still occurring in yesterday's 0.17.1 release. I have not been
> able to reproduce it in the 0.16.4 or 0.16.3 releases.
>
> I'm a bit of a noob, but I wonder if it is possibly related to
> HADOOP-2346, the introduction of timeouts on socket writes? Are there
> any parameters to alter the timeout behavior? Or is the timeout hardcoded?
>
> We are also investigating HADOOP-3051 as a possible factor, considering
> that the base exception is being raised in the sun.nio.ch package.
>
> This issue is consistent and reproducible in both of our clusters. It
> appears to occur with high I/O load jobs. For instance, it occurs on
> both our current production cluster and our new 3-node cluster
> whenever we run the "sort" test in the example jobs. It does
> NOT occur when running the "pi" test.
>
> Any clues or leads would be most appreciated.
>
> Thanks,
> Brian
>
> On Jun 30, 2008, at 4:30 AM, Rong-en Fan wrote:
>
>> Hi,
>>
>> I'm using Hadoop 0.17.1 with HBase trunk, and I notice lots of exceptions
>> in Hadoop's log (it's a 3-node HDFS):
>>
>> 2008-06-30 19:27:45,760 ERROR org.apache.hadoop.dfs.DataNode:
>> 192.168.23.1:50010:DataXceiver: java.io.IOException: Connection reset by peer
>>     at sun.nio.ch.FileDispatcher.write0(Native Method)
>>     at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
>>     at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
>>     at sun.nio.ch.IOUtil.write(IOUtil.java:75)
>>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
>>     at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:53)
>>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
>>     at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:144)
>>     at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:105)
>>     at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>>     at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>     at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1774)
>>     at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1813)
>>     at org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1039)
>>     at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:968)
>>     at java.lang.Thread.run(Thread.java:619)
>>
>> It seems to me that the datanode cannot handle the incoming traffic.
>> If so, what parameters in hadoop site and/or in the OS (I'm using RHEL 4)
>> can I play with?
>>
>> Thanks,
>> Rong-En Fan
>
>
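
Note on the SocketException / IOException distinction described in the reply above: the following is only a minimal sketch of why a handler that filters on SocketException stops matching once writes go through NIO channels. The class ResetFilter and the method isClientDisconnect are hypothetical names for illustration, not Hadoop code, and matching on the message text is an assumption that depends on the platform's error wording. It is not necessarily how the DataNode was later changed.

```java
import java.io.IOException;
import java.net.SocketException;

public class ResetFilter {
    // Hypothetical helper (not part of Hadoop): guesses whether an I/O
    // failure just means the remote client closed the connection.
    public static boolean isClientDisconnect(IOException e) {
        // Classic java.net socket streams report a reset as SocketException,
        // which is the type the reply says the current code ignores.
        if (e instanceof SocketException) {
            return true;
        }
        // With NIO channels the same condition arrives as a plain
        // IOException, as in the stack trace in this thread. Assumption:
        // the OS error text is the only remaining hint, and its wording
        // is platform-dependent.
        String msg = e.getMessage();
        return msg != null && msg.contains("Connection reset by peer");
    }

    public static void main(String[] args) {
        System.out.println(isClientDisconnect(new SocketException("Connection reset")));
        System.out.println(isClientDisconnect(new IOException("Connection reset by peer")));
        System.out.println(isClientDisconnect(new IOException("No space left on device")));
    }
}
```

A check like this would only decide the log level for the message; any real failure (the third case above) should still be reported as an error.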