From: "Joseph Naegele"
Subject: SocketTimeoutException in DataXceiver
Date: Tue, 20 Dec 2016 11:11:55 -0500

Hi folks,

I'm experiencing the exact symptoms of HDFS-770 (https://issues.apache.org/jira/browse/HDFS-770) using Spark and a basic HDFS deployment. Everything is running locally on a single machine, on vanilla Hadoop 2.7.3. The HDFS deployment consists of a single 8 TB disk with replication disabled. My Spark job uses a Hive ORC writer to write a dataset to disk; the dataset itself is < 100 GB uncompressed, ~17 GB compressed.

It does not appear to be a Spark issue. The datanode's logs show that it receives the first ~500 packets for a block, then nothing for a minute, until the default channel read timeout of 60000 ms causes the exception:

2016-12-19 18:36:50,632 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-1695049761-192.168.2.211-1479228275669:blk_1073957413_216632 received exception java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read.
ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:50010 remote=/127.0.0.1:55866]
2016-12-19 18:36:50,632 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: lamport.grierforensics.com:50010:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:55866 dst: /127.0.0.1:50010
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:50010 remote=/127.0.0.1:55866]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    ...

On the Spark side, all is well until the datanode's socket exception results in Spark experiencing a DFSOutputStream ResponseProcessor exception, followed by Spark aborting because all datanodes are bad:

2016-12-19 18:36:59.014 WARN DFSClient: DFSOutputStream ResponseProcessor exception for block BP-1695049761-192.168.2.211-1479228275669:blk_1073957413_216632
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2203)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:867)
    ...

Caused by: java.io.IOException: All datanodes 127.0.0.1:50010 are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1206)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1004)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:548)

I haven't tried adjusting the timeout yet, for the same reason given by the reporter of HDFS-770: I'm running everything locally, with no other tasks running on the system, so why would I need a socket read timeout greater than 60 seconds? I haven't observed any CPU, memory, or disk bottlenecks.

Lowering the number of cores used by Spark does help alleviate the problem, but doesn't eliminate it. That led me to suspect disk contention (i.e. too many concurrent client writers?), but again, I haven't observed any disk I/O bottlenecks at all.

Does anyone else still experience HDFS-770, and is there a general approach or solution?

Thanks
---
Joe Naegele
Grier Forensics

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
For additional commands, e-mail: user-help@hadoop.apache.org
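[Editor's note] For readers who do want to experiment with raising the timeout rather than keep the 60-second default the poster describes, a minimal hdfs-site.xml sketch follows. The key names are assumed from Hadoop 2.7.x defaults (`dfs.client.socket-timeout` for the read side, `dfs.datanode.socket.write.timeout` for the write side); the 120000 ms values are illustrative, not a recommendation, and should be verified against your distribution's hdfs-default.xml before use.

```xml
<!-- hdfs-site.xml: raise client/datanode socket timeouts above the
     60000 ms default seen in the DataXceiver exception. Values are in
     milliseconds; key names assumed from Hadoop 2.7.x. -->
<configuration>
  <property>
    <name>dfs.client.socket-timeout</name>
    <value>120000</value> <!-- socket read timeout -->
  </property>
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>120000</value> <!-- socket write timeout -->
  </property>
</configuration>
```

Both the datanodes and the client (here, the Spark executors' HDFS client) would need to see the updated configuration, so a restart of the affected services is typically required.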