Return-Path: X-Original-To: apmail-hadoop-mapreduce-user-archive@minotaur.apache.org Delivered-To: apmail-hadoop-mapreduce-user-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id E4B08CD26 for ; Fri, 16 Jan 2015 08:11:00 +0000 (UTC) Received: (qmail 86785 invoked by uid 500); 16 Jan 2015 08:10:51 -0000 Delivered-To: apmail-hadoop-mapreduce-user-archive@hadoop.apache.org Received: (qmail 86680 invoked by uid 500); 16 Jan 2015 08:10:51 -0000 Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hadoop.apache.org Delivered-To: mailing list user@hadoop.apache.org Received: (qmail 86670 invoked by uid 99); 16 Jan 2015 08:10:51 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 16 Jan 2015 08:10:51 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS,WEIRD_PORT X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of adrien.mogenet@content-square.fr designates 209.85.216.42 as permitted sender) Received: from [209.85.216.42] (HELO mail-qa0-f42.google.com) (209.85.216.42) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 16 Jan 2015 08:10:25 +0000 Received: by mail-qa0-f42.google.com with SMTP id dc16so14479151qab.1 for ; Fri, 16 Jan 2015 00:10:23 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:from:date:message-id:subject:to :content-type; bh=ybANmT5aLhKuADfZzjnFU72RAFUKns5Fy3dSOS5PVUk=; b=WVInwqh06CvdbBKXKNJyQb56jDDNNd79FsF4m7Hg6Y6SmNyEeUkGCh49U4ThxfD48J wXfUhjJv1iOhWm41Pcyy+ezhx15jvn4uRVEHaZGpbKBRCnRz9XFCodyih3rtJwcf8fR1 Ie7Qjdz/isI4sKB/4mSIOg2RWI81W78qDTZhb3PDTVq7seFac9dNAR2TxqTM1n26me+r 84+J607VIVZ6Y7BaCIlh9ON4sqMYTJyRMy43vlr2mNy1ywz3rleElZnSbA8ctU4mcSbi EvRft38rXj2fAlTNDwmyizyaxoV1sCn7Dtwsdluq5vI0NwOsGKPw1tgMoZDQ3RgvH6nc V/ug== X-Gm-Message-State: ALoCoQkwF9QMfGmHhZTC1YGg/hr+bysvykBU6+BrdL3K+FligTamtPkG0WXsHSrC0bxhuOZ5Eh8x X-Received: by 10.229.219.74 with SMTP id ht10mr17718828qcb.25.1421395822913; Fri, 16 Jan 2015 00:10:22 -0800 (PST) MIME-Version: 1.0 Received: by 10.140.83.75 with HTTP; Fri, 16 Jan 2015 00:10:01 -0800 (PST) From: Adrien Mogenet Date: Fri, 16 Jan 2015 09:10:01 +0100 Message-ID: Subject: Poor HDFS performances: "Slow BlockReceiver write packet to mirror" To: user@hadoop.apache.org Content-Type: multipart/alternative; boundary=001a11348fcc1c45f4050cc0816a X-Virus-Checked: Checked by ClamAV on apache.org --001a11348fcc1c45f4050cc0816a Content-Type: text/plain; charset=UTF-8 Hi there, I'd like to submit a strange behavior while instanciating another new Hadoop cluster, on a new hardware stack. Once everything got installed, as soon as we try to perform any I/O operation on HDFS, we can see many of these messages within the datanode logs: 15/01/14 22:13:07 WARN datanode.DataNode: Slow BlockReceiver write packet to mirror took 6339ms (threshold=300ms) 15/01/14 22:13:26 INFO DataNode.clienttrace: src: /10.10.5.7:17276, dest: / 10.10.5.4:50010, bytes: 176285, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-832581408_1, offset: 0, srvID: af886556-96db-4b03-9b5b-cd20c3d66f5a, blockid: BP-784291941-127.0.1.1-1420922413498:blk_1073742333_1531, duration: 19383299287 Followed by the famous one: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read (...) We've suspected dual-VLAN + bonded network interfaces (2x10 GBps) ot be part of this, but, of course, we double-checked lots of these points and found nothing: iperf, dd/hdparm, increasing Xmx (8 GB), sysbench... We only found that the cluster had a pretty big `await` time on its disk when running HDFS (>500ms, correlated to our log messages), but we can't explain clearly what happened. Even if you will all suspect HDDs to be the cause of our troubles, can someone explain these log messages? We can't find anything interesting in source code, except it occurs while doing `flush or sync` (makes sense...). Setup: Hadoop 2.6.0 9 Datanodes Debian 3.2.63-2+deb7u2 x86_64 10x 1TB SAS drives OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-2~deb7u1) OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode) Best, -- *Adrien Mogenet* Head of Backend/Infrastructure adrien.mogenet@contentsquare.com (+33)6.59.16.64.22 http://www.contentsquare.com 4, avenue Franklin D. Roosevelt - 75008 Paris --001a11348fcc1c45f4050cc0816a Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
Hi there,

I'd like to submit a strange behavio= r while instanciating another new Hadoop cluster, on a new hardware stack.= =C2=A0

Once everything got installed, as soon as w= e try to perform any I/O operation on HDFS, we can see many of these messag= es within the datanode logs:

15/01/14 22:13:07 WAR= N datanode.DataNode: Slow BlockReceiver write packet to mirror took 6339ms = (threshold=3D300ms)
15/01/14 22:13:26 INFO DataNode.clienttr= ace: src: /10.10.5.7:17276, dest: /<= a href=3D"http://10.10.5.4:50010">10.10.5.4:50010, bytes: 176285, op: H= DFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-832581408_1, offset: 0, srvID: af= 886556-96db-4b03-9b5b-cd20c3d66f5a, blockid: BP-784291941-127.0.1.1-1420922= 413498:blk_1073742333_1531, duration: 19383299287

Followed by the famous one:

java.net.SocketT= imeoutException: 60000 millis timeout while waiting for channel to be ready= for read (...)

We've suspected dual-VLAN = + bonded network interfaces (2x10 GBps) ot be part of this, but, of course,= we double-checked lots of these points and found nothing: iperf, dd/hdparm= , increasing Xmx (8 GB), sysbench...
We only found that the clust= er had a pretty big `await` time on its disk when running HDFS (>500ms, = correlated to our log messages), but we can't explain clearly what happ= ened.

Even if you will all suspect HDDs to be the = cause of our troubles, can someone explain these log messages? We can't= find anything interesting in source code, =C2=A0except it occurs while doi= ng `flush or sync` (makes sense...).

Setup:
<= div>Hadoop 2.6.0
9 Datanodes
Debian 3.2.63-2+deb7u2 x86= _64
10x 1TB SAS drives
OpenJDK Runtime Enviro= nment (IcedTea 2.5.3) (7u71-2.5.3-2~deb7u1)
OpenJDK 64-Bit Server= VM (build 24.65-b04, mixed mode)

Best,
<= /div>

--
=

Adrien= Mogenet
Head of Backend/Infrastructu= re
(+33)= 6.59.16.64.22
4, avenue Franklin D. Roosevelt - 75008 Paris
--001a11348fcc1c45f4050cc0816a--