Date: Fri, 19 Jan 2018 18:56:00 +0000 (UTC)
From: "Lei (Eddy) Xu (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-13039) StripedBlockReader#createBlockReader leaks socket on IOException

    [ https://issues.apache.org/jira/browse/HDFS-13039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16332751#comment-16332751 ]

Lei (Eddy) Xu commented on HDFS-13039:
--------------------------------------

The cause can be found in the following log:

{noformat}
2018-01-17 17:01:30,158 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception while creating remote block reader, datanode 10.17.XXX-XX-XXXX2
java.io.IOException: Got error, status=ERROR, status message opReadBlock BP-437199909-10.17.206.21-1515442262037:blk_-9223372036852918159_256706 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not found for BP-437199909-10.17.206.21-1515442262037:blk_-9223372036852918159_256706, for OP_READ_BLOCK, self=/10.17.206.23:43208, remote=/10.17.206.25:20002, for file dummy, for pool BP-437199909-10.17.206.21-1515442262037 block -9223372036852918159_256706
	at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:134)
	at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:110)
	at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.checkSuccess(BlockReaderRemote.java:447)
	at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.newBlockReader(BlockReaderRemote.java:415)
	at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReader.createBlockReader(StripedBlockReader.java:127)
	at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReader.<init>(StripedBlockReader.java:83)
	at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.createReader(StripedReader.java:169)
	at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.initReaders(StripedReader.java:150)
	at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.init(StripedReader.java:133)
	at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:56)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{noformat}

In {{StripedBlockReader#createBlockReader}}, the {{Peer}} object is not closed on {{IOException}}.
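The leak and its fix can be sketched as follows. This is a minimal, self-contained illustration of the resource-handling pattern, not the actual Hadoop patch: {{Peer}} here is a simplified stand-in for {{org.apache.hadoop.hdfs.net.Peer}}, and {{newBlockReader}} for {{BlockReaderRemote#newBlockReader}}, which throws when the remote DataNode replies with an error status.

```java
import java.io.Closeable;
import java.io.IOException;

public class PeerLeakSketch {

    // Hypothetical stand-in for the DataNode-to-DataNode connection.
    static class Peer implements Closeable {
        boolean closed = false;
        @Override
        public void close() throws IOException {
            closed = true;
        }
    }

    // Stand-in for BlockReaderRemote.newBlockReader(), which fails when the
    // remote side reports e.g. ReplicaNotFoundException.
    static Object newBlockReader(Peer peer) throws IOException {
        throw new IOException("Got error, status=ERROR ...");
    }

    // Buggy shape: on IOException the Peer (and its socket) is never closed,
    // so the connection stays in CLOSE_WAIT.
    static Object createBlockReaderLeaky(Peer peer) {
        try {
            return newBlockReader(peer);
        } catch (IOException e) {
            return null;   // peer leaks here
        }
    }

    // Fixed shape: close the Peer before swallowing the exception.
    static Object createBlockReaderFixed(Peer peer) {
        try {
            return newBlockReader(peer);
        } catch (IOException e) {
            try {
                peer.close();
            } catch (IOException ignored) {
                // best-effort cleanup
            }
            return null;
        }
    }

    public static void main(String[] args) {
        Peer leaked = new Peer();
        createBlockReaderLeaky(leaked);
        Peer fixed = new Peer();
        createBlockReaderFixed(fixed);
        System.out.println("leaky path closed peer: " + leaked.closed);
        System.out.println("fixed path closed peer: " + fixed.closed);
    }
}
```

With ~10M reconstruction attempts against a bad replica, the leaky shape above accounts for the file-descriptor exhaustion described below.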
> StripedBlockReader#createBlockReader leaks socket on IOException
> ----------------------------------------------------------------
>
>                 Key: HDFS-13039
>                 URL: https://issues.apache.org/jira/browse/HDFS-13039
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, erasure-coding
>    Affects Versions: 3.0.0
>            Reporter: Lei (Eddy) Xu
>            Assignee: Lei (Eddy) Xu
>            Priority: Critical
>
> When running EC on one cluster, the DataNode has millions of {{CLOSE_WAIT}} connections
> {code:java}
> $ grep CLOSE_WAIT lsof.out | wc -l
> 10358700
> // All CLOSE_WAITs belong to the same DataNode process (pid=88527)
> $ grep CLOSE_WAIT lsof.out | awk '{print $2}' | sort | uniq
> 88527
> {code}
> And the DN cannot open any file / socket, as shown in the log:
> {noformat}
> 2018-01-19 06:47:09,424 WARN io.netty.channel.DefaultChannelPipeline: An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
> java.io.IOException: Too many open files
>         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
>         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
>         at io.netty.channel.socket.nio.NioServerSocketChannel.doReadMessages(NioServerSocketChannel.java:135)
>         at io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe.read(AbstractNioMessageChannel.java:75)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:563)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:504)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:418)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:390)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:742)
>         at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:145)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org