From issues-return-341877-archive-asf-public=cust-asf.ponee.io@hbase.apache.org Wed Apr 4 05:28:05 2018
Date: Wed, 4 Apr 2018 03:28:00 +0000 (UTC)
From: "Wei-Chiu Chuang (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-20338) WALProcedureStore#recoverLease() should have fixed sleeps and/or exponential backoff for retrying rollWriter()

    [ https://issues.apache.org/jira/browse/HBASE-20338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16424937#comment-16424937 ]

Wei-Chiu Chuang commented on HBASE-20338:
-----------------------------------------

Hi Umesh, I'd like to work on this improvement. Thanks for filing the jira.

> WALProcedureStore#recoverLease() should have fixed sleeps and/or exponential backoff for retrying rollWriter()
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-20338
>                 URL: https://issues.apache.org/jira/browse/HBASE-20338
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.0-beta-2
>            Reporter: Umesh Agashe
>            Assignee: Wei-Chiu Chuang
>            Priority: Major
>
> In our internal testing we observed that logs are getting flooded due to a continuous loop in WALProcedureStore#recoverLease():
> {code}
> while (isRunning()) {
>   // Get Log-MaxID and recover lease on old logs
>   try {
>     flushLogId = initOldLogs(oldLogs);
>   } catch (FileNotFoundException e) {
>     LOG.warn("Someone else is active and deleted logs. retrying.", e);
>     oldLogs = getLogFiles();
>     continue;
>   }
>
>   // Create new state-log
>   if (!rollWriter(flushLogId + 1)) {
>     // someone else has already created this log
>     LOG.debug("Someone else has already created log " + flushLogId);
>     continue;
>   }
> {code}
> rollWriter() fails to create a new file. Error messages in the HDFS NameNode logs around the same time:
> {code}
> INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 172.31.121.196:38508 Call#3141 Retry#0
> java.io.IOException: Exeption while contacting value generator
>         at org.apache.hadoop.crypto.key.kms.ValueQueue.getAtMost(ValueQueue.java:389)
>         at org.apache.hadoop.crypto.key.kms.ValueQueue.getNext(ValueQueue.java:291)
>         at org.apache.hadoop.crypto.key.kms.KMSClientProvider.generateEncryptedKey(KMSClientProvider.java:724)
>         at org.apache.hadoop.crypto.key.KeyProviderCryptoExtension.generateEncryptedKey(KeyProviderCryptoExtension.java:511)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem$2.run(FSNamesystem.java:2680)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem$2.run(FSNamesystem.java:2676)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
>         at org.apache.hadoop.security.SecurityUtil.doAsUser(SecurityUtil.java:477)
>         at org.apache.hadoop.security.SecurityUtil.doAsLoginUser(SecurityUtil.java:458)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.generateEncryptedDataEncryptionKey(FSNamesystem.java:2675)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2815)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2712)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:604)
>         at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.create(AuthorizationProviderProxyClientProtocol.java:115)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:412)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2226)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2222)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2220)
> Caused by: java.net.ConnectException: Connection refused (Connection refused)
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>         at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>         at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>         at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>         at java.net.Socket.connect(Socket.java:589)
>         at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:673)
>         at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
>         at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
>         at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
>         at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:264)
>         at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:367)
>         at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
>         at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1138)
>         at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1032)
>         at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
>         at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1546)
>         at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
>         at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
>         at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:338)
>         at org.apache.hadoop.crypto.key.kms.KMSClientProvider.call(KMSClientProvider.java:503)
>         at org.apache.hadoop.crypto.key.kms.KMSClientProvider.call(KMSClientProvider.java:488)
>         at org.apache.hadoop.crypto.key.kms.KMSClientProvider.access$200(KMSClientProvider.java:94)
>         at org.apache.hadoop.crypto.key.kms.KMSClientProvider$EncryptedQueueRefiller.fillQueueForKey(KMSClientProvider.java:149)
>         at org.apache.hadoop.crypto.key.kms.ValueQueue.getAtMost(ValueQueue.java:378)
>         ... 25 more
> {code}
> Both HDFS NameNode and HBase Master logs are filling up.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
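The fix the issue title asks for, fixed sleeps and/or exponential backoff between rollWriter() retries, could be sketched roughly as follows. This is only an illustrative sketch, not the actual HBASE-20338 patch: the class BackoffSketch, the method backoffMillis(), and the base/cap values chosen here are all hypothetical.

```java
// Illustrative sketch only (not the actual HBASE-20338 patch): capped
// exponential backoff to replace the tight retry loop in recoverLease().
// All names and constants here are hypothetical.
public class BackoffSketch {

    // Sleep duration for the given retry attempt: base * 2^attempt, capped at max.
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        // Clamp the shift amount so the multiplication cannot overflow a long.
        long uncapped = baseMillis * (1L << Math.min(attempt, 30));
        return Math.min(uncapped, maxMillis);
    }

    public static void main(String[] args) {
        // With a 100 ms base and a 5 s cap, the waits grow 100, 200, 400, ...
        // and then plateau at 5000 ms instead of spinning hot and flooding logs.
        for (int attempt = 0; attempt < 8; attempt++) {
            long sleepMs = backoffMillis(attempt, 100L, 5000L);
            System.out.println("attempt " + attempt + " -> sleep " + sleepMs + " ms");
            // Thread.sleep(sleepMs); // a real retry loop would wait here
        }
    }
}
```

Wiring a sleep like this into the while (isRunning()) loop above (resetting the attempt counter after a successful rollWriter()) would bound the retry rate during a transient KMS/NameNode outage without changing the loop's recovery semantics.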