From: Abhishek Das
Date: Thu, 2 Mar 2017 15:59:04 -0800
Subject: LevelDB corruption in YARN Application TimelineServer
To: yarn-dev@hadoop.apache.org

Hi,

I am running a Hadoop 2.6.0 cluster on EC2 instances, with an r3.2xlarge instance as the master node. The YARN Application TimelineServer running on the master node is throwing an exception because of LevelDB corruption. The issue seems to occur when the cluster has been up for a long time (more than 7 days). The stack trace is given below.

ERROR org.apache.hadoop.yarn.server.timeline.TimelineDataManager: Skip the timeline entity: { id: , type: TEZ_TASK_ID }
java.lang.RuntimeException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: /media/ephemeral0/hadoop-root/yarn/timeline/leveldb-timeline-store.ldb/330951.sst: No such file or directory
    at org.fusesource.leveldbjni.internal.JniDBIterator.seek(JniDBIterator.java:68)
    at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.getEntity(LeveldbTimelineStore.java:444)
    at org.apache.hadoop.yarn.server.timeline.TimelineDataManager.postEntities(TimelineDataManager.java:257)
    at org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.postEntities(TimelineWebServices.java:259)
    at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
    at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
    at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
    at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
    at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
    at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
    at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
    at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
    at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
    at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
    at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
    at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:96)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.yarn.server.timeline.webapp.CrossOriginFilter.doFilter(CrossOriginFilter.java:95)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:572)
    at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:269)
    at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:542)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1242)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)

There are a lot of .sst files in the LevelDB directory:

    sudo ls -lrt /media/ephemeral0/hadoop-root/yarn/timeline/leveldb-timeline-store.ldb/ | wc -l
    3848

After this error, the ResourceManager and the Tez ApplicationMaster are not able to post entities to the YARN ATS, so we cannot see the history of the running jobs.

Does anyone know the root cause of this LevelDB corruption and how to get rid of the issue?

Thanks in advance.

Regards,
Abhishek
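
A note on the file count quoted in the message: `ls -l | wc -l` also counts the leading `total` line and every non-.sst file in the directory (a LevelDB directory normally also holds files such as CURRENT, LOCK, LOG and MANIFEST-*), so 3848 is an upper bound on the number of .sst files rather than an exact count. A minimal sketch for getting a per-extension breakdown is shown below. The function name `count_by_extension` is hypothetical and chosen only for illustration, and the script needs read permission on the directory, which the `sudo` in the original command suggests may require root.

```python
import os
from collections import Counter


def count_by_extension(db_dir):
    """Count regular files directly inside db_dir, grouped by file extension.

    Files without an extension (for example CURRENT or LOCK) are grouped
    under the key "(none)". Subdirectories are not descended into.
    """
    counts = Counter()
    for entry in os.scandir(db_dir):
        if entry.is_file():
            ext = os.path.splitext(entry.name)[1] or "(none)"
            counts[ext] += 1
    return counts


# Example usage (path taken from the message above):
# counts = count_by_extension(
#     "/media/ephemeral0/hadoop-root/yarn/timeline/leveldb-timeline-store.ldb")
# print(counts[".sst"], "sst files;", sum(counts.values()), "files in total")
```

Comparing the `.sst` count over a few days would show whether the store keeps growing or whether table files are disappearing, which is what the `No such file or directory` error points at.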