Date: Wed, 20 May 2015 08:12:58 -0700
Subject: Re: How to know the root reason to cause RegionServer OOM?
From: Stack
To: Hbase-User

On Wed, May 20, 2015 at 1:46 AM, David chen wrote:

> Thanks Ted,
> For scenario #1, I cannot see any clue in the regionserver log file that
> a "kill -9" command was executed. Meanwhile, I think that when the JVM
> detects an OOME in the regionserver process, it creates a new thread to
> execute "kill -9 %p", and that new thread would not write to the
> regionserver log, so it is normal that the log contains no clue. Right?
> For scenario #2, dmesg also did not provide any clues. But some clues were
> seen in /var/log/messages:
> ......
> May 14 12:00:38 localhost kernel: Out of memory: Kill process 22827 (java) score 497 or sacrifice child
> May 14 12:00:38 localhost kernel: Killed process 22827, UID 483, (java) total-vm:17569220kB, anon-rss:16296276kB, file-rss:240kB
> ......
> The 22827 above is the regionserver PID.
> It looks like the regionserver itself went OOM (total-vm:17569220kB,
> anon-rss:16296276kB; the max heap size is set to 15G), so it was killed. Right?

Yes.

> But hbase has no heavy load in the cluster,

Doesn't matter. You allocated it a heap of 15G. The OS is looking for memory
and is at an extreme (swapping totally disabled?), so it starts killing random
processes. This is not an hbase issue. It is an oversubscription problem.
Google how to address.

> so I don't think it was killed because of its own OOME; instead I think
> the OS lacked memory for other applications and killed the regionserver
> to make room for them.
> I currently have no evidence to prove my idea, so I hope for more help. Thanks.

You quote all necessary evidence above.

St.Ack

> At 2015-05-20 10:04:19, "Ted Yu" wrote:
> >For scenario #1, you would see in the regionserver.out file that the
> >"kill -9" command was applied due to OOME.
> >
> >For scenario #2, can you see if dmesg provides some clue?
> >
> >Cheers
> >
> >On Tue, May 19, 2015 at 6:32 PM, David chen wrote:
> >
> >> Thanks for the replies, they indeed helped me.
> >> Another question: I think there are two possible ways the RegionServer
> >> process gets killed:
> >> 1. When the JVM detects that the memory occupied by RegionServer exceeds
> >> the max heap size, the JVM runs the command configured by the option
> >> "-XX:OnOutOfMemoryError=kill -9 %p" to kill the RegionServer process.
> >> 2. The RegionServer process has not reached the max heap size, but a new
> >> application needs to allocate memory; if memory is short, the OS will
> >> choose some processes to kill, RegionServer unfortunately becomes the
> >> first choice, and so it is killed by the OS.
> >> Is my understanding right? If so, how can I know which case applies to
> >> me?
> >> Any ideas would be appreciated!
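
For anyone who lands on this thread with the same question, the kernel line quoted from /var/log/messages can be checked mechanically. What follows is only a rough sketch, not anything shipped with HBase: the names parse_oom_kill and kb_to_gib are made up for illustration, and the pattern assumes the "Killed process" wording exactly as it appears in the message above (other kernel versions may format it differently).

```python
import re

# Pattern for the "Killed process ..." line as quoted in this thread.
# Assumption: the kernel emits exactly this layout; adjust for other versions.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+),? .*?\((?P<comm>[^)]+)\)"
    r".*?total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)


def parse_oom_kill(line):
    """Return pid, command name and memory figures from an OOM-killer line,
    or None when the line is not a 'Killed process' record."""
    match = KILLED_RE.search(line)
    if match is None:
        return None
    return {
        "pid": int(match.group("pid")),
        "comm": match.group("comm"),
        "total_vm_kb": int(match.group("total_vm")),
        "anon_rss_kb": int(match.group("anon_rss")),
    }


def kb_to_gib(kilobytes):
    """Convert a kB figure from the kernel log to GiB."""
    return kilobytes / (1024 * 1024)


# The line from the thread, joined back onto a single line.
SAMPLE = (
    "May 14 12:00:38 localhost kernel: Killed process 22827, UID 483, (java) "
    "total-vm:17569220kB, anon-rss:16296276kB, file-rss:240kB"
)

if __name__ == "__main__":
    info = parse_oom_kill(SAMPLE)
    print(info["pid"], info["comm"], round(kb_to_gib(info["anon_rss_kb"]), 2))
```

If the PID in that record matches the regionserver PID, the kernel's OOM killer ended the process (scenario #2), not the JVM's OnOutOfMemoryError hook. Here the resident size works out to roughly 15.5 GiB against a 15G max heap, which fits a fully used heap plus ordinary JVM off-heap overhead on a host with no memory to spare.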