Subject: Re: zookeeper connection hangs during shutdown
From: Bogdan Ghidireac
To: dev@hbase.apache.org
Date: Tue, 5 Apr 2011 09:31:25 +0300

Please see my answers inline ...

On Mon, Apr 4, 2011 at 8:45 PM, Stack wrote:
> On Mon, Apr 4, 2011 at 2:30 AM, Bogdan Ghidireac wrote:
>> Is it possible to add a timeout and then force a System.exit()?
>>
>
> Yes.
> Of course. Sounds bad. How do you think this scenario came about?

My M/R job reads from a table and creates a lot of data that is inserted
into a second table. Because this new table is empty and I did not split
the keys in advance, the region server where the first region was created
is hit really hard (60-100K ops/sec). The OOM exception happens during
this time, on only one or maybe two servers. The exception triggers a
server shutdown... Once the initial region splits and the traffic is
distributed, the problem no longer happens.

> Is the zk ensemble up and running still?

The ZK ensemble is running fine. I have 3 ZK servers running ZK 3.3.2.

> What's the last thing in this regionserver log?

This is the RS log: http://pastebin.com/Cvx8zS54

> Anything in the .out file?

This is the System.out/err: http://pastebin.com/gNNVUzvZ

> I've not seen this
> before but, hey, the world is a wide and wonderful place. We could
> run the zk close inside a thread and interrupt if it goes on too long
> (Let me ask the zk boys if they've seen this before too).
>

I am subscribed to the ZK list too and I have seen your email. I am using
ZK 3.3.2 ...

> St.Ack
>

Thank you,
Bogdan
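
P.S. To make the "run the zk close inside a thread and interrupt if it
goes on too long" idea concrete, here is a rough sketch of what I have in
mind. It is not taken from the HBase or ZooKeeper code: the class name
BoundedClose, the method closeWithTimeout, and the timeout values are all
made up for illustration, and the Runnable stands in for whatever call
actually closes the ZK connection.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of HBase or ZooKeeper.
public class BoundedClose {

    /**
     * Runs closeAction on a daemon thread and waits at most timeoutMs.
     * Returns true if the action returned (or threw) within the timeout,
     * false if it was still running and had to be interrupted.
     */
    public static boolean closeWithTimeout(Runnable closeAction, long timeoutMs) {
        // Daemon thread, so a close that ignores the interrupt cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-close");
            t.setDaemon(true);
            return t;
        });
        Future<?> future = executor.submit(closeAction);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the stuck close
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            // The close failed with an exception, but it did not hang.
            return true;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A close that returns immediately.
        System.out.println(closeWithTimeout(() -> { }, 1000));

        // A simulated hang: sleeps far longer than the timeout.
        System.out.println(closeWithTimeout(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException ignored) {
                // interrupted by the timeout path
            }
        }, 100));
    }
}
```

When closeWithTimeout returns false, the caller could log the problem and
then force the exit. One thing to keep in mind: System.exit() runs the
registered shutdown hooks, so if the hang happens inside a shutdown hook,
Runtime.getRuntime().halt() may be the safer last resort because it skips
them.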