Date: Wed, 19 Sep 2012 22:27:04 -0700
Subject: Re: IOException: Cannot append; log is closed -- data lost?
From: Stack
To: user@hbase.apache.org

On Tue, Sep 18, 2012 at 11:37 AM, Bryan Beaudreault wrote:
> We are running cdh3u2 on a 150 node cluster, where 50 are HBase and 100 are
> map reduce. The underlying hdfs spans all nodes.
>

This is a 0.90.4 HBase and then some, Bryan?

What was the issue serving data that you refer to? What did it look like?

...

>> 12/09/18 12:34:00 INFO regionserver.HRegionServer: Waiting on 223 regions
>> to close

Looks like we are just hanging on the tail of the HRegionServer exit until all regions are closed.

>> 12/09/18 12:34:01 ERROR regionserver.HRegionServer:
>> java.io.IOException: Cannot append; log is closed
>> at

This is odd for sure; as though the WAL is closed but we are still trying to take on edits.

>
> We saw this for a bunch of regions. This may or may not be related (thats
> what I'm trying to figure out), but now we are starting to see evidence
> that some data may not have been persisted.

The data that came in and got 'IOException: Cannot append; log is closed' is not persisted.

> For instance we use the result
> of an increment value as the id for another record, and we are starting to
> see the increment return values that we have already used as ids.
>

Do you know what region has the increment? Can you trace this region in the master log and see where it was deployed across the restart? Can you paste that regionserver's log across the restart?

> Does this exception mean that we lost data? Is there any way to recover it
> if so? I don't see any hlogs in the .corrupt folder and not sure how to
> proceed.
>

If you got an IOE putting an Increment because you could not write the WAL, that data didn't make it into HBase for sure.

> One important thing to note is that we were decommissioning 50 of the 150
> datanodes at the time, using the dfs.exclude.hosts setting in the namenode.
> I thought that was supposed to be a safe operation, so hopefully it didn't
> cause this.
>

Should be fine if done incrementally w/ time in between for replication to catch up missing replicas.

St.Ack
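
P.S. To make "incrementally w/ time in between" concrete, here is a rough sketch of the idea only, not an HBase or HDFS tool. Every name in it (decommission_in_batches, exclude, replication_caught_up, wait) is hypothetical; you would have to supply your own callbacks for updating the namenode's excludes list and for checking that missing replicas have been re-created.

```python
from typing import Callable, Iterable, List


def decommission_in_batches(
    hosts: Iterable[str],
    batch_size: int,
    exclude: Callable[[List[str]], None],
    replication_caught_up: Callable[[], bool],
    wait: Callable[[], None],
) -> List[List[str]]:
    """Exclude datanodes a few at a time, pausing between batches.

    All callbacks are hypothetical placeholders supplied by the caller:
      exclude(batch)           adds the batch to the excludes list
      replication_caught_up()  returns True once replicas are restored
      wait()                   sleeps (or polls) before checking again
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    pending = list(hosts)
    done: List[List[str]] = []
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        exclude(batch)
        # Do not start the next batch until replication has caught up.
        while not replication_caught_up():
            wait()
        done.append(batch)
    return done
```

With 50 datanodes and a batch size of 5, this would run ten rounds, each one blocking until the caller's check reports that replication has caught up.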