From: Jamie Cockrill
Date: Tue, 3 Aug 2010 14:22:51 +0100
Subject: Re: Regionserver tanked, can't seem to get master back up fully
To: user@hbase.apache.org

PS, yes that was coming from master

On 3 August 2010 14:22, Jamie Cockrill wrote:
> Hi JD,
>
> The cluster is on a separated network, I'll see if any of the traces
> remain. As for the ulimit and xceivers bit, those are setup correctly
> as per the API doc you mention.
>
> Thanks
>
> Jamie
>
> On 2 August 2010 19:18, Jean-Daniel Cryans wrote:
>> Is that coming from the master? If so, it means that it was trying to
>> write recovered data from a failed region server and wasn't able to do
>> so. It sounds bad.
>>
>> - Can we get full stack traces of that error?
>> - Did you check the datanode logs for any exception? Very often
>> (strong emphasis on "very"), it's an issue with either ulimit or
>> xcievers. Is your cluster configured per the last bullet on that page?
>> http://hbase.apache.org/docs/r0.20.6/api/overview-summary.html#requirements
>>
>> Thx
>>
>> J-D
>>
>> On Mon, Aug 2, 2010 at 6:16 AM, Jamie Cockrill wrote:
>>> Hi All,
>>>
>>> I set off a long-running loading job over the weekend and it seems to
>>> have rather destroyed my hbase cluster. Most of the nodes were down
>>> this morning and upon restarting them, I'm now persistently getting
>>> the following message every few ms in the master logs:
>>>
>>> DfsClient: Could not complete file
>>> /hbase/.logs/compute17.cluster1.lan,60020,1280518716613/a filename
>>>
>>> That file is a zero-byte file on the HDFS. The data-nodes all look
>>> fine and don't seem to have had any trouble. I'm not especially fussed
>>> about having to rebuild that table and reload it, but the trouble is
>>> now that I can't start the cluster properly so I can drop the table.
>>>
>>> Does anyone know how I can remove the table/fix these errors manually.
>>> As I said, I'm not fussed about data-loss.
>>>
>>> thanks
>>>
>>> Jamie
>>>
>>
>