Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
MIME-Version: 1.0
In-Reply-To: <AANLkTindm2gQ9Z9Mb7w2dv+RNCdkcZt-rPG6rKF2ZwDG@mail.gmail.com>
References: <AANLkTin4bUGYQrzm2Wkq6oGuAhg1CzsQkRoqPcB5FVRR@mail.gmail.com>
	<AANLkTintxDzSpy1ECXM3pO1hqH0qfdqhRRhUrX+tUKis@mail.gmail.com>
	<AANLkTindm2gQ9Z9Mb7w2dv+RNCdkcZt-rPG6rKF2ZwDG@mail.gmail.com>
From: Jamie Cockrill <jamie.cockrill@gmail.com>
Date: Tue, 3 Aug 2010 14:22:51 +0100
Message-ID: <AANLkTi==vOQCD0uFUERXTW8mhsoJ6M44_Oh=rDzwTcbF@mail.gmail.com>
Subject: Re: Regionserver tanked, can't seem to get master back up fully
To: user@hbase.apache.org
Content-Type: text/plain; charset=ISO-8859-1

PS: yes, that was coming from the master.

On 3 August 2010 14:22, Jamie Cockrill <jamie.cockrill@gmail.com> wrote:
> Hi JD,
>
> The cluster is on a separate network; I'll see if any of the traces
> remain. As for the ulimit and xceivers bit, those are set up correctly
> as per the API doc you mention.
>
> Thanks
>
> Jamie
>
> On 2 August 2010 19:18, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>> Is that coming from the master? If so, it means that it was trying to
>> write recovered data from a failed region server and wasn't able to do
>> so. It sounds bad.
>>
>> - Can we get full stack traces of that error?
>> - Did you check the datanode logs for any exception? Very often
>> (strong emphasis on "very"), it's an issue with either ulimit or
>> xcievers. Is your cluster configured per the last bullet on that page?
>> http://hbase.apache.org/docs/r0.20.6/api/overview-summary.html#requirements
>>
>> Thx
>>
>> J-D
>>
>> On Mon, Aug 2, 2010 at 6:16 AM, Jamie Cockrill <jamie.cockrill@gmail.com> wrote:
>>> Hi All,
>>>
>>> I set off a long-running loading job over the weekend and it seems to
>>> have rather destroyed my hbase cluster. Most of the nodes were down
>>> this morning and upon restarting them, I'm now persistently getting
>>> the following message every few ms in the master logs:
>>>
>>> DfsClient: Could not complete file
>>> /hbase/.logs/compute17.cluster1.lan,60020,1280518716613/a filename
>>>
>>> That file is a zero-byte file on the HDFS. The data-nodes all look
>>> fine and don't seem to have had any trouble. I'm not especially fussed
>>> about having to rebuild that table and reload it, but the trouble is
>>> now that I can't start the cluster properly so I can drop the table.
>>>
>>> Does anyone know how I can remove the table/fix these errors manually?
>>> As I said, I'm not fussed about data-loss.
>>>
>>> thanks
>>>
>>> Jamie
>>>
>>
>