hbase-user mailing list archives

From Bluemetrix Development <bmdevelopm...@gmail.com>
Subject Re: hbase shell count crashes
Date Wed, 03 Mar 2010 20:41:33 GMT
For completeness' sake, I'll update here.
The issues with shell count and rowcounter crashing were fixed by raising:
- open files to 32K (ulimit -n)
- dfs.datanode.max.xcievers to 2048
(I had overlooked this when moving to a larger cluster)
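For anyone chasing the same symptoms, a sketch of where those two settings live. File names follow the Hadoop/HBase 0.20-era conventions, and the "hadoop" user name is an assumption; adjust both for your install:

```
# /etc/security/limits.conf -- raise the open-file limit for the user
# that runs the DataNode and RegionServer processes (user name assumed)
hadoop  soft  nofile  32768
hadoop  hard  nofile  32768
```

```xml
<!-- hdfs-site.xml on each DataNode; requires a DataNode restart -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>2048</value>
</property>
```

Log out and back in (or restart the daemons from a fresh session) so the new nofile limit actually applies to the running processes.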

As for recovering from crashes, I haven't had much luck.
I'm only running a 3-server cluster, so that may be part of the problem,
but when one server goes down, it doesn't seem easy
to recover the HBase table data after getting everything restarted.
I've usually had to wipe HDFS and start from scratch.

On Wed, Feb 17, 2010 at 12:59 PM, Bluemetrix Development
<bmdevelopment@gmail.com> wrote:
> Hi, Thanks for the suggestions. I'll make note of this.
> (I've decided to reinsert, as with time constraints it is probably
> quicker than trying to debug and recover.)
> So, I guess I am more concerned about trying to prevent this from
> happening again.
> Is it possible that a shell count caused enough load to crash hbase?
> Or that nodes becoming unavailable due to heavy network load could
> cause data corruption?
>
> On Wed, Feb 17, 2010 at 12:42 PM, Michael Segel
> <michael_segel@hotmail.com> wrote:
>>
>> Try this...
>>
>> 1 run hadoop fsck /
>> 2 shut down HBase
>> 3 mv /hbase to /hbase.old
>> 4 restart HBase (optional, just as a sanity check)
>> 5 copy /hbase.old back to /hbase
>> 6 restart HBase
>>
>> This may not help, but it can't hurt.
>> Depending on the size of your hbase database, it could take a while.
>> On our sandbox, we suffer from ZooKeeper and HBase failures when there's a
>> heavy load on the network. (Don't ask; the sandbox was just a play area on
>> whatever hardware we could find.) Doing this copy cleaned up a database
>> that wouldn't fully come up. It may do the same for you.
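The move-aside procedure above can be sketched as a command sequence. The 0.20-era start/stop scripts and the `/hbase` root path are assumptions; check them against your installation before running anything:

```shell
# 1. check HDFS health first; don't proceed past unexplained corruption
hadoop fsck /

# 2. stop HBase cleanly
stop-hbase.sh

# 3. move the HBase root directory aside (a metadata-only rename in HDFS)
hadoop fs -mv /hbase /hbase.old

# 4. optional sanity check: start HBase against the empty root, then stop it
start-hbase.sh
stop-hbase.sh

# 5. copy the data back -- a real copy, so /hbase.old survives as a backup;
#    on a large table this is the slow step
hadoop fs -cp /hbase.old /hbase

# 6. restart HBase
start-hbase.sh
```

Keeping `/hbase.old` around until the cluster is verified healthy is the point of copying rather than moving back.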
>>
>> HTH
>>
>> -Mike
>>
>>
>>> Date: Wed, 17 Feb 2010 10:50:59 -0500
>>> Subject: Re: hbase shell count crashes
>>> From: bmdevelopment@gmail.com
>>> To: hbase-user@hadoop.apache.org
>>>
>>> Hi,
>>> So after a few more attempts and crashes from trying the shell count,
>>> I ran the MR rowcounter and noticed that the number of rows was lower
>>> than it should have been, even on smaller test tables.
>>> This led me to start looking through the logs and to perform a few
>>> compactions on .META. and restarts of HBase. Unfortunately, two tables
>>> are now entirely missing; they no longer show up under the shell's list command.
>>>
>>> I'm not entirely sure what to look for in the logs, but I've noticed a
>>> lot of this in the master log.
>>>
>>> 2010-02-16 23:59:25,856 WARN org.apache.hadoop.hbase.master.HMaster:
>>> info:regioninfo is empty for row:
>>> UserData_0209,e834d76faddee14b,1266316478685; has keys: info:server,
>>> info:serverstartcode
>>>
>>> Came across this in the regionserver log:
>>> 2010-02-16 23:58:33,851 WARN
>>> org.apache.hadoop.hbase.regionserver.Store: Skipping
>>> hdfs://upp1.bmeu.com:50001/hbase/.META./1028785192/info/4080287239754005013
>>> because its empty. HBASE-646 DATA LOSS?
>>>
>>> Any ideas on whether the tables are recoverable? It's not a big deal for me
>>> to re-insert from scratch, as this is still in the testing phase,
>>> but I'd be curious to find out what led to these issues so I can
>>> fix them, or at least not repeat them.
>>>
>>> Thanks
>>>
>>> On Tue, Feb 16, 2010 at 2:43 PM, Bluemetrix Development
>>> <bmdevelopment@gmail.com> wrote:
>>> > Hi, Thanks for the explanation.
>>> >
>>> > Yes, I was able to cat the file from all three of my region servers:
>>> > hadoop fs -cat /hbase/.META./1028785192/info/8254845156484129698 > tmp.out
>>> >
>>> > I have never come across this before, but this is the first time I've
>>> > had 7M rows in the DB.
>>> > Is there anything going on that would bog down the network and cause
>>> > this file to be unreachable?
>>> >
>>> > I have 3 servers. The master is running the JobTracker, NameNode and HMaster,
>>> > and all 3 are running DataNodes, RegionServers and ZooKeeper.
>>> >
>>> > Appreciate the help.
>>> >
>>> > On Tue, Feb 16, 2010 at 2:11 PM, Jean-Daniel Cryans <jdcryans@apache.org>
wrote:
>>> >> This line
>>> >> java.io.IOException: java.io.IOException: Could not obtain block:
>>> >> blk_-6288142015045035704_88516
>>> >> file=/hbase/.META./1028785192/info/8254845156484129698
>>> >>
>>> >> Means that the region server wasn't able to fetch a block for the .META.
>>> >> table (the table where all region addresses are stored). Are you able to
>>> >> open that file using the bin/hadoop command-line utility?
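Beyond cat-ing the file, one way to see which block is missing and where its replicas are supposed to live is fsck with its standard verbose flags (path taken from the error above):

```shell
# list files, their blocks, and the DataNodes holding each replica
hadoop fsck /hbase/.META. -files -blocks -locations
```

A block reported as MISSING here, while `fs -cat` succeeds later, would point at transient DataNode unavailability (e.g. xcievers or ulimit exhaustion) rather than lost data.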
>>> >>
>>> >> J-D
>>> >>
>>> >> On Tue, Feb 16, 2010 at 11:08 AM, Bluemetrix Development <bmdevelopment@gmail.com> wrote:
>>> >>
>>> >>
>>> >>> Hi,
>>> >>> I'm currently trying to run a count in the hbase shell, and it crashes
>>> >>> right towards the end.
>>> >>> This in turn seems to crash HBase, or at least causes the regionservers
>>> >>> to become unavailable.
>>> >>>
>>> >>> Here's the tail end of the count output:
>>> >>> http://pastebin.com/m465346d0
>>> >>>
>>> >>> I'm on version 0.20.2 and running this command:
>>> >>> > count 'table', 1000000
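For reference, the second argument to the shell's count is a progress-reporting interval, and on large tables the MapReduce row counter is usually the safer route, since it scans region by region instead of streaming every row through a single shell client. Command forms below are for the 0.20-era tooling (the table name is a placeholder):

```shell
# shell count, reporting progress every 1,000,000 rows
echo "count 'table', 1000000" | hbase shell

# MapReduce alternative: runs as a distributed job
hbase org.apache.hadoop.hbase.mapreduce.RowCounter table
```

If the two disagree on the row count, as reported earlier in this thread, that is itself a signal worth chasing in the logs.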
>>> >>>
>>> >>> Anyone with similar issues or ideas on this?
>>> >>> Please let me know if you need further info.
>>> >>> Thanks
>>> >>>
>>> >>
>>> >
>>
>
