hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: hbase shell count crashes
Date Wed, 03 Mar 2010 22:58:50 GMT
Mmm then you might be hitting http://issues.apache.org/jira/browse/HBASE-2244

As you can see we are working hard to stabilize HBase as much as possible ;)

J-D

On Wed, Mar 3, 2010 at 2:56 PM, Bluemetrix Development
<bmdevelopment@gmail.com> wrote:
> Yes, upgrading to 0.20.3 should be added to my list above. I have
> since done this.
> Thanks very much for that.
>
> On Wed, Mar 3, 2010 at 4:44 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>> There were a lot of problems with Hadoop pre-0.20.2 on clusters
>> smaller than 10 nodes, especially 3-node clusters experiencing a node
>> failure. If you are talking about just the region servers: you are on
>> 0.20.2, and 0.20.3 has stability fixes.
>>
>> J-D
>>
>> On Wed, Mar 3, 2010 at 12:41 PM, Bluemetrix Development
>> <bmdevelopment@gmail.com> wrote:
>>> For completeness sake, I'll update here.
>>> The issues with shell counts and rowcounter crashing were fixed by upping
>>> - open files to 32K (ulimit -n)
>>> - dfs.datanode.max.xcievers to 2048
>>> (I had overlooked this when moving to a larger cluster)
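
For reference, a sketch of where those two settings live (file paths and the 32K figure follow the common HBase troubleshooting advice of the time; adjust for your install):

```shell
# Check the per-process open-file limit for the user running the
# datanode/regionserver; HBase wants this well above the default 1024.
ulimit -n

# The xcievers bump goes in hdfs-site.xml on every datanode; printed
# here as a sketch of the property block:
cat <<'EOF'
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>2048</value>
</property>
EOF
```

Both changes require restarting the datanodes (and re-login for the ulimit) to take effect.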
>>>
>>> As for recovering from crashes, I haven't had much luck.
>>> I'm only running a 3 server cluster so that may be an issue,
>>> but when one server goes down, it doesn't seem to be too easy
>>> to recover the HBase table data after getting everything restarted again.
>>> I've usually had to wipe HDFS and start from scratch.
>>>
>>> On Wed, Feb 17, 2010 at 12:59 PM, Bluemetrix Development
>>> <bmdevelopment@gmail.com> wrote:
>>>> Hi, Thanks for the suggestions. I'll make note of this.
>>>> (I've decided to reinsert, as with time constraints it is probably
>>>> quicker than trying to debug and recover.)
>>>> So, I guess I am more concerned about trying to prevent this from
>>>> happening again.
>>>> Is it possible that a shell count caused enough load to crash hbase?
>>>> Or that nodes becoming unavailable due to heavy network load could
>>>> cause data corruption?
>>>>
>>>> On Wed, Feb 17, 2010 at 12:42 PM, Michael Segel
>>>> <michael_segel@hotmail.com> wrote:
>>>>>
>>>>> Try this...
>>>>>
>>>>> 1 run hadoop fsck /
>>>>> 2 shut down hbase
>>>>> 3 mv /hbase to /hbase.old
>>>>> 4 restart hbase (optional, just for a sanity check)
>>>>> 5 copy /hbase.old back to /hbase
>>>>> 6 restart
>>>>>
>>>>> This may not help, but it can't hurt.
>>>>> Depending on the size of your hbase database, it could take a while.
>>>>> On our sandbox, we suffer from zookeeper and hbase failures when there's a heavy load on the
>>>>> network. (Don't ask, the sandbox was just a play area on whatever hardware we could find.)
>>>>> Doing this copy cleaned up a database that wouldn't fully come up. May do the same for you.
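
Spelled out as commands, the steps above look roughly like this (script names and the /hbase rootdir assume a stock 0.20-era install; treat it as a sketch against a live cluster, not a recipe):

```shell
hadoop fsck /                        # 1. check HDFS health first
stop-hbase.sh                        # 2. shut down HBase cleanly
hadoop fs -mv /hbase /hbase.old      # 3. move the HBase root dir aside
start-hbase.sh                       # 4. optional sanity check: HBase comes up empty
stop-hbase.sh
hadoop fs -cp '/hbase.old/*' /hbase  # 5. copy the data back into place
start-hbase.sh                       # 6. restart with the copied data
```

The copy in step 5 is what can take a while on a large database, since it duplicates every HFile inside HDFS.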
>>>>>
>>>>> HTH
>>>>>
>>>>> -Mike
>>>>>
>>>>>
>>>>>> Date: Wed, 17 Feb 2010 10:50:59 -0500
>>>>>> Subject: Re: hbase shell count crashes
>>>>>> From: bmdevelopment@gmail.com
>>>>>> To: hbase-user@hadoop.apache.org
>>>>>>
>>>>>> Hi,
>>>>>> So after a few more attempts and crashes from trying the shell count,
>>>>>> I ran the MR rowcounter and noticed that the row counts were less
>>>>>> than they should have been - even on smaller test tables.
>>>>>> This led me to start looking through the logs and perform a few
>>>>>> compacts on META and restarts of hbase. Unfortunately, now two tables
>>>>>> are entirely missing - no longer show up under the shell list command.
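
For anyone following along, the MR rowcounter mentioned here is the MapReduce job bundled with HBase, which spreads the scan across the cluster instead of running it serially from one shell client. One invocation from the 0.20 era (jar name, output dir, and argument order are assumptions; check the usage message of your own jar):

```shell
hadoop jar hbase-0.20.2.jar rowcounter /tmp/rowcounter-out 'table'
```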
>>>>>>
>>>>>> I'm not entirely sure what to look for in the logs, but I've noticed a
>>>>>> lot of this in the master log:
>>>>>>
>>>>>> 2010-02-16 23:59:25,856 WARN org.apache.hadoop.hbase.master.HMaster:
>>>>>> info:regioninfo is empty for row:
>>>>>> UserData_0209,e834d76faddee14b,1266316478685; has keys: info:server,
>>>>>> info:serverstartcode
>>>>>>
>>>>>> Came across this in the regionserver log:
>>>>>> 2010-02-16 23:58:33,851 WARN
>>>>>> org.apache.hadoop.hbase.regionserver.Store: Skipping
>>>>>> hdfs://upp1.bmeu.com:50001/hbase/.META./1028785192/info/4080287239754005013
>>>>>> because its empty. HBASE-646 DATA LOSS?
>>>>>>
>>>>>> Any ideas if the tables are recoverable? It's not a big deal for me to
>>>>>> re-insert from scratch as this is still in the testing phase,
>>>>>> but I would be curious to find out what has led to these issues in order
>>>>>> to possibly fix them, or at least not repeat them.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Tue, Feb 16, 2010 at 2:43 PM, Bluemetrix Development
>>>>>> <bmdevelopment@gmail.com> wrote:
>>>>>> > Hi, Thanks for the explanation.
>>>>>> >
>>>>>> > Yes, I was able to cat the file from all three of my region servers:
>>>>>> > hadoop fs -cat /hbase/.META./1028785192/info/8254845156484129698 > tmp.out
>>>>>> >
>>>>>> > I have never come across this before, but this is the first time I've
>>>>>> > had 7M rows in the db.
>>>>>> > Is there anything going on that would bog down the network and cause
>>>>>> > this file to be unreachable?
>>>>>> >
>>>>>> > I have 3 servers. The master is running the jobtracker, namenode,
>>>>>> > and hmaster.
>>>>>> > And all 3 are running datanodes, regionservers and zookeeper.
>>>>>> >
>>>>>> > Appreciate the help.
>>>>>> >
>>>>>> > On Tue, Feb 16, 2010 at 2:11 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>>>>> >> This line
>>>>>> >> java.io.IOException: java.io.IOException: Could not obtain block:
>>>>>> >> blk_-6288142015045035704_88516
>>>>>> >> file=/hbase/.META./1028785192/info/8254845156484129698
>>>>>> >>
>>>>>> >> means that the region server wasn't able to fetch a block for the .META.
>>>>>> >> table (the table where all region addresses are stored). Are you able to
>>>>>> >> open that file using the bin/hadoop command line utility?
>>>>>> >>
>>>>>> >> J-D
>>>>>> >>
>>>>>> >> On Tue, Feb 16, 2010 at 11:08 AM, Bluemetrix Development
>>>>>> >> <bmdevelopment@gmail.com> wrote:
>>>>>> >>
>>>>>> >>> Hi,
>>>>>> >>> I'm currently trying to run a count in the hbase shell and it crashes
>>>>>> >>> right towards the end.
>>>>>> >>> This in turn seems to crash hbase, or at least causes the regionservers
>>>>>> >>> to become unavailable.
>>>>>> >>>
>>>>>> >>> Here's the tail end of the count output:
>>>>>> >>> http://pastebin.com/m465346d0
>>>>>> >>>
>>>>>> >>> I'm on version 0.20.2 and running this command:
>>>>>> >>> > count 'table', 1000000
>>>>>> >>>
>>>>>> >>> Anyone with similar issues or ideas on this?
>>>>>> >>> Please let me know if you need further info.
>>>>>> >>> Thanks
>>>>>> >>>
>>>>>> >>
>>>>>> >
>>>>>
>>>>
>>>
>>
>
