hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: HBase crash, need help getting back up
Date Thu, 09 Sep 2010 05:00:28 GMT
recovered.edits is the name of the file produced when WAL logs are split; one is made per region.
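
If you want to find them and check their permissions yourself, something along these lines should work (the chmod target below is only illustrative; use whatever path the error message actually prints):

  # list anything named recovered.edits under the hbase root, with owner and mode
  hadoop fs -lsr /hbase | grep recovered.edits

  # narrower fix once you know the offending path: open it up
  hadoop fs -chmod -R 755 /hbase/PATH/FROM/THE/ERROR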

Where are you seeing that message?  Does it not have the full path to the
recovered.edits file?

Are you running with HDFS permissions enabled on this cluster?
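
If permission checking turns out to be the culprit and you just need to get back up, you can verify whether it is on (assuming CDH3's usual /etc/hadoop/conf layout and the stock 0.20 dfs.permissions key; treat the path as a guess if your configs live elsewhere):

  # show the dfs.permissions property, if present, in the NameNode's config
  grep -A1 dfs.permissions /etc/hadoop/conf/hdfs-site.xml

Setting that property to false and restarting the NameNode turns enforcement off cluster-wide; a chmod/chown on the offending file is the less drastic fix.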

Why did the regionservers go down?

St.Ack

On Wed, Sep 8, 2010 at 9:54 PM, Matthew LeMieux <mdl@mlogiciels.com> wrote:
> Well, it was short-lived; it only stayed up for a couple of hours, and this time all the region servers crashed, not just one.
>
> Now, after restarting, I've got the master server complaining about not having execute permission on "recovered.edits".  Where is this file?
>
>  Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.security.AccessControlException: Permission denied: user=mlcamus, access=EXECUTE, inode="recovered.edits":mlcamus:supergroup:rw-r--r--
>
> The message has repeated for a half hour, with this showing up in one region server:
>
> 2010-09-09 04:52:34,887 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; -ROOT-,,0
>
> I assume this will get better if I change permissions of some file... which one?
>
> -Matthew
>
>
> On Sep 8, 2010, at 6:21 PM, Matthew LeMieux wrote:
>
>> I tried moving that file to tmp. It appears as though the master is no longer stuck, but clients are still not able to run queries.
>>
>> There aren't any messages passing by in the log files (just the routine messages I see when the server isn't doing anything), but attempts to run queries (e.g., count 'table') resulted in NotServingRegionExceptions.
>>
>> I tried enable 'table', and found that after this command there was a huge amount of activity in the log files, and I was able to run queries again.
>>
>> There was no previous call to disable 'table', but for some reason HBase wasn't bringing tables/regions online.
>>
>> I'm not sure what caused the problem, or whether the actions I took would fix it if it happens again, but I am back up and running for now.
>>
>> FYI,
>>
>> -Matthew
>>
>> On Sep 8, 2010, at 6:00 PM, Matthew LeMieux wrote:
>>
>>> My HBase cluster just crashed. One of the region servers stopped (I don't yet know why). After restarting it, the cluster seemed a bit wobbly, so I decided to shut everything down and restart fresh. I did so (including ZooKeeper and HDFS).
>>>
>>> Upon restart, I'm getting the following message in the master's log file, repeating continuously with the number of ms waited counting up.
>>>
>>> 2010-09-09 00:54:58,406 WARN org.apache.hadoop.hbase.util.FSUtils: Waited 69188ms for lease recovery on hdfs://domU-12-31-39-18-12-05.compute-1.internal:9000/hbase/.logs/domU-12-31-39-0C-38-31.compute-1.internal,60020,1283905848540/10.215.59.191%3A60020.1283905909298:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/domU-12-31-39-0C-38-31.compute-1.internal,60020,1283905848540/10.215.59.191%3A60020.1283905909298 for DFSClient_hb_m_10.104.37.247:60000 on client 10.104.37.247 because current leaseholder is trying to recreate file.
>>>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1068)
>>>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1181)
>>>       at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:422)
>>>       at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>>>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>       at java.lang.reflect.Method.invoke(Method.java:597)
>>>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
>>>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
>>>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
>>>       at java.security.AccessController.doPrivileged(Native Method)
>>>       at javax.security.auth.Subject.doAs(Subject.java:396)
>>>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)
>>>
>>>
>>> The region servers are waiting with this being the final message in their log file:
>>>
>>> 2010-09-09 00:53:49,111 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 10.104.37.247:60000 that we are up
>>>
>>> I've been using this version for a little under a week without incident (http://people.apache.org/~jdcryans/hbase-0.89.20100830-candidate-1/).
>>>
>>> The HDFS comes from CDH3.
>>>
>>> Does anybody have any ideas on what I can do to get back up and running?
>>>
>>> Thank you,
>>>
>>> Matthew
>>>
>>
>
>
