hbase-user mailing list archives

From Matthew LeMieux <...@mlogiciels.com>
Subject Re: HBase crash, need help getting back up
Date Thu, 09 Sep 2010 17:24:53 GMT
Replies below

On Sep 8, 2010, at 10:00 PM, Stack wrote:

> recovered.edits is the name of the file produced when WAL logs are
> split; one is made per region.
> Where are you seeing that message?  Does it not have the full path to
> the recovered.edits file?

In the master log file.  Full path was not there. 

> You are running w/ perms enabled on this cluster?

It was enabled and it has now been turned off.  Will that fix the problem of a file not
being executable?  In any case, that problem is intermittent: it usually shows up only after
a partial restart (i.e., a region server goes down and I restart it), but does not show up
after a complete restart of the whole cluster.
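
For what it's worth, next time it shows up, something like the following might help confirm
and clear the permissions (the region path below is illustrative, not one from my logs):

    # find the offending recovered.edits entries under the HBase root
    hadoop fs -lsr /hbase | grep recovered.edits

    # add the missing execute bit on one of them (path is made up for illustration)
    hadoop fs -chmod 755 /hbase/mytable/1028785192/recovered.edits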

> Why did the regionservers go down?

I tracked the reason for the most recent "crash" down to "too many open files" for the user
that runs hadoop.  Very odd situation: both the user running hbase and the user running hadoop
were in the /etc/security/limits.conf file with a limit of 50000, but the change only took
effect for one of them.  The hadoop account reported 1024 to 'ulimit -n', while the hbase
account reported 50000.  I did three things before rebooting the machine, and I'm not sure
which of them were needed to fix it (the combined configuration is sketched after this list):
    *  I added "session required        pam_limits.so" to /etc/pam.d/common-session (pam_limits.so
was already being referenced in several other files in /etc/pam.d, but was missing from this one)
    *  I gave hadoop a home directory that exists (by editing the /etc/passwd file)
    *  I added "*                hard    nofile          50000" to the /etc/security/limits.conf
file (in addition to the two lines for each user that were already there)
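
For reference, this is roughly what the combined configuration looks like now (my actual
per-user lines may differ slightly, and "hbase" here stands in for whatever account runs HBase):

    # /etc/security/limits.conf
    hadoop           soft    nofile          50000
    hadoop           hard    nofile          50000
    hbase            soft    nofile          50000
    hbase            hard    nofile          50000
    *                hard    nofile          50000

    # /etc/pam.d/common-session -- this line was the missing piece
    session required        pam_limits.so

    # verify per user after logging in again
    su - hadoop -c 'ulimit -n'
    su - hbase -c 'ulimit -n'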

(on Ubuntu Karmic, running CDH version: 0.20.2+320-1~karmic-cdh3b2)

The CDH distribution doesn't appear to have the hadoop home directory situation figured out
(they put it in a directory that gets deleted on reboots).  I change it routinely, but apparently
missed this machine.  
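If anyone else hits this, the fix is quick (assuming /home/hadoop is an acceptable location;
adjust to taste):

    # give the hadoop user a home directory that survives reboots
    sudo mkdir -p /home/hadoop
    sudo chown hadoop:hadoop /home/hadoop
    sudo usermod -d /home/hadoop hadoop
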

This is likely to fix quite a few problems, but I think there is still a mystery to be solved.
I'll have to wait until it happens again to get a clean log of the event.
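
In the meantime I can at least keep an eye on the descriptor counts so I catch it before it
falls over again; something along these lines (the DataNode pid lookup is approximate):

    # open files for everything running as the hadoop user
    lsof -u hadoop | wc -l

    # or per process, e.g. the DataNode
    ls /proc/$(pgrep -f DataNode | head -1)/fd | wc -l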



> On Wed, Sep 8, 2010 at 9:54 PM, Matthew LeMieux <mdl@mlogiciels.com> wrote:
>> Well, it was short lived, it only stayed up for a couple hours, all region servers
>> crashed this time, not just one.
>> Now, after restarting, I've got the master server complaining about not having executable
>> permissions on "recovered.edits".  Where is this file?
>>  Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.security.AccessControlException:
>> Permission denied: user=mlcamus, access=EXECUTE, inode="recovered.edits":mlcamus:supergroup:rw-r--r--
>> The message has repeated for a half hour, with this showing up in one region server:
>> 2010-09-09 04:52:34,887 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer:
>> NotServingRegionException; -ROOT-,,0
>> I assume this will get better if I change permissions of some file... which one?
>> -Matthew
>> On Sep 8, 2010, at 6:21 PM, Matthew LeMieux wrote:
>>> I tried moving that file to tmp.  It appears as though the master is no longer
>>> stuck, but clients are still not able to run queries.
>>> There aren't any messages passing by in the log files (just routine messages
>>> I see when the server isn't doing anything), but attempts to run queries resulted in
>>> not serving region exceptions (i.e., count 'table').
>>> I tried enable 'table', and found that after this command there was a huge amount
>>> of activity in the log files, and I was able to run queries again.
>>> There was no previous call to disable 'table', but for some reason HBase wasn't
>>> bringing tables/regions online.
>>> I'm not sure what caused the problem or even if the actions I took will fix it
>>> again in the future, but I am back up and running for now.
>>> FYI,
>>> -Matthew
>>> On Sep 8, 2010, at 6:00 PM, Matthew LeMieux wrote:
>>>> My HBase cluster just crashed.  One of the region servers stopped (I do not
>>>> yet know why).  After restarting it, the cluster seemed a bit wobbly, so I decided
>>>> to shut down everything and restart fresh.  I did so (including ZooKeeper and HDFS).
>>>> Upon restart, I'm getting the following message in the master's log file,
>>>> repeating continuously with the number of ms waited counting up:
>>>> 2010-09-09 00:54:58,406 WARN org.apache.hadoop.hbase.util.FSUtils: Waited
>>>> 69188ms for lease recovery on hdfs://domU-12-31-39-18-12-05.compute-1.internal:9000/hbase/.logs/domU-12-31-39-0C-38-31.compute-1.internal,60020,1283905848540/
>>>> failed to create file /hbase/.logs/domU-12-31-39-0C-38-31.compute-1.internal,60020,1283905848540/
>>>> for DFSClient_hb_m_10.104.37.247:60000 on client because current leaseholder
>>>> is trying to recreate file.
>>>>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1068)
>>>>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1181)
>>>>       at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:422)
>>>>       at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>>>>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>       at java.lang.reflect.Method.invoke(Method.java:597)
>>>>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
>>>>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
>>>>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
>>>>       at java.security.AccessController.doPrivileged(Native Method)
>>>>       at javax.security.auth.Subject.doAs(Subject.java:396)
>>>>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)
>>>> The region servers are waiting with this being the final message in their
>>>> log file:
>>>> 2010-09-09 00:53:49,111 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
>>>> Telling master at that we are up
>>>> I've been using this version for a little under a week without incident
>>>> (http://people.apache.org/~jdcryans/hbase-0.89.20100830-candidate-1/).
>>>> The HDFS comes from CDH3.
>>>> Does anybody have any ideas on what I can do to get back up and running?
>>>> Thank you,
>>>> Matthew
