hbase-user mailing list archives

From Bill Graham <billgra...@gmail.com>
Subject Re: Region is not online: -ROOT-,,0
Date Wed, 26 Jan 2011 05:29:02 GMT
Thanks for the comments. Attached is the log file from the master
after the restart. The last error message was repeated every second.

See comments below.

On Tue, Jan 25, 2011 at 7:20 PM, Stack <stack@duboce.net> wrote:
> On Tue, Jan 25, 2011 at 3:27 PM, Bill Graham <billgraham@gmail.com> wrote:
>> Hi,
>>
>> A developer on our team created a table today and something failed and
>> we fell back into the dire scenario we were in earlier this week. When
>> I got on the scene 2 of our 4 region servers had crashed. When I brought them
>> back up, they wouldn't come online and the master was scrolling
>> messages like those in
>> https://issues.apache.org/jira/browse/HBASE-3406.
>>
>> I'm running 0.90.0-rc1 and CDH3b2 with append enabled.
>>
> Can you move to 0.90.0 release?

Will do. Was planning on doing this soon, but we'll prioritize this.

>
>
>> I shut down the entire cluster + zookeeper and restarted it. Now, I'm
>> getting two types of errors and the cluster won't come up:
>>
>> - On one of the regionservers:
>> 2011-01-25 15:12:00,287 DEBUG
>> org.apache.hadoop.hbase.regionserver.HRegionServer:
>> NotServingRegionException; Region is not online: -ROOT-,,0
>>
>
> Can I see master log around startup please?

See attached.

>
>
>> - And on the master this scrolls every few seconds. The log file
>> referenced is empty in HDFS.
>> 2011-01-25 15:12:26,897 WARN org.apache.hadoop.hbase.util.FSUtils:
>> Waited 275444ms for lease recovery on
>> hdfs://mymaster.com:9000/hbase-app/hbase/.logs/hadoop-wkr-r14-n1.mydomain.com,60020,1295900457489/hadoop-wkr-r14-n1.mydomain.com%3A60020.1295907659592:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
>> failed to create file
>> /hbase-app/hbase/.logs/hadoop-wkr-r14-n1.mydomain.com,60020,1295900457489/hadoop-wkr-r14-n1.mydomain.com%3A60020.1295907659592
>> for DFSClient_hb_m_mymaster.com:60000_1295996847777 on client
>> 10.14.98.90, because this file is already being created by NN_Recovery
>> on 10.10.220.15
>>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1093)
>>
>
>
> As Ryan says, this would seem to indicate the owning RegionServer is
> still up.  Is that the case?  Did the restart of the cluster for sure
> put down all RSs?

Yes, all RSs started up after the restart, but 2 wouldn't come online,
and one of them was logging the errors about -ROOT-.

>
>
>>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1181)
>>        at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:422)
>>        at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
>>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
>>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
>>        at java.security.AccessController.doPrivileged(Native Method)
>>        at javax.security.auth.Subject.doAs(Subject.java:396)
>>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)
>>
>> Any suggestions for how to get the -ROOT- back? I can see it in HDFS.
>>
>
>
> Root will come back once master moves past log file splitting.

Yes, once I removed all logs from HDFS, the master came up and -ROOT-
was found. The splitting was hung on a file, hence the infinite loop
with AlreadyBeingCreatedExceptions.
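
For anyone hitting the same loop, here is a minimal sketch (plain Python,
not part of HBase) for pulling the stuck log file out of the master log, so
you can see which file the splitting is hung on and how long the master has
waited. The regular expression is inferred only from the single WARN line
quoted above, and the name parse_lease_warning is hypothetical.

```python
import re

# Pattern for the FSUtils warning quoted in this thread. The layout is an
# assumption based on that one log line, not on the HBase source.
LEASE_WARN = re.compile(
    r"WARN org\.apache\.hadoop\.hbase\.util\.FSUtils:\s+"
    r"Waited (?P<ms>\d+)ms for lease recovery on\s+"
    r"(?P<path>hdfs://\S+?):org\.apache\.hadoop"
)


def parse_lease_warning(line):
    """Return (waited_ms, log_path, server_dir), or None if the line does not match."""
    m = LEASE_WARN.search(line)
    if m is None:
        return None
    path = m.group("path")
    # The directory directly under .logs is named host,port,startcode.
    server_dir = path.rsplit("/", 2)[-2]
    # The path is returned as logged, with %3A left encoded.
    return int(m.group("ms")), path, server_dir
```

Feed it each line of the master log (with the wrapped warning joined back
into one line). Seeing the same path again and again with a growing wait
time is the infinite loop described above.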

>
> St.Ack
>
