hbase-user mailing list archives

From Jean-Marc Spaggiari <jean-m...@spaggiari.org>
Subject Re: Never ending distributed log split
Date Sun, 02 Jun 2013 17:05:34 GMT
I'm using 0.94.7 since I did not get the chance to deploy the last RC...

I will wait for some more feedback regarding the option (delete or
rename) and most probably will open a JIRA.

Regarding recovered.edits, I don't have this file anymore, but I just
found another one which is blocking some other splits:


hadoop@node3:~/hadoop-1.0.3$ bin/hadoop fs -ls
/hbase/work_proposed/4a45c6fed3578cd6f547c6fc58bad221/
Found 5 items
drwxr-xr-x   - hbase supergroup          0 2013-06-01 16:56
/hbase/work_proposed/4a45c6fed3578cd6f547c6fc58bad221/.oldlogs
-rw-r--r--   3 hbase supergroup        855 2013-06-01 16:56
/hbase/work_proposed/4a45c6fed3578cd6f547c6fc58bad221/.regioninfo
drwxr-xr-x   - hbase supergroup          0 2013-06-01 17:48
/hbase/work_proposed/4a45c6fed3578cd6f547c6fc58bad221/@
drwxr-xr-x   - hbase supergroup          0 2013-06-01 16:56
/hbase/work_proposed/4a45c6fed3578cd6f547c6fc58bad221/a
-rw-r--r--   3 hbase supergroup       5375 2013-06-01 15:43
/hbase/work_proposed/4a45c6fed3578cd6f547c6fc58bad221/recovered.edits

And the date/time seems to match when I faced the 2 power outages yesterday...
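
If we go the rename route (to keep the edits around in case we ever need
to replay them), it could be as simple as this with the hadoop CLI (the
.orig suffix is just my hypothetical naming):

bin/hadoop fs -mv \
    /hbase/work_proposed/4a45c6fed3578cd6f547c6fc58bad221/recovered.edits \
    /hbase/work_proposed/4a45c6fed3578cd6f547c6fc58bad221/recovered.edits.orig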

JM

2013/6/2 Ted Yu <yuzhihong@gmail.com>:
> Can you search for 1d44b0630ed7785106a87a2bd4993551/recovered.edits to see
> when it was created?
> Namenode log would be a good place to start with.
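>
> For example, something like this (assuming the stock Hadoop 1.x log
> layout under $HADOOP_HOME/logs; adjust the file name to your setup):
>
>   grep '1d44b0630ed7785106a87a2bd4993551/recovered.edits' \
>       $HADOOP_HOME/logs/hadoop-*-namenode-*.log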
>
> bq. we can also rename it so if really required we can replay it later?
>
> The above is a better way of handling the situation.
>
> What version of HBase are you using?
>
> Cheers
>
> On Sun, Jun 2, 2013 at 8:09 AM, Jean-Marc Spaggiari
> <jean-marc@spaggiari.org> wrote:
>
>> My HBase was in a bad state recently. HBCK did a slow but good job and
>> everything is now almost stable. However, I still have one log split
>> which is not working. Every minute, the SplitLogManager tries to split
>> the log, fails, and retries. It's always the same file. It's assigned to
>> different nodes, but they all fail, and it starts again and again.
>>
>>
>> 2013-06-02 10:44:20,298 DEBUG
>> org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch of
>> logs to split
>> 2013-06-02 10:44:20,298 INFO
>> org.apache.hadoop.hbase.master.SplitLogManager: started splitting logs
>> in [hdfs://node3:9000/hbase/.logs/node7,60020,1370118961527-splitting]
>> 2013-06-02 10:44:20,298 DEBUG
>> org.apache.hadoop.hbase.master.SplitLogManager: wait for status of
>> task
>> /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614
>> to change to DELETED
>> 2013-06-02 10:44:20,315 DEBUG
>> org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback:
>> deleted
>> /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614
>> 2013-06-02 10:44:20,329 DEBUG
>> org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task
>> at znode
>> /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614
>> 2013-06-02 10:44:20,341 DEBUG
>> org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task
>> at znode
>> /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666
>> 2013-06-02 10:44:20,344 DEBUG
>> org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired
>>
>> /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614
>> ver = 0
>> 2013-06-02 10:44:20,346 DEBUG
>> org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired
>>
>> /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666
>> ver = 0
>> 2013-06-02 10:44:20,384 INFO
>> org.apache.hadoop.hbase.master.SplitLogManager: task
>>
>> /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614
>> acquired by node1,60020,1370136472290
>> 2013-06-02 10:44:20,410 INFO
>> org.apache.hadoop.hbase.master.SplitLogManager: task
>>
>> /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666
>> acquired by node4,60020,1370136467255
>> 2013-06-02 10:44:20,497 TRACE
>> org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit
>> of last_update = 1370184260384 last_version = 1 cur_worker_name =
>> node1,60020,1370136472290 status = in_progress incarnation = 0
>> resubmits = 0 batch = installed = 2 done = 0 error = 0  because the
>> server node1,60020,1370136472290 is not marked as dead, we waited for
>> 113 while the timeout is 300000
>> 2013-06-02 10:44:20,497 TRACE
>> org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit
>> of last_update = 1370184260410 last_version = 1 cur_worker_name =
>> node4,60020,1370136467255 status = in_progress incarnation = 0
>> resubmits = 0 batch = installed = 2 done = 0 error = 0  because the
>> server node4,60020,1370136467255 is not marked as dead, we waited for
>> 87 while the timeout is 300000
>> 2013-06-02 10:44:20,497 DEBUG
>> org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2
>> unassigned = 0
>> 2013-06-02 10:44:21,497 TRACE
>> org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit
>> of last_update = 1370184261377 last_version = 2 cur_worker_name =
>> node1,60020,1370136472290 status = in_progress incarnation = 0
>> resubmits = 0 batch = installed = 2 done = 0 error = 0  because the
>> server node1,60020,1370136472290 is not marked as dead, we waited for
>> 120 while the timeout is 300000
>> 2013-06-02 10:44:21,497 TRACE
>> org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit
>> of last_update = 1370184261410 last_version = 2 cur_worker_name =
>> node4,60020,1370136467255 status = in_progress incarnation = 0
>> resubmits = 0 batch = installed = 2 done = 0 error = 0  because the
>> server node4,60020,1370136467255 is not marked as dead, we waited for
>> 87 while the timeout is 300000
>> 2013-06-02 10:44:21,497 DEBUG
>> org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2
>> unassigned = 0
>> 2013-06-02 10:44:21,708 DEBUG
>> org.apache.hadoop.hbase.master.ServerManager: REPORT: Server
>> node7,60020,1370136467731 came back up, removed it from the dead
>> servers list
>> 2013-06-02 10:44:22,497 TRACE
>> org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit
>> of last_update = 1370184261377 last_version = 2 cur_worker_name =
>> node1,60020,1370136472290 status = in_progress incarnation = 0
>> resubmits = 0 batch = installed = 2 done = 0 error = 0  because the
>> server node1,60020,1370136472290 is not marked as dead, we waited for
>> 1120 while the timeout is 300000
>> 2013-06-02 10:44:22,497 TRACE
>> org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit
>> of last_update = 1370184261410 last_version = 2 cur_worker_name =
>> node4,60020,1370136467255 status = in_progress incarnation = 0
>> resubmits = 0 batch = installed = 2 done = 0 error = 0  because the
>> server node4,60020,1370136467255 is not marked as dead, we waited for
>> 1087 while the timeout is 300000
>> 2013-06-02 10:44:22,497 DEBUG
>> org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2
>> unassigned = 0
>> 2013-06-02 10:44:23,497 TRACE
>> org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit
>> of last_update = 1370184261377 last_version = 2 cur_worker_name =
>> node1,60020,1370136472290 status = in_progress incarnation = 0
>> resubmits = 0 batch = installed = 2 done = 0 error = 0  because the
>> server node1,60020,1370136472290 is not marked as dead, we waited for
>> 2120 while the timeout is 300000
>> 2013-06-02 10:44:23,497 TRACE
>> org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit
>> of last_update = 1370184261410 last_version = 2 cur_worker_name =
>> node4,60020,1370136467255 status = in_progress incarnation = 0
>> resubmits = 0 batch = installed = 2 done = 0 error = 0  because the
>> server node4,60020,1370136467255 is not marked as dead, we waited for
>> 2087 while the timeout is 300000
>> 2013-06-02 10:44:23,497 DEBUG
>> org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2
>> unassigned = 0
>> 2013-06-02 10:44:24,498 TRACE
>> org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit
>> of last_update = 1370184261377 last_version = 2 cur_worker_name =
>> node1,60020,1370136472290 status = in_progress incarnation = 0
>> resubmits = 0 batch = installed = 2 done = 0 error = 0  because the
>> server node1,60020,1370136472290 is not marked as dead, we waited for
>> 3121 while the timeout is 300000
>> 2013-06-02 10:44:24,498 TRACE
>> org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit
>> of last_update = 1370184261410 last_version = 2 cur_worker_name =
>> node4,60020,1370136467255 status = in_progress incarnation = 0
>> resubmits = 0 batch = installed = 2 done = 0 error = 0  because the
>> server node4,60020,1370136467255 is not marked as dead, we waited for
>> 3088 while the timeout is 300000
>> 2013-06-02 10:44:24,498 DEBUG
>> org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2
>> unassigned = 0
>> 2013-06-02 10:44:25,498 TRACE
>> org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit
>> of last_update = 1370184261377 last_version = 2 cur_worker_name =
>> node1,60020,1370136472290 status = in_progress incarnation = 0
>> resubmits = 0 batch = installed = 2 done = 0 error = 0  because the
>> server node1,60020,1370136472290 is not marked as dead, we waited for
>> 4121 while the timeout is 300000
>> 2013-06-02 10:44:25,498 TRACE
>> org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit
>> of last_update = 1370184261410 last_version = 2 cur_worker_name =
>> node4,60020,1370136467255 status = in_progress incarnation = 0
>> resubmits = 0 batch = installed = 2 done = 0 error = 0  because the
>> server node4,60020,1370136467255 is not marked as dead, we waited for
>> 4088 while the timeout is 300000
>> 2013-06-02 10:44:25,498 DEBUG
>> org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2
>> unassigned = 0
>> 2013-06-02 10:44:26,498 TRACE
>> org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit
>> of last_update = 1370184261377 last_version = 2 cur_worker_name =
>> node1,60020,1370136472290 status = in_progress incarnation = 0
>> resubmits = 0 batch = installed = 2 done = 0 error = 0  because the
>> server node1,60020,1370136472290 is not marked as dead, we waited for
>> 5121 while the timeout is 300000
>> 2013-06-02 10:44:26,498 TRACE
>> org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit
>> of last_update = 1370184261410 last_version = 2 cur_worker_name =
>> node4,60020,1370136467255 status = in_progress incarnation = 0
>> resubmits = 0 batch = installed = 2 done = 0 error = 0  because the
>> server node4,60020,1370136467255 is not marked as dead, we waited for
>> 5088 while the timeout is 300000
>> 2013-06-02 10:44:26,498 DEBUG
>> org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2
>> unassigned = 0
>> 2013-06-02 10:45:20,551 INFO
>> org.apache.hadoop.hbase.master.SplitLogManager: task
>>
>> /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666
>> entered state err node4,60020,1370136467255
>> 2013-06-02 10:45:20,551 WARN
>> org.apache.hadoop.hbase.master.SplitLogManager: Error splitting
>>
>> /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666
>> 2013-06-02 10:45:20,990 INFO
>> org.apache.hadoop.hbase.master.SplitLogManager: task
>>
>> /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614
>> entered state err node1,60020,1370136472290
>> 2013-06-02 10:45:20,990 WARN
>> org.apache.hadoop.hbase.master.SplitLogManager: Error splitting
>>
>> /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614
>> 2013-06-02 10:45:20,990 WARN
>> org.apache.hadoop.hbase.master.SplitLogManager: error while splitting
>> logs in [hdfs://node3:9000/hbase/.logs/node7,60020,1370118961527-splitting]
>> installed = 2 but only 0 done
>> 2013-06-02 10:45:20,990 ERROR
>> org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while
>> processing event M_META_SERVER_SHUTDOWN
>> java.io.IOException: failed log splitting for
>> node7,60020,1370118961527, will retry
>>         at
>> org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:69)
>>         at
>> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
>>         at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>         at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>         at java.lang.Thread.run(Thread.java:722)
>> Caused by: java.io.IOException: error or interrupted while splitting
>> logs in [hdfs://node3:9000/hbase/.logs/node7,60020,1370118961527-splitting]
>> Task = installed = 2 done = 0 error = 2
>>         at
>> org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:299)
>>         at
>> org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:371)
>>         at
>> org.apache.hadoop.hbase.master.MasterFileSystem.splitAllLogs(MasterFileSystem.java:341)
>>         at
>> org.apache.hadoop.hbase.master.MasterFileSystem.splitAllLogs(MasterFileSystem.java:288)
>>         at
>> org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:63)
>>         ... 4 more
>> 2013-06-02 10:45:20,990 INFO
>> org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler:
>> Splitting all logs for node7,60020,1370118961527
>>
>>
>>
>> I tried to look at the files:
>> hadoop@node3:~/hadoop-1.0.3$ bin/hadoop fs -ls
>> /hbase/.logs/node7,60020,1370118961527-splitting/
>> Found 2 items
>> -rw-r--r--   3 hbase supergroup   41578360 2013-06-01 17:36
>>
>> /hbase/.logs/node7,60020,1370118961527-splitting/node7%2C60020%2C1370118961527.1370122562614
>> -rw-r--r--   3 hbase supergroup   18153457 2013-06-01 19:36
>>
>> /hbase/.logs/node7,60020,1370118961527-splitting/node7%2C60020%2C1370118961527.1370129764666
>>
>> I also tried to get the files from HDFS:
>> hadoop@node3:~/hadoop-1.0.3$ bin/hadoop fs -get
>>
>> /hbase/.logs/node7,60020,1370118961527-splitting/node7%2C60020%2C1370118961527.1370122562614
>> .
>> hadoop@node3:~/hadoop-1.0.3$ bin/hadoop fs -get
>>
>> /hbase/.logs/node7,60020,1370118961527-splitting/node7%2C60020%2C1370118961527.1370129764666
>> .
>> hadoop@node3:~/hadoop-1.0.3$ ll
>> -rw-r--r--  1 hadoop hadoop 41578360 jun  2 10:13
>> node7%2C60020%2C1370118961527.1370122562614
>> -rw-r--r--  1 hadoop hadoop 18153457 jun  2 10:13
>> node7%2C60020%2C1370118961527.1370129764666
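>>
>> If you want to peek inside these files, the HLog class can also dump
>> them; in 0.94 there is a --dump option, if I remember correctly (run
>> from the HBase install directory):
>>
>> bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLog --dump \
>>     /hbase/.logs/node7,60020,1370118961527-splitting/node7%2C60020%2C1370118961527.1370122562614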
>>
>>
>> I looked into the code, and the issue is that "installed = 2 but
>> only 0 done" (batch.done != batch.installed), which means we are coming
>> back from "waitForSplittingCompletion" even though the split is not
>> really completed. The "Error splitting" warning is logged from
>> "SplitLogManager.setDone" but does not give much information to track
>> the root cause.
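>>
>> Roughly, the relevant logic in SplitLogManager.splitLogDistributed()
>> looks like this (paraphrased from the 0.94 source, not verbatim):
>>
>> // After the wait, an incomplete batch is treated as a failure, but
>> // only the aggregate counts are logged, not which task failed or why.
>> waitForSplittingCompletion(batch, status);
>> if (batch.done != batch.installed) {
>>   batch.isDead = true;
>>   LOG.warn("error while splitting logs in " + logDirs + " installed = "
>>       + batch.installed + " but only " + batch.done + " done");
>>   throw new IOException("error or interrupted while splitting logs in "
>>       + logDirs + " Task = " + batch);
>> }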
>>
>> So I looked into the regionservers, and found this:
>> Caused by: org.apache.hadoop.ipc.RemoteException:
>> org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a
>> directory:
>> /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/recovered.edits
>>         at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.verifyParentDir(FSNamesystem.java:1170)
>>         at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1241)
>>         at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1188)
>>         at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:628)
>>         at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>>         at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>         at java.lang.reflect.Method.invoke(Method.java:601)
>>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:415)
>>         at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
>>
>>         at org.apache.hadoop.ipc.Client.call(Client.java:1070)
>>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>>         at $Proxy8.create(Unknown Source)
>>         at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>>         at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>         at java.lang.reflect.Method.invoke(Method.java:601)
>>         at
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>>         at
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>>         at $Proxy8.create(Unknown Source)
>>         at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:3248)
>>         ... 19 more
>>
>> So I looked in HDFS and found this:
>> hadoop@node3:~/hadoop-1.0.3$ bin/hadoop fs -ls
>> /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/
>> Found 5 items
>> drwxr-xr-x   - hbase supergroup          0 2013-06-01 16:56
>> /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/.oldlogs
>> -rw-r--r--   3 hbase supergroup        841 2013-06-01 16:56
>> /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/.regioninfo
>> drwxr-xr-x   - hbase supergroup          0 2013-06-01 16:56
>> /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/@
>> drwxr-xr-x   - hbase supergroup          0 2013-06-01 16:56
>> /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/a
>> -rw-r--r--   3 hbase supergroup       5159 2013-06-01 15:43
>> /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/recovered.edits
>>
>>
>> So, 2 things again here.
>>
>> 1) Should the region server send more information about the failure to
>> the master, so the master can display the failure cause in its logs?
>> 2) recovered.edits should not have been there. The failure is
>> because the SplitLogWorker tries to create a recovered.edits file but
>> it already exists.
>>
>> This happened because my cluster failed twice in a row yesterday, and
>> most probably the second failure occurred while the log split was still
>> in progress. Therefore, this file is just a leftover.
>>
>> Now, what should we do? I can manually delete this file and the
>> process will take over and everything will be fine, but should we
>> update the SplitLogWorker or HLogSplitter to validate this file is not
>> there first and delete it if required? What's the risk of deleting
>> this file automatically? There should not be any other process writing
>> to it, right? Worst case, we can also rename it so if really required we
>> can replay it later? (See the sketch below.)
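>>
>> A minimal sketch of that check in the splitter, assuming fs and
>> regionDir are in scope (hypothetical code, not what HBase does today):
>>
>> Path edits = new Path(regionDir, "recovered.edits");
>> if (fs.exists(edits) && !fs.getFileStatus(edits).isDir()) {
>>   // A plain file left over from an interrupted split: move it aside
>>   // so it can still be replayed manually later if really required.
>>   Path aside = new Path(regionDir,
>>       "recovered.edits." + System.currentTimeMillis());
>>   if (!fs.rename(edits, aside)) {
>>     throw new IOException("Failed to move aside " + edits);
>>   }
>> }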
>>
>> For my own needs, I removed the file manually and everything is
>> running now (re-assigning regions, etc.).
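>>
>> In case anyone else hits this, the manual fix was simply:
>>
>> bin/hadoop fs -rm \
>>     /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/recovered.edits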
>>
>> What's your opinion on that? Should we remove the file? Or should we
>> not, because it's too dangerous?
>>
>> JM
>>
