hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Nick Dimiduk (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-9675) Root-cause analysis is difficult with AsyncProcess
Date Sat, 28 Sep 2013 00:00:02 GMT
Nick Dimiduk created HBASE-9675:
-----------------------------------

             Summary: Root-cause analysis is difficult with AsyncProcess
                 Key: HBASE-9675
                 URL: https://issues.apache.org/jira/browse/HBASE-9675
             Project: HBase
          Issue Type: Improvement
          Components: Client
    Affects Versions: 0.95.2
            Reporter: Nick Dimiduk


The AsyncProcess swallows up informative exceptions returned from the RS. For instance, diagnosing
the cause of a failed mapreduce task is pretty opaque right now. The client log tells me that
my async requests have been retried and rejected, but not why:

{noformat}
2013-09-27 22:47:03,391 INFO [main] org.apache.hadoop.hbase.client.AsyncProcess: : Waiting
for the global number of running tasks to be equals or less than 0, tasksSent=6184, tasksDone=6180,
currentTasksDone=6180, tableName=IntegrationTestLoadAndVerify
// repeat about 30 times
2013-09-27 22:53:13,886 INFO [main] org.apache.hadoop.hbase.client.AsyncProcess: : Waiting
for the global number of running tasks to be equals or less than 0, tasksSent=12140, tasksDone=12135,
currentTasksDone=12135, tableName=IntegrationTestLoadAndVerify
2013-09-27 22:53:13,890 WARN [htable-pool1-t51] org.apache.hadoop.hbase.client.AsyncProcess:
Attempt #35/35 failed for 4 ops on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
2013-09-27 22:53:14,030 WARN [htable-pool1-t51] org.apache.hadoop.hbase.client.AsyncProcess:
Attempt #35/35 failed for 8 ops on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
2013-09-27 22:53:14,179 WARN [htable-pool1-t51] org.apache.hadoop.hbase.client.AsyncProcess:
Attempt #35/35 failed for 8 ops on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
2013-09-27 22:53:14,251 WARN [htable-pool1-t51] org.apache.hadoop.hbase.client.AsyncProcess:
Attempt #35/35 failed for 8 ops on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
2013-09-27 22:53:14,277 WARN [htable-pool1-t51] org.apache.hadoop.hbase.client.AsyncProcess:
Attempt #35/35 failed for 8 ops on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9.,
hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
2013-09-27 22:53:26,862 INFO [main] org.apache.hadoop.hbase.client.AsyncProcess: : Waiting
for the global number of running tasks to be equals or less than 0, tasksSent=12186, tasksDone=12185,
currentTasksDone=12185, tableName=IntegrationTestLoadAndVerify
2013-09-27 22:53:26,915 INFO [main] org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
Closing zookeeper sessionid=0x14160ad6eb404c3
2013-09-27 22:53:26,920 INFO [main] org.apache.zookeeper.ZooKeeper: Session: 0x14160ad6eb404c3
closed
2013-09-27 22:53:26,920 INFO [main-EventThread] org.apache.zookeeper.ClientCnxn: EventThread
shut down
2013-09-27 22:53:26,936 ERROR [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
as:hrt_qa (auth:SIMPLE) cause:org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:
Failed 36 actions: ConnectException: 4 times, FailedServerException: 32 times, 
2013-09-27 22:53:26,937 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running
child : org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 36 actions:
ConnectException: 4 times, FailedServerException: 32 times, 
	at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:163)
	at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$500(AsyncProcess.java:147)
	at org.apache.hadoop.hbase.client.AsyncProcess.getErrors(AsyncProcess.java:711)
	at org.apache.hadoop.hbase.client.HTable.backgroundFlushCommits(HTable.java:939)
	at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1315)
	at org.apache.hadoop.hbase.test.IntegrationTestLoadAndVerify$LoadMapper.map(IntegrationTestLoadAndVerify.java:237)
	at org.apache.hadoop.hbase.test.IntegrationTestLoadAndVerify$LoadMapper.map(IntegrationTestLoadAndVerify.java:159)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1483)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
{noformat}

Likewise, there's nothing "broken" in the RS logs. It's sending back RegionTooBusyExceptions
but neither RS nor client inform the user of this.

We need to add logging somewhere, probably on the client side, but maybe both, to tell our
owner-operator what's going on. The client log doesn't say much and all you have in the RS
log is a whole bunch of flush and compaction activity. In this case, it would have been nice
to see something like "RegionTooBusyException: above the memstore limit". Even better, include
a recommendation to get around the problem, such as "slow down, tiger, that's quite the write
throughput" or "maybe you want to increase memstore size a bit". Or maybe the right approach
is to advise tweaks to the client settings (retries, sleep periods, buffer size, &c.)?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message