hbase-issues mailing list archives

From "Nicolas Liochon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-9675) Root-cause analysis is difficult with AsyncProcess
Date Wed, 16 Oct 2013 21:26:50 GMT

    [ https://issues.apache.org/jira/browse/HBASE-9675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797274#comment-13797274 ]

Nicolas Liochon commented on HBASE-9675:
----------------------------------------

Hum.
- We've got a regression here: we no longer print the servers the way we used to (we build the string but never return it).
- We've got some detail that is only enabled in trace mode. I don't think that's very useful: by the time you want it, trace wasn't enabled.
 - Something we could do (rough sketch below):
     - start storing the exceptions after a given number of failures, say 15;
     - then, if it ultimately fails, return the full path for one row that failed to the end. Something like [#15, hbaseserver13:60030,47894798: RegionMoved to hbaseserver14; #16, hbaseserver14:60030,7648764: NotServingRegion; #17 ....]
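
A rough sketch of the idea, just to make the proposal concrete (hypothetical names, not the actual AsyncProcess code):

{noformat}
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only -- not the actual AsyncProcess code.
// Keep per-attempt detail for a row only once it has failed enough times,
// then dump that row's full path when retries are exhausted.
class RowFailureHistory {
  static final int HISTORY_THRESHOLD = 15;  // start storing after this many failures

  private final List<String> steps = new ArrayList<>();

  // Called on every failed attempt for the tracked row.
  void onFailure(int attempt, String serverName, Throwable error) {
    if (attempt >= HISTORY_THRESHOLD) {
      steps.add("#" + attempt + ", " + serverName + ": " + error.getClass().getSimpleName());
    }
  }

  // Called when we give up, e.g. "[#15, hbaseserver13:60030,47894798: RegionMovedException; #16, ...]"
  String summary() {
    return "[" + String.join("; ", steps) + "]";
  }
}
{noformat}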

What do you think?


> Root-cause analysis is difficult with AsyncProcess
> --------------------------------------------------
>
>                 Key: HBASE-9675
>                 URL: https://issues.apache.org/jira/browse/HBASE-9675
>             Project: HBase
>          Issue Type: Improvement
>          Components: Client
>    Affects Versions: 0.95.2
>            Reporter: Nick Dimiduk
>            Priority: Critical
>
> The AsyncProcess swallows up informative exceptions returned from the RS. For instance, diagnosing the cause of a failed mapreduce task is pretty opaque right now. The client log tells me that my async requests have been retried and rejected, but not why:
> {noformat}
> 2013-09-27 22:47:03,391 INFO [main] org.apache.hadoop.hbase.client.AsyncProcess: : Waiting for the global number of running tasks to be equals or less than 0, tasksSent=6184, tasksDone=6180, currentTasksDone=6180, tableName=IntegrationTestLoadAndVerify
> // repeat about 30 times
> 2013-09-27 22:53:13,886 INFO [main] org.apache.hadoop.hbase.client.AsyncProcess: : Waiting for the global number of running tasks to be equals or less than 0, tasksSent=12140, tasksDone=12135, currentTasksDone=12135, tableName=IntegrationTestLoadAndVerify
> 2013-09-27 22:53:13,890 WARN [htable-pool1-t51] org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 4 ops on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9., hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
> 2013-09-27 22:53:14,030 WARN [htable-pool1-t51] org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9., hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
> 2013-09-27 22:53:14,179 WARN [htable-pool1-t51] org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9., hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
> 2013-09-27 22:53:14,251 WARN [htable-pool1-t51] org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9., hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
> 2013-09-27 22:53:14,277 WARN [htable-pool1-t51] org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9., hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
> 2013-09-27 22:53:26,862 INFO [main] org.apache.hadoop.hbase.client.AsyncProcess: : Waiting for the global number of running tasks to be equals or less than 0, tasksSent=12186, tasksDone=12185, currentTasksDone=12185, tableName=IntegrationTestLoadAndVerify
> 2013-09-27 22:53:26,915 INFO [main] org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x14160ad6eb404c3
> 2013-09-27 22:53:26,920 INFO [main] org.apache.zookeeper.ZooKeeper: Session: 0x14160ad6eb404c3 closed
> 2013-09-27 22:53:26,920 INFO [main-EventThread] org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2013-09-27 22:53:26,936 ERROR [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hrt_qa (auth:SIMPLE) cause:org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 36 actions: ConnectException: 4 times, FailedServerException: 32 times,
> 2013-09-27 22:53:26,937 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 36 actions: ConnectException: 4 times, FailedServerException: 32 times,
> 	at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:163)
> 	at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$500(AsyncProcess.java:147)
> 	at org.apache.hadoop.hbase.client.AsyncProcess.getErrors(AsyncProcess.java:711)
> 	at org.apache.hadoop.hbase.client.HTable.backgroundFlushCommits(HTable.java:939)
> 	at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1315)
> 	at org.apache.hadoop.hbase.test.IntegrationTestLoadAndVerify$LoadMapper.map(IntegrationTestLoadAndVerify.java:237)
> 	at org.apache.hadoop.hbase.test.IntegrationTestLoadAndVerify$LoadMapper.map(IntegrationTestLoadAndVerify.java:159)
> 	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
> 	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1483)
> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
> {noformat}
> Likewise, there's nothing "broken" in the RS logs. It's sending back RegionTooBusyExceptions, but neither the RS nor the client informs the user of this.
> We need to add logging somewhere, probably on the client side, maybe on both sides, to tell our owner-operator what's going on. The client log doesn't say much, and all you have in the RS log is a whole bunch of flush and compaction activity. In this case, it would have been nice to see something like "RegionTooBusyException: above the memstore limit". Even better, include a recommendation for getting around the problem, such as "slow down, tiger, that's quite the write throughput" or "maybe you want to increase memstore size a bit". Or maybe the right approach is to advise tweaks to the client settings (retries, sleep periods, buffer size, &c.)?
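
A minimal illustration of the kind of actionable client-side message being asked for here; this is a hypothetical sketch, not existing HBase behavior, and the wording of the hint is invented:

{noformat}
import org.apache.hadoop.hbase.RegionTooBusyException;

// Hypothetical sketch only: map a server-side exception to an actionable hint
// when retries are exhausted, instead of reporting only exception counts.
class FailureHints {
  static String hintFor(Throwable cause) {
    if (cause instanceof RegionTooBusyException) {
      return "RegionTooBusyException: above the memstore limit; slow down the writers, "
          + "increase the memstore size, or tune client retries/backoff and write buffer size.";
    }
    return cause.toString();  // fall back to the raw exception
  }
}
{noformat}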



--
This message was sent by Atlassian JIRA
(v6.1#6144)
