hadoop-yarn-issues mailing list archives

From "Sangjin Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl
Date Fri, 26 Feb 2016 00:14:18 GMT

    [ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15168159#comment-15168159
] 

Sangjin Lee commented on YARN-4736:
-----------------------------------

This could be a bug in HBase. It appears the HBase cluster had already been shut down, but the
flush operation took 36 minutes to finally error out:
{noformat}
2016-02-26 00:02:28,270 INFO org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager:
The collector service for application_1456425026132_0001 was removed
2016-02-26 00:39:03,879 ERROR org.apache.hadoop.hbase.client.AsyncProcess: Failed to get region
location 
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions:
Fri Feb 26 00:39:03 IST 2016, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=68065:
row 'timelineservice.entity,naga!yarn_cluster!flow_1456425026132_1!�������!����M�����!YARN_CONTAINER!container_1456425026132_0001_01_000001,99999999999999'
on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=localhost,16201,1456365764939,
seqNum=0

	at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:264)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:215)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:56)
	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
	at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.loadCache(ClientSmallReversedScanner.java:211)
	at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.next(ClientSmallReversedScanner.java:185)
	at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1200)
	at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1109)
	at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:369)
	at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:320)
	at org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:206)
	at org.apache.hadoop.hbase.client.BufferedMutatorImpl.flush(BufferedMutatorImpl.java:183)
	at org.apache.hadoop.yarn.server.timelineservice.storage.common.BufferedMutatorDelegator.flush(BufferedMutatorDelegator.java:66)
	at org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl.flush(HBaseTimelineWriterImpl.java:457)
	at org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager$WriterFlushTask.run(TimelineCollectorManager.java:230)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketTimeoutException: callTimeout=60000, callDuration=68065: row 'timelineservice.entity,naga!yarn_cluster!flow_1456425026132_1!�������!����M�����!YARN_CONTAINER!container_1456425026132_0001_01_000001,99999999999999'
on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=localhost,16201,1456365764939,
seqNum=0
	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:159)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:310)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:291)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	... 3 more
Caused by: java.net.ConnectException: Connection refused
{noformat}
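For reference, that 36-minute window is consistent with the client-side retry settings visible in the log: {{attempts=36}} with {{callTimeout=60000}} per call, i.e. roughly 36 x 60s. A minimal sketch of how those knobs could be tightened on the writer's connection (the key names are standard HBase client settings; the values are illustrative only, not a recommendation):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class BoundedRetryConfig {
  public static Configuration create() {
    // Illustrative values only -- the point is that the ~36-minute
    // error-out window above is roughly (retries x per-call timeout),
    // and both factors are client-configurable.
    Configuration conf = HBaseConfiguration.create();
    // Per-RPC timeout; the log above shows the default callTimeout=60000.
    conf.setInt("hbase.rpc.timeout", 60000);
    // Retries before RetriesExhaustedException (36 in the log above).
    conf.setInt("hbase.client.retries.number", 3);
    // Hard cap on a whole client operation, retries included.
    conf.setInt("hbase.client.operation.timeout", 120000);
    return conf;
  }
}
{code}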

But this seems to have left the HBase client stuck in the flush. The flush thread appears to
be blocked:
{noformat}
"pool-14-thread-1" prio=10 tid=0x00007f4215268000 nid=0x46e6 waiting on condition [0x00007f41fe75d000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000000eeb5a010> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
	at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:374)
	at org.apache.hadoop.hbase.util.BoundedCompletionService.take(BoundedCompletionService.java:75)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:190)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:56)
	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
	at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.loadCache(ClientSmallReversedScanner.java:211)
	at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.next(ClientSmallReversedScanner.java:185)
	at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1200)
	at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1109)
	at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:369)
	at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:320)
	at org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:206)
	at org.apache.hadoop.hbase.client.BufferedMutatorImpl.flush(BufferedMutatorImpl.java:183)
	- locked <0x00000000c246f268> (a org.apache.hadoop.hbase.client.BufferedMutatorImpl)
	at org.apache.hadoop.yarn.server.timelineservice.storage.common.BufferedMutatorDelegator.flush(BufferedMutatorDelegator.java:66)
	at org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl.flush(HBaseTimelineWriterImpl.java:457)
	at org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager$WriterFlushTask.run(TimelineCollectorManager.java:230)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
{noformat}

Note that the {{flush()}} call is still on the stack. The thread that's trying to stop the aux
service cannot acquire the lock on the {{BufferedMutatorImpl}} instance, because that lock is
held by the thread stuck in {{flush()}}.
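To make the lock interaction concrete, here is a simplified model (not HBase's actual code) of what the dump shows: {{flush()}} holds the {{BufferedMutatorImpl}} monitor while it blocks inside the region lookup, so any other thread entering a synchronized method on the same instance, such as the shutdown path, parks on that monitor indefinitely:
{code:java}
import java.util.concurrent.CountDownLatch;

// Simplified model of the thread dump above, not HBase's actual code.
// This program hangs by design, mirroring the stuck shutdown.
public class StuckFlushModel {
  static class FakeBufferedMutator {
    synchronized void flush() throws InterruptedException {
      // Stands in for the region lookup that never returns once the
      // cluster is gone (the BoundedCompletionService.take() in the dump).
      new CountDownLatch(1).await();
    }
    synchronized void close() {
      System.out.println("close() acquired the monitor");
    }
  }

  public static void main(String[] args) throws Exception {
    FakeBufferedMutator mutator = new FakeBufferedMutator();
    Thread flusher = new Thread(() -> {
      try { mutator.flush(); } catch (InterruptedException ignored) { }
    }, "writer-flush");
    flusher.start();
    Thread.sleep(100);  // let flush() take the monitor first
    mutator.close();    // BLOCKED forever: monitor is held by flush()
  }
}
{code}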

[~vrushalic], you might want to check whether this is a known HBase issue. I don't think
this is something on our side.
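If it does turn out to be an HBase issue we have to live with, one possible collector-side guard (purely a sketch, not the current {{TimelineCollectorManager}} code) would be to run the flush under a bounded wait and interrupt it on timeout. The dump shows the thread parked in an interruptible {{ArrayBlockingQueue.take()}}, so an interrupt should release the {{BufferedMutatorImpl}} monitor, though whether the HBase retry loop unwinds cleanly after interruption would need verifying:
{code:java}
import java.util.concurrent.*;

// Sketch only: bound a potentially stuck flush so it cannot pin shutdown.
// The executor wiring and timeout value here are hypothetical.
public class BoundedFlush {
  public static void flushWithTimeout(Runnable flush, long timeoutMs)
      throws Exception {
    ExecutorService es = Executors.newSingleThreadExecutor();
    try {
      Future<?> f = es.submit(flush);
      try {
        f.get(timeoutMs, TimeUnit.MILLISECONDS);
      } catch (TimeoutException e) {
        // Interrupt the flush thread; per the dump it is parked in an
        // interruptible take(), so this should free the monitor.
        f.cancel(true);
      }
    } finally {
      es.shutdownNow();
    }
  }
}
{code}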

> Issues with HBaseTimelineWriterImpl
> -----------------------------------
>
>                 Key: YARN-4736
>                 URL: https://issues.apache.org/jira/browse/YARN-4736
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>    Affects Versions: YARN-2928
>            Reporter: Naganarasimha G R
>            Assignee: Vrushali C
>            Priority: Critical
>              Labels: yarn-2928-1st-milestone
>         Attachments: hbaseException.log, threaddump.log
>
>
> Faced some issues while running ATSv2 on a single-node Hadoop cluster, with HBase (and its
embedded ZooKeeper) launched on the same node.
> # Due to some NPE issues, the NM attempted to shut down, but the NM daemon process never
completed shutdown because of the locks.
> # Got some exceptions related to HBase after the application finished execution successfully.

> Will attach the logs and the thread dump for the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
