lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Rallavagu <rallav...@gmail.com>
Subject Re: Recovery Thread Blocked
Date Sun, 04 Oct 2015 17:14:25 GMT
As there were no responses so far, I assume that this is not a very 
common issue that folks come across. So, I went into source (4.6.1) to 
see if I can figure out what could be the cause.


The thread that is locking is in this block of code

synchronized (recoveryLock) {
       // to be air tight we must also check after lock
       if (cc.isShutDown()) {
         log.warn("Skipping recovery because Solr is shutdown");
         return;
       }
       log.info("Running recovery - first canceling any ongoing recovery");
       cancelRecovery();

       while (recoveryRunning) {
         try {
           recoveryLock.wait(1000);
         } catch (InterruptedException e) {

         }
         // check again for those that were waiting
         if (cc.isShutDown()) {
           log.warn("Skipping recovery because Solr is shutdown");
           return;
         }
         if (closed) return;
       }

Subsequently, the thread will get into cancelRecovery method as below,

public void cancelRecovery() {
     synchronized (recoveryLock) {
       if (recoveryStrat != null && recoveryRunning) {
         recoveryStrat.close();
         while (true) {
           try {
             recoveryStrat.join();
           } catch (InterruptedException e) {
             // not interruptible - keep waiting
             continue;
           }
           break;
         }

         recoveryRunning = false;
         recoveryLock.notifyAll();
       }
     }
   }

As per the stack trace "recoveryStrat.join()" is where things are 
holding up.

I wonder why/how cancelRecovery would take time so around 870 threads 
would be waiting on. Is it possible that ZK is not responding or 
something else like Operating System resources could cause this? Thanks.


On 10/2/15 4:17 PM, Rallavagu wrote:
> Here is the stack trace of the thread that is holding the lock.
>
>
> "Thread-55266" id=77142 idx=0xc18 tid=992 prio=5 alive, waiting,
> native_blocked, daemon
>      -- Waiting for notification on:
> org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]
>      at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
>      at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
>      at
> syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
>      at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
>      at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
>      at
> RJNI_jrockit_vm_Threads_waitForNotifySignal+73(rnithreads.c:72)@0x7ff31351939a
>
>      at
> jrockit/vm/Threads.waitForNotifySignal(JLjava/lang/Object;)Z(Native Method)
>      at java/lang/Object.wait(J)V(Native Method)
>      at java/lang/Thread.join(Thread.java:1206)
>      ^-- Lock released while waiting:
> org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]
>      at java/lang/Thread.join(Thread.java:1259)
>      at
> org/apache/solr/update/DefaultSolrCoreState.cancelRecovery(DefaultSolrCoreState.java:331)
>
>      ^-- Holding lock: java/lang/Object@0x114d8dd00[recursive]
>      at
> org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:297)
>
>      ^-- Holding lock: java/lang/Object@0x114d8dd00[fat lock]
>      at
> org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
>
>      at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
>
>
> Stack trace of one of the 870 threads that is waiting for the lock to be
> released.
>
> "Thread-55489" id=77520 idx=0xebc tid=1494 prio=5 alive, blocked,
> native_blocked, daemon
>      -- Blocked trying to get lock: java/lang/Object@0x114d8dd00[fat lock]
>      at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
>      at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
>      at
> syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
>      at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
>      at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
>      at jrockit/vm/Threads.waitForUnblockSignal()V(Native Method)
>      at jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1411)[optimized]
>      at jrockit/vm/Locks.lockFat(Locks.java:1512)[optimized]
>      at
> jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1054)[optimized]
>      at
> jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
>      at jrockit/vm/Locks.monitorEnter(Locks.java:2179)[optimized]
>      at
> org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:290)
>
>      at
> org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
>
>      at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
>
> On 10/2/15 4:12 PM, Rallavagu wrote:
>> Solr 4.6.1 on Tomcat 7, single shard 4 node cloud with 3 node zookeeper
>>
>> During updates, some nodes are going very high cpu and becomes
>> unavailable. The thread dump shows the following thread is blocked 870
>> threads which explains high CPU. Any clues on where to look?
>>
>> "Thread-56848" id=79207 idx=0x38 tid=3169 prio=5 alive, blocked,
>> native_blocked, daemon
>>      -- Blocked trying to get lock: java/lang/Object@0x114d8dd00[fat
>> lock]
>>      at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
>>      at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
>>      at
>> syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
>>      at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
>>      at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
>>      at jrockit/vm/Threads.waitForUnblockSignal()V(Native Method)
>>      at jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1411)[optimized]
>>      at jrockit/vm/Locks.lockFat(Locks.java:1512)[optimized]
>>      at
>> jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1054)[optimized]
>>      at
>> jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
>>      at jrockit/vm/Locks.monitorEnter(Locks.java:2179)[optimized]
>>      at
>> org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:290)
>>
>>
>>      at
>> org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
>>
>>
>>      at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)

Mime
View raw message