From: Tim Vaillancourt <...@elementspace.com>
Subject: SolrCloud 4.3.1 - "Failure to open existing log file (non fatal)" errors under high load
Date: Thu, 25 Jul 2013 23:44:22 GMT
Hey guys,

I'm reaching out to the Solr list with an admittedly vague issue: under high
load against a SolrCloud 4.3.1 cluster of 3 instances, 3 shards, and 2
replicas per shard (2 cores per instance), I eventually see failure messages
related to transaction logs, and shortly after these stack traces appear the
cluster starts to fall apart.

To explain my setup:
- SolrCloud 4.3.1.
- Jetty 9.x.
- Oracle/Sun JDK 1.7.0_25 with the CMS collector.
- RHEL 6.x, 64-bit.
- 3 instances, 1 per server.
- 3 shards.
- 2 replicas per shard.
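
For concreteness, here is a minimal SolrJ sketch of the sort of indexing
client that drives this kind of load. This is not my actual test harness;
the ZooKeeper host string, collection name, and fields are placeholders:

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexLoadSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZooKeeper ensemble; replace with the real zkHost string.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("xmshd"); // hypothetical collection name

        for (int i = 0; i < 1000000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("text_t", "load test payload " + i);
            server.add(doc); // routed to the correct shard leader via cluster state in ZK
        }
        server.commit();
        server.shutdown();
    }
}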

The transaction log error I receive after about 10-30 minutes of load
testing is:

"ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.SolrException]
Failure to open existing log file (non fatal)
/opt/easw/easw_apps/easo_solr_cloud/solr/xmshd_shard3_replica2/data/tlog/tlog.0000000000000000078:org.apache.solr.common.SolrException:
java.io.EOFException
        at
org.apache.solr.update.TransactionLog.<init>(TransactionLog.java:182)
        at org.apache.solr.update.UpdateLog.init(UpdateLog.java:233)
        at
org.apache.solr.update.UpdateHandler.initLog(UpdateHandler.java:83)
        at
org.apache.solr.update.UpdateHandler.<init>(UpdateHandler.java:138)
        at
org.apache.solr.update.UpdateHandler.<init>(UpdateHandler.java:125)
        at
org.apache.solr.update.DirectUpdateHandler2.<init>(DirectUpdateHandler2.java:95)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
        at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
        at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:525)
        at
org.apache.solr.core.SolrCore.createUpdateHandler(SolrCore.java:596)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:805)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:618)
        at
org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:894)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:982)
        at org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:597)
        at org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:592)
        at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.EOFException
        at
org.apache.solr.common.util.FastInputStream.readUnsignedByte(FastInputStream.java:73)
        at
org.apache.solr.common.util.FastInputStream.readInt(FastInputStream.java:216)
        at
org.apache.solr.update.TransactionLog.readHeader(TransactionLog.java:266)
        at
org.apache.solr.update.TransactionLog.<init>(TransactionLog.java:160)
        ... 25 more
"

Eventually, after a few of these stack traces, the cluster starts to lose
shards and replicas fail. Jetty then accumulates hung threads until the JVM
hits an OutOfMemoryError creating native threads, because the maximum
process ulimit is reached.
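
On RHEL, the native-thread ceiling is typically the "max user processes"
ulimit, since threads count as processes on Linux. One way to confirm which
limits the running JVM actually inherited is to read /proc/self/limits from
inside the process; a quick, Linux-only sketch:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ShowLimits {
    public static void main(String[] args) throws IOException {
        // Linux-only: print the limits this JVM process is actually running under,
        // including "Max processes", which caps native thread creation.
        for (String line : Files.readAllLines(Paths.get("/proc/self/limits"), StandardCharsets.UTF_8)) {
            System.out.println(line);
        }
    }
}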

I know this is quite a vague issue, so I'm not expecting a silver-bullet
answer, but does anyone have suggestions on where to look next? Does this
sound Solr-related at all, or possibly system-related? Has anyone seen this
issue before, or does anyone have a hypothesis on how to find out more?

I will reply shortly with a thread dump taken from one locked-up node.
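
For anyone who wants to capture the same data on their own nodes: besides
jstack, a dump can also be taken in-process via the standard ThreadMXBean
API. A minimal sketch (note that ThreadInfo.toString() truncates long stack
traces):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // true, true -> include locked monitors and synchronizers, like jstack -l
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            System.out.print(info); // ThreadInfo.toString() includes a stack trace
        }
    }
}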

Thanks for any suggestions!

Tim
