lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Susheel Kumar <susheel2...@gmail.com>
Subject Re: checksum failed (hardware problem?)
Date Sun, 30 Sep 2018 19:48:45 GMT
Thank you, Simon. Which basically points that something related to env and
was causing the checksum failures than any lucene/solr issue.

Eric - I did check with hardware folks and they are aware of some VMware
issue where the VM hosted in HCI environment is coming into some halt state
for minute or so and may be loosing connections to disk/network.  So that
probably may be the reason of index corruption though they have not been
able to find anything specific from logs during the time Solr run into issue

Also I had again issue where Solr is loosing the connection with zookeeper
(Client session timed out, have not heard from server in 8367ms for
sessionid 0x0)  Does that points to similar hardware issue, Any suggestions?

Thanks,
Susheel

2018-09-29 17:30:44.070 INFO
(searcherExecutor-7-thread-1-processing-n:server54:8080_solr
x:COLL_shard4_replica2 s:shard4 c:COLL r:core_node8) [c:COLL s:shard4
r:core_node8 x:COLL_shard4_replica2] o.a.s.c.SolrCore
[COLL_shard4_replica2] Registered new searcher
Searcher@7a4465b1[COLL_shard4_replica2]
main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_7x3f(6.6.2):C826923/317917:delGen=2523)
Uninverting(_83pb(6.6.2):C805451/172968:delGen=2957)
Uninverting(_3ywj(6.6.2):C727978/334529:delGen=2962)
Uninverting(_7vsw(6.6.2):C872110/385178:delGen=2020)
Uninverting(_8n89(6.6.2):C741293/109260:delGen=3863)
Uninverting(_7zkq(6.6.2):C720666/101205:delGen=3151)
Uninverting(_825d(6.6.2):C707731/112410:delGen=3168)
Uninverting(_dgwu(6.6.2):C760421/295964:delGen=4624)
Uninverting(_gs5x(6.6.2):C540942/138952:delGen=1623)
Uninverting(_gu6a(6.6.2):c75213/35640:delGen=1110)
Uninverting(_h33i(6.6.2):c131276/40356:delGen=706)
Uninverting(_h5tc(6.6.2):c44320/11080:delGen=380)
Uninverting(_h9d9(6.6.2):c35088/3188:delGen=104)
Uninverting(_h80h(6.6.2):c11927/3412:delGen=153)
Uninverting(_h7ll(6.6.2):c11284/1368:delGen=205)
Uninverting(_h8bs(6.6.2):c11518/2103:delGen=149)
Uninverting(_h9r3(6.6.2):c16439/1018:delGen=52)
Uninverting(_h9z1(6.6.2):c9428/823:delGen=27)
Uninverting(_h9v2(6.6.2):c933/33:delGen=12)
Uninverting(_ha1c(6.6.2):c1056/1:delGen=1)
Uninverting(_ha6i(6.6.2):c1883/124:delGen=8)
Uninverting(_ha3x(6.6.2):c807/14:delGen=3)
Uninverting(_ha47(6.6.2):c1229/133:delGen=6)
Uninverting(_hapk(6.6.2):c523) Uninverting(_haoq(6.6.2):c279)
Uninverting(_hamr(6.6.2):c311) Uninverting(_hap0(6.6.2):c338)
Uninverting(_hapu(6.6.2):c275) Uninverting(_hapv(6.6.2):C4/2:delGen=1)
Uninverting(_hapw(6.6.2):C5/2:delGen=1)
Uninverting(_hapx(6.6.2):C2/1:delGen=1)
Uninverting(_hapy(6.6.2):C2/1:delGen=1)
Uninverting(_hapz(6.6.2):C3/1:delGen=1)
Uninverting(_haq0(6.6.2):C6/3:delGen=1)
Uninverting(_haq1(6.6.2):C1)))}
2018-09-29 17:30:52.390 WARN
(zkCallback-5-thread-91-processing-n:server54:8080_solr-SendThread(server117:2182))
[   ] o.a.z.ClientCnxn Client session timed out, have not heard from
server in 8367ms for sessionid 0x0
2018-09-29 17:31:01.302 WARN
(zkCallback-5-thread-91-processing-n:server54:8080_solr-SendThread(server120:2182))
[   ] o.a.z.ClientCnxn Client session timed out, have not heard from
server in 8812ms for sessionid 0x0
2018-09-29 17:31:14.049 INFO
(zkCallback-5-thread-91-processing-n:server54:8080_solr-EventThread) [
  ] o.a.s.c.c.ConnectionManager Connection with ZooKeeper
reestablished.
2018-09-29 17:31:14.049 INFO
(zkCallback-5-thread-91-processing-n:server54:8080_solr-EventThread) [
  ] o.a.s.c.ZkController ZooKeeper session re-connected ... refreshing
core states after session expiration.
2018-09-29 17:31:14.051 INFO
(zkCallback-5-thread-91-processing-n:server54:8080_solr-EventThread) [
  ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (16)
-> (15)
2018-09-29 17:31:14.144 INFO  (qtp834133664-520378) [c:COLL s:shard4
r:core_node8 x:COLL_shard4_replica2] o.a.s.c.S.Request
[COLL_shard4_replica2]  webapp=/solr path=/admin/ping
params={distrib=false&df=wordTokens&_stateVer_=COLL:1246&preferLocalShards=false&qt=/admin/ping&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://server54:8080/solr/COLL_shard4_replica2/|http://server53:8080/solr/COLL_shard4_replica1/&rows=10&version=2&q={!lucene}*:*&NOW=1538242274139&isShard=true&wt=javabin}
webapp=/solr path=/admin/ping
params={distrib=false&df=wordTokens&_stateVer_=COLL:1246&preferLocalShards=false&qt=/admin/ping&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://server54:8080/solr/COLL_shard4_replica2/|http://server53:8080/solr/COLL_shard4_replica1/&rows=10&version=2&q={!lucene}*:*&NOW=1538242274139&isShard=true&wt=javabin}
hits=4989979 status=0 QTime=0




On Wed, Sep 26, 2018 at 9:44 AM simon <mtnest46@gmail.com> wrote:

> I saw something like this a year ago which i reported as a possible bug  (
> https://issues.apache.org/jira/browse/SOLR-10840, which has  a full
> description and stack traces)
>
> This occurred very randomly on an AWS instance; moving the index directory
> to a different file system did not fix the problem Eventually I cloned our
> environment to a new AWS instance, which proved to be the solution. Why, I
> have no idea...
>
> -Simon
>
> On Mon, Sep 24, 2018 at 1:13 PM, Susheel Kumar <susheel2777@gmail.com>
> wrote:
>
> > Got it. I'll have first hardware folks check and if they don't see/find
> > anything suspicious then i'll return here.
> >
> > Wondering if any body has seen similar error and if they were able to
> > confirm if it was hardware fault or so.
> >
> > Thnx
> >
> > On Mon, Sep 24, 2018 at 1:01 PM Erick Erickson <erickerickson@gmail.com>
> > wrote:
> >
> > > Mind you it could _still_ be Solr/Lucene, but let's check the hardware
> > > first ;)
> > > On Mon, Sep 24, 2018 at 9:50 AM Susheel Kumar <susheel2777@gmail.com>
> > > wrote:
> > > >
> > > > Hi Erick,
> > > >
> > > > Thanks so much for your reply.  I'll now look mostly into any
> possible
> > > > hardware issues than Solr/Lucene.
> > > >
> > > > Thanks again.
> > > >
> > > > On Mon, Sep 24, 2018 at 12:43 PM Erick Erickson <
> > erickerickson@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > There are several of reasons this would "suddenly" start appearing.
> > > > > 1> Your disk went bad and some sector is no longer faithfully
> > > > > recording the bits. In this case the checksum will be wrong
> > > > > 2> You ran out of disk space sometime and the index was corrupted.
> > > > > This isn't really a hardware problem.
> > > > > 3> Your disk controller is going wonky and not reading reliably.
> > > > >
> > > > > The "possible hardware issue" message is to alert you that this is
> > > > > highly unusual and you should at leasts consider doing integrity
> > > > > checks on your disk before assuming it's a Solr/Lucene problem
> > > > >
> > > > > Best,
> > > > > Erick
> > > > > On Mon, Sep 24, 2018 at 9:26 AM Susheel Kumar <
> susheel2777@gmail.com
> > >
> > > > > wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I am still trying to understand the corrupt index exception
we
> saw
> > > in our
> > > > > > logs. What does the hardware problem comment indicates here?
> Does
> > > that
> > > > > > mean it caused most likely due to hardware issue?
> > > > > >
> > > > > > We never had this problem in last couple of months. The Solr
is
> > > 6.6.2 and
> > > > > > ZK: 3.4.10.
> > > > > >
> > > > > > Please share your thoughts.
> > > > > >
> > > > > > Thanks,
> > > > > > Susheel
> > > > > >
> > > > > > Caused by: org.apache.lucene.index.CorruptIndexException:
> checksum
> > > > > > failed *(hardware
> > > > > > problem?)* : expected=db243d1a actual=7a00d3d2
> > > > > >
> > > > >
> > > (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/
> > app/solr/data/COLL_shard1_replica1/data/index/_i27s.cfs")
> > > > > > [slice=_i27s_Lucene50_0.tim])
> > > > > >
> > > > > > It suddenly started in the logs and before which there was no
> such
> > > error.
> > > > > > Searches & ingestions all seems to be working prior to that.
> > > > > >
> > > > > > ----
> > > > > >
> > > > > > 2018-09-03 17:16:49.056 INFO  (qtp834133664-519872) [c:COLL
> > s:shard1
> > > > > > r:core_node1 x:COLL_shard1_replica1]
> > > > > > o.a.s.u.p.StatelessScriptUpdateProcessorFactory
> > > update-script#processAdd:
> > > > > >
> > > newid=G31MXMRZESC0CYPR!A-G31MXMRZESC0CYPR.2552019802_1-25520
> > 08480_1-en_US
> > > > > > 2018-09-03 17:16:49.057 ERROR (qtp834133664-519872) [c:COLL
> > s:shard1
> > > > > > r:core_node1 x:COLL_shard1_replica1] o.a.s.h.RequestHandlerBase
> > > > > > org.apache.solr.common.SolrException: Exception writing document
> > id
> > > > > > G31MXMRZESC0CYPR!A-G31MXMRZESC0CYPR.2552019802_1-2552008480_
> > 1-en_US
> > > to
> > > > > the
> > > > > > index; possible analysis error.
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpd
> > ateHandler2.java:206)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.solr.update.processor.RunUpdateProcessor.processA
> > dd(RunUpdateProcessorFactory.java:67)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.solr.update.processor.UpdateRequestProcessor.proc
> > essAdd(UpdateRequestProcessor.java:55)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.solr.update.processor.DistributedUpdateProcessor.
> > doLocalAdd(DistributedUpdateProcessor.java:979)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.solr.update.processor.DistributedUpdateProcessor.
> > versionAdd(DistributedUpdateProcessor.java:1192)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.solr.update.processor.DistributedUpdateProcessor.
> > processAdd(DistributedUpdateProcessor.java:748)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.solr.update.processor.UpdateRequestProcessor.proc
> > essAdd(UpdateRequestProcessor.java:55)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.solr.update.processor.StatelessScriptUpdateProcessorFactory$
> > ScriptUpdateProcessor.processAdd(StatelessScriptUpdateProces
> > sorFactory.java:380)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.solr.handler.loader.JavabinLoader$1.update(Javabi
> > nLoader.java:98)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.solr.client.solrj.request.JavaBinUpdateRequestCod
> > ec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:180)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.solr.client.solrj.request.JavaBinUpdateRequestCod
> > ec$1.readIterator(JavaBinUpdateRequestCodec.java:136)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinC
> > odec.java:306)
> > > > > > at
> > > > > org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCode
> > c.java:251)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.solr.client.solrj.request.JavaBinUpdateRequestCod
> > ec$1.readNamedList(JavaBinUpdateRequestCodec.java:122)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinC
> > odec.java:271)
> > > > > > at
> > > > > org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCode
> > c.java:251)
> > > > > > at
> > > > >
> > > org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCo
> > dec.java:173)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.solr.client.solrj.request.JavaBinUpdateRequestCod
> > ec.unmarshal(JavaBinUpdateRequestCodec.java:187)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDoc
> > s(JavabinLoader.java:108)
> > > > > > at
> > > > >
> > >
> org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:55)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRe
> > questHandler.java:97)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.solr.handler.ContentStreamHandlerBase.handleReque
> > stBody(ContentStreamHandlerBase.java:68)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.solr.handler.RequestHandlerBase.handleRequest(Req
> > uestHandlerBase.java:173)
> > > > > > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
> > > > > > at
> > > org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
> > > > > > at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:
> > 529)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDisp
> > atchFilter.java:361)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDisp
> > atchFilter.java:305)
> > > > > > at
> > > > > >
> > > > >
> > > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilte
> > r(ServletHandler.java:1691)
> > > > > > at
> > > > > >
> > > > >
> > > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHan
> > dler.java:582)
> > > > > > at
> > > > > >
> > > > >
> > > org.eclipse.jetty.server.handler.ScopedHandler.handle(Scoped
> > Handler.java:143)
> > > > > > at
> > > > > >
> > > > >
> > > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHa
> > ndler.java:548)
> > > > > > at
> > > > > >
> > > > >
> > > org.eclipse.jetty.server.session.SessionHandler.doHandle(
> > SessionHandler.java:226)
> > > > > > at
> > > > > >
> > > > >
> > > org.eclipse.jetty.server.handler.ContextHandler.doHandle(
> > ContextHandler.java:1180)
> > > > > > at
> > > > >
> > > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHand
> > ler.java:512)
> > > > > > at
> > > > > >
> > > > >
> > > org.eclipse.jetty.server.session.SessionHandler.doScope(
> > SessionHandler.java:185)
> > > > > > at
> > > > > >
> > > > >
> > > org.eclipse.jetty.server.handler.ContextHandler.doScope(
> > ContextHandler.java:1112)
> > > > > > at
> > > > > >
> > > > >
> > > org.eclipse.jetty.server.handler.ScopedHandler.handle(Scoped
> > Handler.java:141)
> > > > > > at
> > > > > >
> > > > >
> > > org.eclipse.jetty.server.handler.ContextHandlerCollection.ha
> > ndle(ContextHandlerCollection.java:213)
> > > > > > at
> > > > > >
> > > > >
> > > org.eclipse.jetty.server.handler.HandlerCollection.handle(
> > HandlerCollection.java:119)
> > > > > > at
> > > > > >
> > > > >
> > > org.eclipse.jetty.server.handler.HandlerWrapper.handle(Handl
> > erWrapper.java:134)
> > > > > > at
> > > > > >
> > > > >
> > > org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(Rewr
> > iteHandler.java:335)
> > > > > > at
> > > > > >
> > > > >
> > > org.eclipse.jetty.server.handler.HandlerWrapper.handle(Handl
> > erWrapper.java:134)
> > > > > > at org.eclipse.jetty.server.Server.handle(Server.java:534)
> > > > > > at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.
> > java:320)
> > > > > > at
> > > > > >
> > > > >
> > > org.eclipse.jetty.server.HttpConnection.onFillable(HttpConne
> > ction.java:251)
> > > > > > at
> > > > > > org.eclipse.jetty.io
> > > > > .AbstractConnection$ReadCallback.succeeded(AbstractConnectio
> > n.java:273)
> > > > > > at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.
> > java:95)
> > > > > > at
> > > > > > org.eclipse.jetty.io
> > > > > .SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> > > > > > at
> > > > > >
> > > > >
> > > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume
> > .executeProduceConsume(ExecuteProduceConsume.java:303)
> > > > > > at
> > > > > >
> > > > >
> > > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume
> > .produceConsume(ExecuteProduceConsume.java:148)
> > > > > > at
> > > > > >
> > > > >
> > > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume
> > .run(ExecuteProduceConsume.java:136)
> > > > > > at
> > > > > >
> > > > >
> > > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(Queued
> > ThreadPool.java:671)
> > > > > > at
> > > > > >
> > > > >
> > > org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedT
> > hreadPool.java:589)
> > > > > > at java.lang.Thread.run(Thread.java:748)
> > > > > > Caused by: org.apache.lucene.store.AlreadyClosedException: this
> > > > > IndexWriter
> > > > > > is closed
> > > > > > at
> > > org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:749)
> > > > > > at
> > > org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:763)
> > > > > > at
> > > > >
> > > org.apache.lucene.index.IndexWriter.updateDocument(IndexWrit
> > er.java:1567)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.solr.update.DirectUpdateHandler2.updateDocument(D
> > irectUpdateHandler2.java:924)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocVa
> > lues(DirectUpdateHandler2.java:913)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(D
> > irectUpdateHandler2.java:302)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUp
> > dateHandler2.java:239)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpd
> > ateHandler2.java:194)
> > > > > > ... 54 more
> > > > > > Caused by: org.apache.lucene.index.CorruptIndexException:
> checksum
> > > failed
> > > > > > (hardware problem?) : expected=db243d1a actual=7a00d3d2
> > > > > >
> > > > >
> > > (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/
> > app/solr/data/COLL_shard1_replica1/data/index/_i27s.cfs")
> > > > > > [slice=_i27s_Lucene50_0.tim]))
> > > > > > at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.
> > java:419)
> > > > > > at
> > > > >
> > > org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecU
> > til.java:526)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.chec
> > kIntegrity(BlockTreeTermsReader.java:336)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$Fie
> > ldsReader.checkIntegrity(PerFieldPostingsFormat.java:348)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.lucene.codecs.perfield.PerFieldMergeState$FilterF
> > ieldsProducer.checkIntegrity(PerFieldMergeState.java:271)
> > > > > > at
> > > org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:96)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$Fie
> > ldsWriter.merge(PerFieldPostingsFormat.java:164)
> > > > > > at
> > > > >
> > >
> org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:216)
> > > > > > at
> > > org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:101)
> > > > > > at
> > > org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4356)
> > > > > > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:
> > 3931)
> > > > > > at
> > > org.apache.solr.update.SolrIndexWriter.merge(SolrIndexWriter.java:188)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(Con
> > currentMergeScheduler.java:624)
> > > > > > at
> > > > > >
> > > > >
> > > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread
> > .run(ConcurrentMergeScheduler.java:661)
> > > > > >
> > > > > > 2018-09-03 17:16:49.116 INFO  (qtp834133664-519872) [c:COLL
> > s:shard1
> > > > > > r:core_node1 x:COLL_shard1_replica1] o.a.s.c.S.Request
> > > > > > [COLL_shard1_replica1]  webapp=/solr path=/update
> > > > > > params={wt=javabin&version=2} status=400 QTime=69
> > > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message