lucene-dev mailing list archives

From "Eks Dev (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SOLR-4117) IO error while trying to get the size of the Directory
Date Wed, 28 Nov 2012 15:28:58 GMT

    [ https://issues.apache.org/jira/browse/SOLR-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505530#comment-13505530 ]

Eks Dev edited comment on SOLR-4117 at 11/28/12 3:27 PM:
---------------------------------------------------------

fwiw, we *think* we observed the following problem in a simple master/slave setup with NRTCachingDirectory...
I am not sure it has anything to do with this issue, because we did not see this exception; anyhow:
  

On replication, the slave gets the index from the master and works fine, then on:
1. graceful restart, the world looks fine
2. kill -9 or such, Solr does not start because the index gets corrupted (this should not actually happen)

We speculate that Solr now replicates directly to the Directory implementation and does
not ensure that replicated files are completely fsync-ed after replication. As far as I remember,
replication used to go to /temp (disk) and then move the files if all went OK, working under
the assumption that everything was already persisted. Maybe this invariant no longer holds
and some explicit fsync is needed for caching directories?
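
To make the speculation above concrete, here is a minimal sketch of the "write to a temp file, force it to disk, then move it into place" pattern. It uses plain java.nio and is not Solr's or Lucene's actual replication code; the class and method names (DurableReplicaWrite, writeDurably) are made up for illustration, and whether Solr needs something like this is exactly the open question.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of Solr or Lucene.
public class DurableReplicaWrite {

    /** Writes data to a temp file in indexDir, syncs it, then moves it to its final name. */
    static void writeDurably(Path indexDir, String fileName, byte[] data) throws IOException {
        // The temp file lives in the same directory so the later move stays on one filesystem.
        Path tmp = Files.createTempFile(indexDir, fileName, ".tmp");
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            // Ask the OS to flush file content and metadata to the storage device.
            ch.force(true);
        }
        // Only after the sync does the file become visible under its real name.
        Files.move(tmp, indexDir.resolve(fileName), StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("index");
        writeDurably(dir, "segments_1", new byte[] {1, 2, 3});
        System.out.println(Files.size(dir.resolve("segments_1")));
    }
}
```

The sketch does not sync the parent directory entry, which some filesystems also need for the rename itself to survive a hard kill, so treat it as an illustration of the ordering rather than a complete durability recipe.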

I might be completely wrong; we just observed the symptoms in a not really debug-friendly environment.

Here is the exception after a "hard" restart:

Caused by: org.apache.solr.common.SolrException: Error opening new searcher
   at org.apache.solr.core.SolrCore.<init>(SolrCore.java:804)
   at org.apache.solr.core.SolrCore.<init>(SolrCore.java:618)
   at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:973)
   at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1003)
   ... 10 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
   at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1441)
   at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1553)
   at org.apache.solr.core.SolrCore.<init>(SolrCore.java:779)
   ... 13 more
Caused by: java.io.FileNotFoundException: ...\core0\data\index\segments_1 (The system cannot find the file specified)
   at java.io.RandomAccessFile.open(Native Method)
   at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
   at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:222)
   at org.apache.lucene.store.NRTCachingDirectory.openInput(NRTCachingDirectory.java:232)
   at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:281)
   at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:56)
   at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:668)
   at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
   at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:87)
   at org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:34)
   at org.apache.solr.search.SolrIndexSearcher.<init>(SolrIndexSearcher.java:120)
   at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1417)
....
 
> IO error while trying to get the size of the Directory
> ------------------------------------------------------
>
>                 Key: SOLR-4117
>                 URL: https://issues.apache.org/jira/browse/SOLR-4117
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.0
>         Environment: 5.0.0.2012.11.28.10.42.06
> Debian Squeeze, Tomcat 6, Sun Java 6, 10 nodes, 10 shards, rep. factor 2.
>            Reporter: Markus Jelsma
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 5.0
>
>
> With SOLR-4032 fixed we see other issues when randomly taking down nodes (nicely via tomcat restart) while indexing a few million web pages from Hadoop. We do make sure that at least one node is up for a shard but due to recovery issues it may not be live.
> One node seems to work but generates IO errors in the log and a ZooKeeperException in the GUI. In the GUI we only see:
> {code}
> SolrCore Initialization Failures
>     openindex_f: org.apache.solr.common.cloud.ZooKeeperException:org.apache.solr.common.cloud.ZooKeeperException:
> Please check your logs for more information
> {code}
> and in the log we only see the following exception:
> {code}
> 2012-11-28 11:47:26,652 ERROR [solr.handler.ReplicationHandler] - [http-8080-exec-28] - : IO error while trying to get the size of the Directory:org.apache.lucene.store.NoSuchDirectoryException: directory '/opt/solr/cores/shard_f/data/index' does not exist
>         at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:217)
>         at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:240)
>         at org.apache.lucene.store.NRTCachingDirectory.listAll(NRTCachingDirectory.java:132)
>         at org.apache.solr.core.DirectoryFactory.sizeOfDirectory(DirectoryFactory.java:146)
>         at org.apache.solr.handler.ReplicationHandler.getIndexSize(ReplicationHandler.java:472)
>         at org.apache.solr.handler.ReplicationHandler.getReplicationDetails(ReplicationHandler.java:568)
>         at org.apache.solr.handler.ReplicationHandler.handleRequestBody(ReplicationHandler.java:213)
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
>         at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1830)
>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:476)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
>         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>         at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
>         at org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
>         at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744)
>         at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2274)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
