lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "William Pierce" <evalsi...@hotmail.com>
Subject Re: Fatal exception in solr 1.3+ replication
Date Sat, 15 Nov 2008 23:56:46 GMT
Mark,

That sounds great!    Good luck with the cleaning :-)

Let me know how I can get a patch --- I'd prefer not do a solr build from 
source since we are not Java savvy here....:-(

- Bill

--------------------------------------------------
From: "Mark Miller" <markrmiller@gmail.com>
Sent: Saturday, November 15, 2008 12:43 PM
To: <solr-user@lucene.apache.org>
Subject: Re: Fatal exception in solr 1.3+ replication

> Okay, I'm fairly certain I've found it. As usual, take a walk and the 
> solution pops into your head out of the blue.
>
> It looks like Lucene's IndexReader reopen call is not very friendly with 
> the FSDirectory implementation. If you call reopen and it returns a new 
> IndexReader, it creates a new reference on the Directory - so if you 
> reopen an IndexReader that was originally opened with a non Directory 
> parameter (String or File instead), both Readers (the reopened one and the 
> one your reopening on) will close the Directory when they close. Thats not 
> right. Thats how we get to 0 faster than we should. So its kind of a 
> Lucene issue.
>
> My guess that this is hidden in the trunk was right, because I think we 
> are no longer using String, File based IndexReader opens, which means our 
> IndexReaders don't attempt to close their underlying Directories now.
>
> I can probably send you patch for the revision your on to hide this as 
> well, but I'm already in the doghouse on cleaning right now ; ) The way my 
> brain works, I'll probably be back to this later though.
>
> - Mark
>
>
> William Pierce wrote:
>>> Trunk may actually still hide the issue (possibly), but something really 
>>> funky seems to have gone on and I can't find it yet. Do you have any 
>>> custom code interacting with solr?
>>
>> None whatsoever...I am using out-of-the-box solr 1.3 (build of 10/23).  I 
>> am using my C# app to http requests to my solr instance.
>>
>> Is there something you want me to try at my end that might give you a 
>> clue? Let me know and I can try to help out.
>>
>> Best,
>>
>> - Bill
>>
>> --------------------------------------------------
>> From: "Mark Miller" <markrmiller@gmail.com>
>> Sent: Saturday, November 15, 2008 10:59 AM
>> To: <solr-user@lucene.apache.org>
>> Subject: Re: Fatal exception in solr 1.3+ replication
>>
>>> Havn't given up, but this has really got me so far. For every path, 
>>> FSDirectory allows just one instance of FSDirectory to exist, and it 
>>> keeps a ref count of how many have been returned from openDirectory for 
>>> a given path. An FSDirectory will not actually be closed unless all 
>>> references to it are released (it doesn't actually even do anything in 
>>> close, other than drop the reference). So pretty much, the only way to 
>>> get into trouble is to call close enough times to equal how many times 
>>> you called openDirectory, and then try to use the FSDirectory again. 
>>> This is what your stack trace indicates is happening.
>>>
>>> So we can get the call hierarchy for directory.close() in solr, and we 
>>> find that everything is pretty matched up...at worst it looks that a 
>>> reference might not be closed - but that doesn't hurt 
>>> anything...FSDirectory will just think there is something out there 
>>> holding onto a reference for that directory and allow you to continue 
>>> using it, even though no such reference might exist. Its only when 
>>> enough closes are called that an instance will be marked as closed (a 
>>> further openDirectory will return a new open instance). So to get your 
>>> error, something has to close a Directory that it did not get from 
>>> openDirectory.
>>>
>>> We know that the IndexReader that is trying to be reopened must have 
>>> called openDirectory (or something called openDirectory for it) and we 
>>> know that it hasn't called close (the IndexReader has already held up to 
>>> an ensureOpen call on itself in that stack trace). Something else must 
>>> have closed it. I can't find this happening. Nothing else calls close on 
>>> Directory unless it called openDirectory (that I can find using all of 
>>> Eclipses magical goodness).
>>>
>>> So how does the refcount on the Directory hit 0? I can't find or 
>>> duplicate yet...
>>>
>>> Trunk may actually still hide the issue (possibly), but something really 
>>> funky seems to have gone on and I can't find it yet. Do you have any 
>>> custom code interacting with solr?
>>>
>>> - Mark
>>>
>>> William Pierce wrote:
>>>> Mark,
>>>>
>>>> Thanks for your help and lucid exposition of what may be going on....I 
>>>> will await to hear from you before I plough ahead with the latest trunk 
>>>> bits....
>>>>
>>>> Best regards,
>>>>
>>>> -- Bill
>>>>
>>>> --------------------------------------------------
>>>> From: "Mark Miller" <markrmiller@gmail.com>
>>>> Sent: Saturday, November 15, 2008 7:29 AM
>>>> To: <solr-user@lucene.apache.org>
>>>> Subject: Re: Fatal exception in solr 1.3+ replication
>>>>
>>>>> Hang on. The more I look at this, the more I am thinking that was not

>>>>> the problem. Directories are pretty strictly managed by Lucene at the

>>>>> moment, and it should actually be pretty difficult to have one closed

>>>>> out from under you. They are singletons and reference counted. The 
>>>>> IndexReader would have to have been closed if it was responsible for

>>>>> closing the Directory, and in that case, we would not be trying to 
>>>>> reopen it. The Searcher we get the IndexReader from has been inc ref'd

>>>>> to ensure it won't close. All I can think is that something is 
>>>>> grabbing/stealing a Directory that didn't directly ask for it from 
>>>>> FSDirectory.getDirectory, and is then closing it. I'm trying to hunt

>>>>> where that could be happening now. Hope to spot something, but appears

>>>>> pretty mysterious at the moment. I suppose another option is a nasty

>>>>> little race condition - I've been trying to repeat the error by 
>>>>> sending lots of update/search requests from multiple threads with no

>>>>> luck though. Looks like I may have to throw in snap puller code 
>>>>> (havn't look too heavily into any of that before).
>>>>>
>>>>> At worst, if/when a fix is discovered, you will probably be able to 
>>>>> apply just the fix to the revision your working with.
>>>>>
>>>>> - Mark
>>>>>
>>>>> William Pierce wrote:
>>>>>> Mark,
>>>>>>
>>>>>> Thanks for your response --- I do appreciate all you volunteers 
>>>>>> working to provide such a nice system!
>>>>>>
>>>>>> Anyway,  I will try the trunk bits as you said.  The only problem
is 
>>>>>> that the later the trunk I use from 1.3,  the more of post 1.3 
>>>>>> capability I get. And I feel somewhat exposed running these bits
in 
>>>>>> our production environment...
>>>>>>
>>>>>> Not sure if there's another approach?
>>>>>>
>>>>>> Thanks,
>>>>>> -Bill
>>>>>>
>>>>>> --------------------------------------------------
>>>>>> From: "Mark Miller" <markrmiller@gmail.com>
>>>>>> Sent: Friday, November 14, 2008 8:43 PM
>>>>>> To: <solr-user@lucene.apache.org>
>>>>>> Subject: Re: Fatal exception in solr 1.3+ replication
>>>>>>
>>>>>>> Hey William, sorry about the trouble. I have to look at this

>>>>>>> further, but I think the issue is fixed if you grab the latest
trunk 
>>>>>>> build. Solr-465 should inadvertently fix things - before that
patch, 
>>>>>>> a deprecated constructor for solrsearcher was being called -
this 
>>>>>>> constructor caused the underlying IndexReader to close its own

>>>>>>> Directory, and since IndexReaders are reopened, we don't want
that.
>>>>>>>
>>>>>>> Mark Miller wrote:
>>>>>>>> Looks like there might be an issue with the reopen - I'm
not seeing 
>>>>>>>> what it could be offhand though. Have to find what could
be closing 
>>>>>>>> a Directory unexpectedly...I'll try to take a further look
over the 
>>>>>>>> weekend.
>>>>>>>>
>>>>>>>> - Mark
>>>>>>>>
>>>>>>>> William Pierce wrote:
>>>>>>>>> Folks:
>>>>>>>>>
>>>>>>>>> I am using the nightly build of 1.3 as of Oct 23 so as
to use the 
>>>>>>>>> replication handler.   I am running on windows 2003 server
with 
>>>>>>>>> tomcat 6.0.14.   Everything was running fine until I
noticed that 
>>>>>>>>> certain updated records were not showing up on the slave.
 Further 
>>>>>>>>> investigation showed me that the failures have indeed
been 
>>>>>>>>> occurring since early this morning with a fatal exception....here

>>>>>>>>> is a segment of the tomcat log:
>>>>>>>>>   INFO: Total time taken for download : 0 secs
>>>>>>>>>   Nov 14, 2008 5:34:24 AM org.apache.solr.handler.SnapPuller

>>>>>>>>> fetchLatestIndex
>>>>>>>>>   INFO: Conf files are not downloaded or are in sync
>>>>>>>>>   Nov 14, 2008 5:34:24 AM 
>>>>>>>>> org.apache.solr.update.DirectUpdateHandler2 commit
>>>>>>>>>   INFO: start 
>>>>>>>>> commit(optimize=false,waitFlush=true,waitSearcher=true)
>>>>>>>>>   Nov 14, 2008 5:34:24 AM 
>>>>>>>>> org.apache.solr.handler.ReplicationHandler doSnapPull
>>>>>>>>>   SEVERE: SnapPull failed   org.apache.solr.common.SolrException:

>>>>>>>>> Snappull failed :    at 
>>>>>>>>> org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:278)
>>>>>>>>>    at 
>>>>>>>>> org.apache.solr.handler.ReplicationHandler.doSnapPull(ReplicationHandler.java:208)
>>>>>>>>>    at 
>>>>>>>>> org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:121)
>>>>>>>>>    at 
>>>>>>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>>>>>>>>>    at 
>>>>>>>>> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
>>>>>>>>>    at 
>>>>>>>>> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
>>>>>>>>>    at 
>>>>>>>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
>>>>>>>>>    at 
>>>>>>>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
>>>>>>>>>    at 
>>>>>>>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
>>>>>>>>>    at 
>>>>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
>>>>>>>>>    at 
>>>>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
>>>>>>>>>    at java.lang.Thread.run(Thread.java:619)
>>>>>>>>>   Caused by: java.lang.RuntimeException: 
>>>>>>>>> org.apache.lucene.store.AlreadyClosedException: this
Directory is 
>>>>>>>>> closed
>>>>>>>>>    at 
>>>>>>>>> org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1037)
>>>>>>>>>    at 
>>>>>>>>> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:350)
>>>>>>>>>    at 
>>>>>>>>> org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:353)
>>>>>>>>>    at 
>>>>>>>>> org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:265)
>>>>>>>>>    ... 11 more
>>>>>>>>>   Caused by: org.apache.lucene.store.AlreadyClosedException:
this 
>>>>>>>>> Directory is closed
>>>>>>>>>    at 
>>>>>>>>> org.apache.lucene.store.Directory.ensureOpen(Directory.java:220)
>>>>>>>>>    at 
>>>>>>>>> org.apache.lucene.store.FSDirectory.list(FSDirectory.java:320)
>>>>>>>>>    at 
>>>>>>>>> org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:533)
>>>>>>>>>    at 
>>>>>>>>> org.apache.lucene.index.SegmentInfos.readCurrentVersion(SegmentInfos.java:366)
>>>>>>>>>    at 
>>>>>>>>> org.apache.lucene.index.DirectoryIndexReader.isCurrent(DirectoryIndexReader.java:188)
>>>>>>>>>    at 
>>>>>>>>> org.apache.lucene.index.DirectoryIndexReader.reopen(DirectoryIndexReader.java:124)
>>>>>>>>>    at 
>>>>>>>>> org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1016)
>>>>>>>>>    ... 14 more
>>>>>>>>>   Nov 14, 2008 5:38:52 AM 
>>>>>>>>> org.apache.solr.update.DirectUpdateHandler2 commit
>>>>>>>>>
>>>>>>>>> Any ideas, anyone?
>>>>>>>>>
>>>>>>>>> -- Bill
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>
>>>
>
> 

Mime
View raw message