lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-665) temporary file access denied on Windows
Date Wed, 30 Aug 2006 10:57:26 GMT
    [ http://issues.apache.org/jira/browse/LUCENE-665?page=comments#action_12431533 ] 
            
Michael McCandless commented on LUCENE-665:
-------------------------------------------


> But I am not sure how this should affect decision on applying this fix
> - there would always be user machines out there running Lucene and
> also running other services.

> We could tell users - hey, make sure that none of the other services /
> software running on your machine is holding / touching / examining
> Lucene index files, otherwise, don't blame Lucene - but this is not
> easily done. Not all developers out there have control or
> understanding of what's running on their machines - some programs are
> installed by a system support, you know how it is.

> So, while it is understandable that Lucene would fail if there is a
> malicious software that actually grabs and holds Lucene files and
> interfere with them (for "long" periods of times), it would be nice to
> keep these failures at minimum.

Alas I still cannot reproduce this.  I think there must be some
environmental difference.

I agree that Lucene should strive to be robust, up to a degree, to the
various "environmental differences" (OS, filesystem, permissions, virus
checkers installed, etc.).  However, I still think it's best to get to
the root cause of these errors so that users have as much information
as possible.  Plus, this may help us build a more accurate fix for the
issue than sleeping / retrying.

For example, if it turns out this happens only under Windows XP SP1,
then yes, we can try to make Lucene robust to these errors, but in
addition we should document this so that users who have the freedom to
do so can upgrade to SP2.  (NOTE: I'm just using this as an example:
we still have no idea whether it's the SP1/SP2 difference that "fixes"
the errors in my testing of this issue.)

Given that we have two environments, one very reliably showing these
IO problems (yours) and one very reliably not (mine), this is really a
great chance to get to the root cause.  Here are the details of my
env:

  OS: Windows XP Pro, SP2
  Java: Sun JDK 1.5.0_07
  Command line: java org.junit.runner.JUnitCore org.apache.lucene.index.TestInterleavedAddAndRemoves
  Services running: Google desktop, Symantec AV
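
To make this kind of side-by-side comparison easier, the OS and JVM part of the report could be collected with a small program. The following is only a sketch: the class name EnvReport and the choice of properties are assumptions for illustration, and the standard system properties do not necessarily reveal the Windows service pack level or which services are running, so those would still have to be listed by hand.

```java
/**
 * Hypothetical helper (not part of Lucene): prints the standard JVM system
 * properties that describe the OS and Java runtime, for pasting into a
 * bug report.
 */
final class EnvReport {
    // Standard system property keys; the selection is an assumption.
    static final String[] KEYS = {
        "os.name", "os.version", "os.arch", "java.vendor", "java.version"
    };

    /** Returns one "key=value" line per property. */
    static String describe() {
        StringBuilder sb = new StringBuilder();
        for (String key : KEYS) {
            sb.append(key)
              .append('=')
              .append(System.getProperty(key, "unknown"))
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```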


> temporary file access denied on Windows
> ---------------------------------------
>
>                 Key: LUCENE-665
>                 URL: http://issues.apache.org/jira/browse/LUCENE-665
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Store
>    Affects Versions: 2.0.0
>         Environment: Windows
>            Reporter: Doron Cohen
>         Attachments: FSDirectory_Retry_Logic.patch, FSDirs_Retry_Logic_3.patch, Test_Output.txt,
>                      TestInterleavedAddAndRemoves.java
>
>
> When interleaving adds and removes there is frequent opening/closing of
> readers and writers.
>
> I tried to measure performance in such a scenario (for issue 565), but the
> performance test failed - the indexing process crashed consistently with
> file "access denied" errors - "cannot create a lock file" in
> "lockFile.createNewFile()" and "cannot rename file".
>
> This is related to:
> - issue 516 (a closed issue: "TestFSDirectory fails on Windows") - http://issues.apache.org/jira/browse/LUCENE-516
> - user list questions due to file errors:
>   - http://www.nabble.com/OutOfMemory-and-IOException-Access-Denied-errors-tf1649795.html
>   - http://www.nabble.com/running-a-lucene-indexing-app-as-a-windows-service-on-xp%2C-crashing-tf2053536.html
> - discussion on lock-less commits http://www.nabble.com/Lock-less-commits-tf2126935.html
>
> My test setup is: XP (SP1), Java 1.5 - both Sun and IBM SDKs.
> I noticed that the problem is more frequent when locks are created on one
> disk and the index on another. Both are NTFS with the Windows indexing
> service enabled. I suspect this indexing service might be related - keeping
> files busy for a while - but I don't know for sure.
>
> After experimenting with it I conclude that these problems - at least in my
> scenario - are due to a temporary situation: the FS, or the OS, is
> *temporarily* holding references to files or folders, preventing them from
> being renamed or deleted, and preventing new files from being created in
> certain directories.
>
> So I added retry logic to FSDirectory for cases where the error was related
> to "Access Denied". This is the same approach described in
> http://www.nabble.com/running-a-lucene-indexing-app-as-a-windows-service-on-xp%2C-crashing-tf2053536.html
> There, in addition to the retry, gc() is invoked (I did not call gc()).
> This is based on the *hope* that an access-denied situation would vanish
> after a small delay, and the retry would succeed.
>
> I modified FSDirectory this way for "Access Denied" errors while creating a
> new file or renaming a file.
>
> This worked fine for me. The performance test that failed before now
> manages to complete. There should be no performance implications due to
> this modification, because only the cases that would otherwise wrongly fail
> now delay for some extra milliseconds and retry.
>
> I am attaching a patch - FSDirectory_Retry_Logic.patch - that has these
> changes to FSDirectory.
>
> All "ant test" tests pass with this patch.
>
> Also attaching a test case that demonstrates the problem - at least on my
> machine. There are two test cases in that test file: one that works in the
> system temp directory (like most Lucene tests) and one that creates the
> index on a different disk. The latter case can only run if the path
> ("D:" , "tmp") is valid.
>
> It would be great if people who experienced these problems could try out
> this patch and comment on whether it made any difference for them.
>
> If it turns out to be useful for others as well, including this patch in
> the code might help to relieve some of those "frustration" user cases.
>
> A comment on the state of the proposed patch:
> - It is not "ready to deploy" code - it has some debug printing, showing
>   the cases where the "retry logic" actually took place.
> - I am not sure if the current 30ms is the right delay... why not 50ms?
>   10ms? This is currently defined by a constant.
> - Should a call to gc() be added? (I think not.)
> - Should the retry be attempted also on "non access-denied" exceptions?
>   (I think not.)
> - I feel it is somewhat "voodoo programming", but though I don't like it,
>   it seems to work...
>
> Attached files:
> 1. TestInterleavedAddAndRemoves.java - the LONG test that fails on XP
>    without the patch and passes with the patch.
> 2. FSDirectory_Retry_Logic.patch
> 3. Test_Output.txt - output of the test with the patch, on my XP. Only the
>    createNewFile() case had to be bypassed in this test, but for another
>    program I also saw the renameFile() being bypassed.
> - Doron
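
For readers without the attachments, the sleep-and-retry idea described in the issue can be sketched roughly as below. This is not the code from FSDirectory_Retry_Logic.patch: the class RetryingFileOps, the IoAction interface, the MAX_ATTEMPTS limit, and the detection of "access denied" by inspecting the exception message are all assumptions made for illustration. Only the 30 ms delay and the rule of retrying solely on access-denied errors come from the description above. The lambda syntax needs Java 8 or later, which is newer than the JDK 1.5 used in this thread.

```java
import java.io.File;
import java.io.IOException;
import java.util.Locale;

/** Hypothetical sketch of retry-on-access-denied; not the attached patch. */
final class RetryingFileOps {
    static final long RETRY_DELAY_MS = 30; // delay mentioned in the issue
    static final int MAX_ATTEMPTS = 5;     // assumed limit, not from the issue

    /** A file operation that may fail with an IOException. */
    interface IoAction {
        boolean run() throws IOException;
    }

    /** Assumption: the error is recognized by its message text. */
    static boolean isAccessDenied(IOException e) {
        String msg = e.getMessage();
        return msg != null && msg.toLowerCase(Locale.ROOT).contains("denied");
    }

    /** Runs the action, sleeping and retrying only on access-denied errors. */
    static boolean withRetry(IoAction action) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return action.run();
            } catch (IOException e) {
                if (!isAccessDenied(e)) {
                    throw e; // other failures are not retried
                }
                last = e;
                if (attempt < MAX_ATTEMPTS) {
                    try {
                        Thread.sleep(RETRY_DELAY_MS);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw e; // give up if the thread is interrupted
                    }
                }
            }
        }
        throw last; // every attempt failed with access denied
    }

    /** Example use: lock-file style creation with retry. */
    static boolean createNewFile(final File file) throws IOException {
        return withRetry(() -> file.createNewFile());
    }
}
```

Capping the number of attempts keeps a file that is held for a "long" period from stalling the caller indefinitely; the last access-denied exception is rethrown unchanged so the original error still reaches the user.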

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

