From: "Nico Krijnen (JIRA)"
To: dev@lucene.apache.org
Date: Mon, 1 Nov 2010 06:53:24 -0400 (EDT)
Subject: [jira] Issue Comment Edited: (LUCENE-2729) Index corruption after 'read past EOF' under heavy update load and snapshot export

    [ https://issues.apache.org/jira/browse/LUCENE-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12926908#action_12926908 ]

Nico Krijnen edited comment on LUCENE-2729 at 11/1/10 6:52 AM:
---------------------------------------------------------------

bq. Was there no original root cause here? Eg disk full?

This was one of the first things I thought of, but the disk has more than enough free space: 200 GB. Also, for this test we write the backup to a different disk - both for better performance and to prevent the disk holding the index from running out of free space.

bq. Or is the "read past EOF" on closing an IndexReader w/ pending deletes really the first exception you see?

It is the first exception we see. We turned on quite a bit of additional logging, but we have not been able to find anything weird happening before this error. I do expect something weird must have happened to cause the 'read past EOF'. Do you have any clues as to what we could look for? That might narrow the search. We are able to reproduce this consistently in our test environment, so if you can suggest specific debug logging to turn on, we can do another test run.

bq. Does zoie somehow touch the index files?

We'll try to find out. As far as I can see, the basic backup procedure is to grab the last 'commit snapshot', prevent it from being deleted (ZoieIndexDeletionPolicy), and write all the files mentioned in the commit snapshot to a NIO WritableByteChannel (proj.zoie.impl.indexing.internal.DiskIndexSnapshot#writeTo). We call proj.zoie.impl.indexing.ZoieSystem.exportSnapshot(WritableByteChannel) ourselves.
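For reference, a minimal sketch of how we invoke the export (simplified; the backup path, channel setup, and error handling here are illustrative, not our actual backup code):

{code}
import java.io.FileOutputStream;
import java.nio.channels.WritableByteChannel;

import proj.zoie.impl.indexing.ZoieSystem;

// Simplified sketch: stream the files of the last commit snapshot to a backup
// file on another disk. 'zoieSystem' is the already-running ZoieSystem instance
// (raw type used for brevity); the backup path is just an example.
void exportBackup(ZoieSystem zoieSystem, String backupPath) throws Exception {
    FileOutputStream out = new FileOutputStream(backupPath);
    WritableByteChannel channel = out.getChannel();
    try {
        // Writes every file referenced by the retained commit point to the channel.
        zoieSystem.exportSnapshot(channel);
    } finally {
        channel.close();
        out.close();
    }
}
{code}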
> Index corruption after 'read past EOF' under heavy update load and snapshot export
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-2729
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2729
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 3.0.1, 3.0.2
>         Environment: Happens on both OS X 10.6 and Windows 2008 Server. Integrated with zoie (using a zoie snapshot from 2010-08-06: zoie-2.0.0-snapshot-20100806.jar).
>            Reporter: Nico Krijnen
>
> We have a system running lucene and zoie. We use lucene as a content store for a CMS/DAM system. We use the hot-backup feature of zoie to make scheduled backups of the index. This works fine for small indexes and when there are not many changes to the index while the backup is made.
> On large indexes (about 5 GB to 19 GB), when a backup is made while the index is being changed a lot (lots of document additions and/or deletions), we almost always get a 'read past EOF' at some point, followed by lots of 'Lock obtain timed out'.
> At that point we get lots of 0 kb files in the index, data gets lost, and the index is unusable.
> When we stop our server, remove the 0 kb files and restart our server, the index is operational again, but data has been lost.
> I'm not sure if this is a zoie or a lucene issue, so I'm posting it to both. Hopefully someone has some ideas where to look to fix this.
> Some more details...
> Stack trace of the read past EOF and following Lock obtain timed out:
> {code}
> 78307 [proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader@31ca5085] ERROR proj.zoie.impl.indexing.internal.BaseSearchIndex - read past EOF
> java.io.IOException: read past EOF
>     at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:154)
>     at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39)
>     at org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:37)
>     at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:69)
>     at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:245)
>     at org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:166)
>     at org.apache.lucene.index.DirectoryReader.doCommit(DirectoryReader.java:725)
>     at org.apache.lucene.index.IndexReader.commit(IndexReader.java:987)
>     at org.apache.lucene.index.IndexReader.commit(IndexReader.java:973)
>     at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:162)
>     at org.apache.lucene.index.IndexReader.close(IndexReader.java:1003)
>     at proj.zoie.impl.indexing.internal.BaseSearchIndex.deleteDocs(BaseSearchIndex.java:203)
>     at proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:223)
>     at proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153)
>     at proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134)
>     at proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171)
>     at proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:373)
>
> 579336 [proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader@31ca5085] ERROR proj.zoie.impl.indexing.internal.LuceneIndexDataLoader -
> Problem copying segments: Lock obtain timed out: org.apache.lucene.store.SingleInstanceLock@5ad0b895: write.lock
> org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: org.apache.lucene.store.SingleInstanceLock@5ad0b895: write.lock
>     at org.apache.lucene.store.Lock.obtain(Lock.java:84)
>     at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1060)
>     at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:957)
>     at proj.zoie.impl.indexing.internal.DiskSearchIndex.openIndexWriter(DiskSearchIndex.java:176)
>     at proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:228)
>     at proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153)
>     at proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134)
>     at proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171)
>     at proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:373)
> {code}
> We get exactly the same behaviour on both OS X and on Windows. On both, zoie is using a SimpleFSDirectory.
> We also use a SingleInstanceLockFactory (since our process is the only one working with the index), but we get the same behaviour with a NativeFSLock.
> The snapshot backup is being made by calling:
> *proj.zoie.impl.indexing.ZoieSystem.exportSnapshot(WritableByteChannel)*
> Same issue in zoie JIRA:
> http://snaprojects.jira.com/browse/ZOIE-51
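For completeness, the directory and lock factory mentioned above are wired up roughly like this on our side (a minimal sketch against the Lucene 3.0 API; the analyzer and the way the index path is obtained are placeholders, not our actual configuration):

{code}
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.store.SingleInstanceLockFactory;
import org.apache.lucene.util.Version;

// Minimal sketch of opening the index with a SimpleFSDirectory and a
// SingleInstanceLockFactory (locking is only enforced within this JVM,
// which is fine because only our process works with the index).
IndexWriter openWriter(File indexDir) throws Exception {
    SimpleFSDirectory directory =
        new SimpleFSDirectory(indexDir, new SingleInstanceLockFactory());
    return new IndexWriter(directory,
                           new StandardAnalyzer(Version.LUCENE_30),
                           IndexWriter.MaxFieldLength.UNLIMITED);
}
{code}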