lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2729) Index corruption after 'read past EOF' under heavy update load and snapshot export
Date Tue, 09 Nov 2010 14:20:09 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930130#action_12930130
] 

Michael McCandless commented on LUCENE-2729:
--------------------------------------------


Thank you for attaching the IW infoStream output!  Sorry it took so
long for me to respond.

Aside: it is sad but there is no master TODO list in open source.  It
all comes down to our own email inboxes, todo lists, post-it notes all
over the place, etc., and (in my case anyway) things sometimes fall
past the event horizon.

So please if I don't respond in a day or two on an active issue, bump
it again (put a comment on the issue)!  I'd much rather people
over-nag than under-nag but unfortunately under-nag is far far more
common and it causes important issues to languish unnecessarily.

OK back to the issue :)

I looked through the infoStream but I don't see a smoking gun.  Ie,
the logs indicate that nowhere did Lucene try to delete/overwrite
those zero-length files; I see other files being deleted, so, this is
what I'd expect given that ZoieDeletionPolicy is presumably protecting
the segments_3t commit point (to back up its files).

I do see some spooky "other" exceptions, though... these are the first
2 exceptions I see in the log:

{noformat}
14:27:41,290 [bigIndexBuilder_QueueProcessor_3] WARN  com.ds.acm.logic.impl.AssetManagerImpl -
  Ignoring AssetNotFoundException trying to make sure all metadata from index is loaded before updating an existing asset
Exception in thread "pool-5-thread-6" java.lang.NullPointerException
	at org.apache.coyote.http11.InternalNioOutputBuffer.writeToSocket(InternalNioOutputBuffer.java:430)
	at org.apache.coyote.http11.InternalNioOutputBuffer.flushBuffer(InternalNioOutputBuffer.java:784)
	at org.apache.coyote.http11.InternalNioOutputBuffer.flush(InternalNioOutputBuffer.java:300)
	at org.apache.coyote.http11.Http11NioProcessor.action(Http11NioProcessor.java:1060)
	at org.apache.coyote.Response.action(Response.java:183)
	at org.apache.catalina.connector.OutputBuffer.doFlush(OutputBuffer.java:314)
	at org.apache.catalina.connector.OutputBuffer.flush(OutputBuffer.java:288)
	at org.apache.catalina.connector.Response.flushBuffer(Response.java:548)
	at org.apache.catalina.connector.ResponseFacade.flushBuffer(ResponseFacade.java:279)
	at org.granite.gravity.AbstractChannel.runReceived(AbstractChannel.java:251)
	at org.granite.gravity.AbstractChannel.runReceive(AbstractChannel.java:199)
	at org.granite.gravity.AsyncReceiver.doRun(AsyncReceiver.java:34)
	at org.granite.gravity.AsyncChannelRunner.run(AsyncChannelRunner.java:52)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:680)
{noformat}

and

{noformat}
14:40:18,382 [Low Memory Detector] WARN  com.ds.acm.engine.search.zoieimpl.core.ZoieSystemManager -
  Ignoring timeout while attempting to flush zoie memory index to disk to free memory
proj.zoie.api.ZoieException: sync timed out
	at proj.zoie.impl.indexing.AsyncDataConsumer.syncWthVersion(AsyncDataConsumer.java:177)
	at proj.zoie.impl.indexing.AsyncDataConsumer.flushEvents(AsyncDataConsumer.java:155)
	at proj.zoie.impl.indexing.ZoieSystem.flushEvents(ZoieSystem.java:308)
	at com.ds.acm.engine.search.zoieimpl.core.ZoieSystemManager.onLowMemory(ZoieSystemManager.java:220)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at com.ds.util.event.BasicEventBroadcaster$Handler.invokeMethod(BasicEventBroadcaster.java:197)
	at com.ds.util.event.BasicEventBroadcaster$Handler.handle(BasicEventBroadcaster.java:190)
	at com.ds.util.event.BasicEventBroadcaster.fire(BasicEventBroadcaster.java:108)
	at com.ds.util.cache.LowMemoryWarningBroadcaster$1.handleNotification(LowMemoryWarningBroadcaster.java:135)
	at sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:138)
	at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171)
	at sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272)
	at sun.management.Sensor.trigger(Sensor.java:120)
{noformat}

That 2nd exception happens a total of 9 times... and is rather
spooky.  What does it mean?  Ie, why is Zoie timing out on flushing
the index to disk, and, what does it then do w/ its RAMDir?

I also see a lot of these:

{noformat}
15:50:18,856 [bigIndexBuilder_QueueProcessor_10] WARN  com.ds.acm.logic.impl.AssetManagerImpl -
  Ignoring AssetNotFoundException trying to make sure all metadata from index is loaded before updating an existing asset
{noformat}

Would it be possible to instrument the Zoie code so that, as the backup
process copies each file in the snapshot, it prints a listing of the
directory at that point?
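As a rough illustration, instrumentation along these lines could be dropped into the backup copy loop (pure java.io; the class and method names are hypothetical, not actual Zoie API):

```java
import java.io.File;

// Hypothetical helper for the Zoie backup loop: before (or after) copying
// each snapshot file, dump the index directory's state so the log shows
// exactly when the zero-length files appear.
public class SnapshotDirLogger {

    /** One line per file: name, length and lastModified, tab-separated. */
    public static String listDirectory(File dir) {
        StringBuilder sb = new StringBuilder();
        File[] files = dir.listFiles();
        if (files == null) {
            return "(not a directory: " + dir + ")\n";
        }
        for (File f : files) {
            sb.append(f.getName())
              .append('\t').append(f.length())
              .append('\t').append(f.lastModified())
              .append('\n');
        }
        return sb.toString();
    }

    /** Call from the copy loop, once per file being backed up. */
    public static void logCopy(String fileName, File indexDir) {
        System.out.println("copying " + fileName + "; directory listing:");
        System.out.print(listDirectory(indexDir));
    }
}
```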

Also, can you write to the log when Zoie applies deletes?  (Looks like
it happens in proj.zoie.impl.indexing.internal.BaseSearchIndex.deleteDocs).
It's on applying deletes that the corruption is first detected, so, if
we log this event we can better bracket the period of time when the
corruption happened.
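For instance, a minimal timestamped line inside deleteDocs would be enough (the class, method names and format here are illustrative only, not actual Zoie code):

```java
// Hypothetical snippet for BaseSearchIndex.deleteDocs: emit a timestamped
// line whenever deletes are applied, so the event can be correlated with
// the first 'read past EOF' in the infoStream.
public class DeleteLogging {

    /** Formats the log line; kept separate so it is easy to test. */
    public static String deleteLogLine(long timestamp, int numDeletes, String indexDir) {
        return timestamp + " applying " + numDeletes + " deletes in " + indexDir;
    }

    public static void logDeletes(int numDeletes, String indexDir) {
        System.out.println(deleteLogLine(System.currentTimeMillis(), numDeletes, indexDir));
    }
}
```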

Does Zoie ever open an IndexReader or IndexWriter passing in an
existing commit point?  Or does it always open the latest commit?

The timestamps on the zero length files are particularly spooky -- the
earliest ones are 15:21 (when first EOF is hit), but then also 15:47
and 15:49 on the others.  It seems like on 3 separate occasions
something truncated the files.
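A quick way to confirm that pattern on a live copy of the index is to scan for the zero-length files and print their modification times (a standalone diagnostic sketch, not part of Zoie or Lucene):

```java
import java.io.File;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// Standalone diagnostic: list every zero-length file in an index directory
// with its lastModified time, to see whether the truncations cluster
// around a few distinct moments (here: 15:21, 15:47 and 15:49).
public class ZeroLengthScan {

    public static List<String> findZeroLength(File dir) {
        List<String> hits = new ArrayList<String>();
        File[] files = dir.listFiles();
        if (files == null) {
            return hits;
        }
        SimpleDateFormat fmt = new SimpleDateFormat("HH:mm:ss");
        for (File f : files) {
            if (f.isFile() && f.length() == 0) {
                hits.add(f.getName() + " modified " + fmt.format(new Date(f.lastModified())));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        for (String hit : findZeroLength(new File(args[0]))) {
            System.out.println(hit);
        }
    }
}
```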


> Index corruption after 'read past EOF' under heavy update load and snapshot export
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-2729
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2729
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 3.0.1, 3.0.2
>         Environment: Happens on both OS X 10.6 and Windows 2008 Server. Integrated with zoie (using a zoie snapshot from 2010-08-06: zoie-2.0.0-snapshot-20100806.jar).
>            Reporter: Nico Krijnen
>         Attachments: 2010-11-02 IndexWriter infoStream log.zip, LUCENE-2729-test1.patch
>
>
> We have a system running lucene and zoie. We use lucene as a content store for a CMS/DAM system. We use the hot-backup feature of zoie to make scheduled backups of the index. This works fine for small indexes and when there are not a lot of changes to the index when the backup is made.
> On large indexes (about 5 GB to 19 GB), when a backup is made while the index is being changed a lot (lots of document additions and/or deletions), we almost always get a 'read past EOF' at some point, followed by lots of 'Lock obtain timed out'.
> At that point we get lots of 0 kb files in the index, data gets lost, and the index is unusable.
> When we stop our server, remove the 0 kb files and restart our server, the index is operational again, but data has been lost.
> I'm not sure if this is a zoie or a lucene issue, so I'm posting it to both. Hopefully someone has some ideas where to look to fix this.
> Some more details...
> Stack trace of the read past EOF and following Lock obtain timed out:
> {code}
> 78307 [proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader@31ca5085] ERROR proj.zoie.impl.indexing.internal.BaseSearchIndex - read past EOF
> java.io.IOException: read past EOF
>     at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:154)
>     at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39)
>     at org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:37)
>     at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:69)
>     at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:245)
>     at org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:166)
>     at org.apache.lucene.index.DirectoryReader.doCommit(DirectoryReader.java:725)
>     at org.apache.lucene.index.IndexReader.commit(IndexReader.java:987)
>     at org.apache.lucene.index.IndexReader.commit(IndexReader.java:973)
>     at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:162)
>     at org.apache.lucene.index.IndexReader.close(IndexReader.java:1003)
>     at proj.zoie.impl.indexing.internal.BaseSearchIndex.deleteDocs(BaseSearchIndex.java:203)
>     at proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:223)
>     at proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153)
>     at proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134)
>     at proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171)
>     at proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:373)
> 579336 [proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader@31ca5085] ERROR proj.zoie.impl.indexing.internal.LuceneIndexDataLoader -
>     Problem copying segments: Lock obtain timed out: org.apache.lucene.store.SingleInstanceLock@5ad0b895: write.lock
> org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: org.apache.lucene.store.SingleInstanceLock@5ad0b895: write.lock
>     at org.apache.lucene.store.Lock.obtain(Lock.java:84)
>     at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1060)
>     at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:957)
>     at proj.zoie.impl.indexing.internal.DiskSearchIndex.openIndexWriter(DiskSearchIndex.java:176)
>     at proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:228)
>     at proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153)
>     at proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134)
>     at proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171)
>     at proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:373)
> {code}
> We get exactly the same behaviour on both OS X and on Windows. On both, zoie is using a SimpleFSDirectory.
> We also use a SingleInstanceLockFactory (since our process is the only one working with the index), but we get the same behaviour with a NativeFSLock.
> The snapshot backup is being made by calling:
> *proj.zoie.impl.indexing.ZoieSystem.exportSnapshot(WritableByteChannel)*
> Same issue in zoie JIRA:
> http://snaprojects.jira.com/browse/ZOIE-51

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

