lucene-dev mailing list archives

From "Simon Willnauer (JIRA)" <>
Subject [jira] [Commented] (LUCENE-4547) DocValues field broken on large indexes
Date Tue, 20 Nov 2012 15:48:58 GMT


Simon Willnauer commented on LUCENE-4547:

Hey folks,

I thought about the in-RAM vs. on-disk distinction we have in DV at this point and how we distinguish between the two API-wise. IMO calling $TypeDocValues#newRAMInstance() with different behavior depending on whether the instance is already a RAM instance is kind of ugly API-wise, and a binary distinction here might not be sufficient anyway. From my point of view it would make the most sense to let the codec decide whether the values are in RAM, on disk, or only partially in memory, like in the sorted case where you might wanna use an FST holding a subset of the values. That said, giving control entirely to the codec might not be practical either. Think about merging: there you really don't want to load the values into memory, and you should be able to say so. We can do this already today by passing in an IOContext. Yet IOContext is the wrong level, since it's a reader-wide setting and might not hold for all fields when we open a reader for handling searches. Still, the idea behind IOContext is basically to pass information about the access pattern, where merge means sequential access. We might want something similar for DocValues that leaves most of the decisions to the codec, but if a user decides he really needs the values in memory he can still pass in something like AccessPattern.SEQUENTIAL and load them into an auxiliary datastructure. This would allow the codec to optimize under the hood while making no promises about RAM vs. disk when AccessPattern.DEFAULT is passed.
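A minimal Java sketch of what such a per-field hint could look like. To be clear, `AccessPattern`, `DocValuesSource`, and `open()` are hypothetical names for illustration only, not existing Lucene APIs:

```java
// Hypothetical API sketch -- AccessPattern, DocValuesSource, and open()
// do not exist in Lucene; they only illustrate the proposal above.
public class AccessPatternSketch {

  // A per-field hint, unlike the reader-wide IOContext.
  enum AccessPattern {
    DEFAULT,    // codec decides freely: on-disk, in-RAM, or a mix (e.g. an FST
                // over a subset of sorted values); no promise either way
    SEQUENTIAL  // caller will scan in order (e.g. a merge), so the codec
                // should not pull everything into memory up front
  }

  interface DocValuesSource {
    long getLong(int docID);
  }

  // A codec could branch on the hint like this:
  static DocValuesSource open(long[] onDisk, AccessPattern pattern) {
    if (pattern == AccessPattern.SEQUENTIAL) {
      // merge path: read straight from the "on-disk" values, no RAM copy
      return docID -> onDisk[docID];
    }
    // DEFAULT: the codec may build whatever representation it likes under
    // the hood; an in-RAM copy stands in for that choice here
    long[] ram = onDisk.clone();
    return docID -> ram[docID];
  }

  public static void main(String[] args) {
    long[] values = {7L, 8L, 9L};
    DocValuesSource merge = open(values, AccessPattern.SEQUENTIAL);
    System.out.println(merge.getLong(2)); // 9
  }
}
```

The point of the enum over a boolean is exactly the non-binary case above: DEFAULT is an absence of promises, not "on disk".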

> DocValues field broken on large indexes
> ---------------------------------------
>                 Key: LUCENE-4547
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>            Priority: Blocker
>             Fix For: 4.1
>         Attachments: test.patch
> I tried to write a test to sanity check LUCENE-4536 (first running against svn revision 1406416, before the change).
> But I found DocValues is already broken here for large indexes that have a PackedLongDocValues field:
> {code}
> final int numDocs = 500000000;
> for (int i = 0; i < numDocs; ++i) {
>   if (i == 0) {
>     field.setLongValue(0L); // force > 32bit deltas
>   } else {
>     field.setLongValue(1L << 33); // NOT 1<<33L: an int shift, masked to 1<<1 == 2
>   }
>   w.addDocument(doc);
> }
> w.forceMerge(1);
> w.close();
> dir.close(); // checkindex
> {code}
> {noformat}
> [junit4:junit4]   2> WARNING: Uncaught exception in thread: Thread[Lucene Merge Thread
> [junit4:junit4]   2> org.apache.lucene.index.MergePolicy$MergeException: java.lang.ArrayIndexOutOfBoundsException:
> [junit4:junit4]   2> 	at __randomizedtesting.SeedInfo.seed([5DC54DB14FA5979]:0)
> [junit4:junit4]   2> 	at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(
> [junit4:junit4]   2> 	at org.apache.lucene.index.ConcurrentMergeScheduler$
> [junit4:junit4]   2> Caused by: java.lang.ArrayIndexOutOfBoundsException: -65536
> [junit4:junit4]   2> 	at org.apache.lucene.util.ByteBlockPool.deref(
> [junit4:junit4]   2> 	at org.apache.lucene.codecs.lucene40.values.FixedStraightBytesImpl$FixedBytesWriterBase.set(
> [junit4:junit4]   2> 	at org.apache.lucene.codecs.lucene40.values.PackedIntValues$PackedIntsWriter.writePackedInts(
> [junit4:junit4]   2> 	at org.apache.lucene.codecs.lucene40.values.PackedIntValues$PackedIntsWriter.finish(
> [junit4:junit4]   2> 	at org.apache.lucene.codecs.DocValuesConsumer.merge(
> [junit4:junit4]   2> 	at org.apache.lucene.codecs.PerDocConsumer.merge(
> {noformat}
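Two plain-Java int pitfalls are worth spelling out here, since both are in play: the shift in the original test (`1<<33L` shifts an int, so the distance is masked to 5 bits), and the kind of int-offset overflow that plausibly produces a negative array index like the -65536 in the trace once 500M fixed-width values exceed 2GB of block-pool addressing. This is an illustration of the arithmetic, not Lucene's actual code:

```java
// Plain-Java demo of the two int pitfalls, not Lucene code.
public class OverflowDemo {
  public static void main(String[] args) {
    // Shift distance on an int is masked to 5 bits (JLS 15.19),
    // so 1 << 33 is really 1 << 1.
    System.out.println(1 << 33);    // 2, NOT 2^33
    System.out.println(1L << 33);   // 8589934592 == 2^33

    // An int*int product overflows before any widening to long:
    int numDocs = 500_000_000;      // as in the test above
    int bytesPerValue = 8;          // hypothetical fixed width
    System.out.println(numDocs * bytesPerValue);        // negative: wrapped around
    System.out.println((long) numDocs * bytesPerValue); // 4000000000
  }
}
```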

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
