lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] [Commented] (LUCENE-4547) DocValues field broken on large indexes
Date Thu, 15 Nov 2012 21:35:12 GMT


Robert Muir commented on LUCENE-4547:

These are hard questions. My personal goal for this prototype (currently SimpleText only!) was to:

1. Make merging use (significantly) less RAM, to fix this bug.
2. Make it easier to write docvalues codecs, to encourage innovation (e.g. FST impls, etc.).
3. Simplify the types to make it easier on the user.

I think the consumer API is simpler (part of #2), but I would like to simplify the producer API
too in the future. I'm not sure we should do that here, though; in any case, we can think about
the issues you raised one by one and handle them separately in their own issues.

fix other issues such as LUCENE-3862?

It's my opinion we should do this sooner rather than later.

merge the FieldCache / FunctionValues / DocValues.Source APIs?

This really needs to be addressed, but I think not here. It's horrific that algorithms like
grouping, sorting, and maybe faceting have to be duplicated across two different things
(FieldCache and DocValues).
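To make the duplication concrete, here is a hypothetical sketch (all names are invented, none of this is a real Lucene API) of what unifying these could look like: one random-access interface that a FieldCache-style array, a DocValues Source, or a FunctionValues impl could each satisfy, so a sort or grouping routine is written once:

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch only: a single per-document long accessor that both
// FieldCache-style arrays and DocValues sources could implement, so
// sorting/grouping code exists in one place instead of two.
public class UnifiedValues {
    interface LongValues {            // the shared abstraction (invented name)
        long get(int docID);
    }

    // FieldCache-style backing: a plain in-heap array
    static LongValues fromArray(long[] arr) {
        return docID -> arr[docID];
    }

    // One sort routine serves any backing; returns docIDs ordered by value.
    static Integer[] sortDocsByValue(int maxDoc, LongValues values) {
        Integer[] docs = new Integer[maxDoc];
        for (int i = 0; i < maxDoc; i++) docs[i] = i;
        Arrays.sort(docs, Comparator.comparingLong(values::get));
        return docs;
    }

    public static void main(String[] args) {
        LongValues v = fromArray(new long[]{30L, 10L, 20L});
        // docs sorted by their values 30, 10, 20
        System.out.println(Arrays.toString(sortDocsByValue(3, v))); // [1, 2, 0]
    }
}
```

A DocValues-backed impl would wrap its Source the same way `fromArray` wraps the array, and the sort routine would not change.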

are you going to remove DocValues.Type.FLOAT_*?

I think the 3 types we have here are enough. Someone can build a float or double type "on top
of" the "number" type we have. Lucene is already doing this today: look at norms. I think
Lucene should just have a number type that stores bits.
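The "number type that stores bits" idea can be illustrated in plain JDK code: a double round-trips losslessly through a long via its IEEE-754 bit pattern. This is a sketch of the storage side only (a sortable encoding would additionally need the usual sign-bit transform); nothing here is actual codec code:

```java
// Sketch: a "double" docvalues type layered on a plain long ("number") type.
// Pure JDK; method names are illustrative, not Lucene APIs.
public class DoubleOnLong {
    // store: the raw IEEE-754 bit pattern fits losslessly in a long
    static long toBits(double value) {
        return Double.doubleToLongBits(value);
    }

    // load: exact round-trip back to double
    static double fromBits(long bits) {
        return Double.longBitsToDouble(bits);
    }

    public static void main(String[] args) {
        double[] samples = {0.0, -1.5, Math.PI, Double.MAX_VALUE};
        for (double d : samples) {
            long stored = toBits(d);
            System.out.println(d + " -> " + stored + " -> " + fromBits(stored));
        }
    }
}
```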

are SimpleDVConsumer and SimpleDocValuesFormat going to replace PerDocConsumer and DocValuesFormat?

This is the idea; once we are happy with the APIs, we would implement the 4.0 ones with these.

are you going to remove hasArray/getArray?

I don't care much about this. I'm unsure whether similarity impls should be calling it, though;
at the very least it would be better for them to fall back. I just can't bring myself to fix
that until LUCENE-3862 is fixed :)

will there still be a direct=true|false option at load-time or will it depend on the format
impl (potentially with a PerFieldPerDocProducer similarly to the postings formats)?

I don't want to change this in the branch. Personally, I feel a codec/SegmentReader/etc. should
generally manage only the direct form, with the producer exposing the same "stats" (minimum,
maximum, fixed, whatever) that the consumer APIs get (which will also make merging more
efficient!). The default Source impl can be something nice, e.g. read the direct impl into
packed ints, and so on. A codec could override that to, e.g., just slurp in its on-disk packed
ints directly. So the codec still has control of the in-memory RAM representation, which I
think is important. But I think the codec and SegmentReader should somehow not be in control
of caching: that should live elsewhere.
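A minimal sketch of that default-Source idea, with invented names (nothing here is a real Lucene class): a disk-backed "direct" accessor is slurped once into an in-heap structure at load time, and a codec could override the bulk load with something smarter:

```java
// Hedged sketch of the load-time idea above: a "direct" producer answers each
// call from disk, while a default in-memory Source eagerly copies it into an
// in-heap structure once (a long[] stands in for packed ints here).
// All names are illustrative only, not actual Lucene classes.
public class SourceSketch {
    interface DirectSource {             // disk-backed, random access (invented)
        long get(int docID);
        int maxDoc();
    }

    // default in-memory Source: slurp the direct impl once at load time;
    // a codec could override this with a bulk read of its on-disk packed ints
    static long[] loadIntoMemory(DirectSource direct) {
        long[] cache = new long[direct.maxDoc()];
        for (int i = 0; i < cache.length; i++) {
            cache[i] = direct.get(i);
        }
        return cache;
    }

    public static void main(String[] args) {
        DirectSource fake = new DirectSource() {
            public long get(int docID) { return docID * 7L; }
            public int maxDoc() { return 4; }
        };
        System.out.println(java.util.Arrays.toString(loadIntoMemory(fake))); // [0, 7, 14, 21]
    }
}
```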

> DocValues field broken on large indexes
> ---------------------------------------
>                 Key: LUCENE-4547
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>            Priority: Blocker
>             Fix For: 4.1
>         Attachments: test.patch
> I tried to write a test to sanity check LUCENE-4536 (first running against svn revision 1406416, before the change).
> But I found docvalues is already broken here for large indexes that have a PackedLongDocValues field:
> {code}
> final int numDocs = 500000000;
> for (int i = 0; i < numDocs; ++i) {
>   if (i == 0) {
>     field.setLongValue(0L); // force > 32bit deltas
>   } else {
>     field.setLongValue(1L << 33); // 1L, not 1: an int shift count wraps mod 32
>   }
>   w.addDocument(doc);
> }
> w.forceMerge(1);
> w.close();
> dir.close(); // checkindex
> {code}
> {noformat}
> [junit4:junit4]   2> WARNING: Uncaught exception in thread: Thread[Lucene Merge Thread
> [junit4:junit4]   2> org.apache.lucene.index.MergePolicy$MergeException: java.lang.ArrayIndexOutOfBoundsException:
> [junit4:junit4]   2> 	at __randomizedtesting.SeedInfo.seed([5DC54DB14FA5979]:0)
> [junit4:junit4]   2> 	at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(
> [junit4:junit4]   2> 	at org.apache.lucene.index.ConcurrentMergeScheduler$
> [junit4:junit4]   2> Caused by: java.lang.ArrayIndexOutOfBoundsException: -65536
> [junit4:junit4]   2> 	at org.apache.lucene.util.ByteBlockPool.deref(
> [junit4:junit4]   2> 	at org.apache.lucene.codecs.lucene40.values.FixedStraightBytesImpl$FixedBytesWriterBase.set(
> [junit4:junit4]   2> 	at org.apache.lucene.codecs.lucene40.values.PackedIntValues$PackedIntsWriter.writePackedInts(
> [junit4:junit4]   2> 	at org.apache.lucene.codecs.lucene40.values.PackedIntValues$PackedIntsWriter.finish(
> [junit4:junit4]   2> 	at org.apache.lucene.codecs.DocValuesConsumer.merge(
> [junit4:junit4]   2> 	at org.apache.lucene.codecs.PerDocConsumer.merge(
> {noformat}
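One Java subtlety worth noting about the quoted test: with an int left operand, shift counts are taken mod 32, so an expression like `1 << 33` evaluates to 2 rather than 2^33; the left operand must be a long (`1L << 33`) to actually produce a value wider than 32 bits and force large deltas:

```java
// Java shift semantics: for an int left operand only the low 5 bits of the
// shift count are used (33 % 32 == 1); for a long left operand the low 6 bits
// are used, so shifting by 33 really widens past 32 bits.
public class ShiftDemo {
    public static void main(String[] args) {
        System.out.println(1 << 33L);   // int shift: prints 2
        System.out.println(1L << 33);   // long shift: prints 8589934592
    }
}
```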

This message is automatically generated by JIRA.