lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: docid is just a signed int32
Date Fri, 19 Aug 2016 03:50:41 GMT
OK, I'm a little out of my league here, but I'll plow on anyway....

bq: There are use cases out there where >2^31 does make sense in a single index

OK, let's pin this down and define the use case specifically rather than
leave it vague. For instance, I've just run an experiment with 200M docs
(very small ones) in a single shard and tried to sort all of them by a
date field. Response time was on the order of 5 seconds. If that scales
linearly, 3B docs is what, 75 seconds? Does the use case involve sorting?
Faceting? If so, performance will probably be poor.
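
To make the back-of-the-envelope arithmetic explicit, here is a minimal sketch.
It assumes sort time grows linearly with document count, which is only a rough
guess and not something the experiment verified; the class and method names are
made up for illustration.

```java
public class SortTimeEstimate {

    // Linear extrapolation: assumes response time is proportional to the
    // number of documents. This is an assumption, not a measured property.
    static double estimateSeconds(long measuredDocs, double measuredSeconds, long targetDocs) {
        return measuredSeconds * ((double) targetDocs / measuredDocs);
    }

    public static void main(String[] args) {
        long measuredDocs = 200_000_000L;   // 200M docs in the experiment
        double measuredSeconds = 5.0;       // observed sort time, roughly
        long targetDocs = 3_000_000_000L;   // 3B docs, beyond the int doc ID range

        // 3B / 200M = 15, so 5 s * 15 = 75 s
        System.out.println(estimateSeconds(measuredDocs, measuredSeconds, targetDocs)); // prints 75.0

        // 3B does not fit in a signed 32-bit int (max 2^31 - 1 = 2147483647)
        System.out.println(targetDocs > Integer.MAX_VALUE); // prints true
    }
}
```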

This would be huge surgery, I believe, and there hasn't been a compelling
use case for it in the search world. Unless and until that case is made, I
suspect this idea will meet a lot of resistance.

That said, I do understand that this is somewhat akin to "Nobody will ever
need more than 640K of RAM", meaning that some limits are assumed and
eventually become outmoded. But given Java's issues with memory and GC, I
suspect it'll be really hard to justify the work this would take.

FWIW,
Erick


On Thu, Aug 18, 2016 at 6:31 PM, Trejkaz <trejkaz@trypticon.org> wrote:
> On Thu, Aug 18, 2016 at 11:55 PM, Adrien Grand <jpountz@gmail.com> wrote:
>> No, IndexWriter enforces that the number of documents cannot go over
>> IndexWriter.MAX_DOCS (which is a bit less than 2^31) and
>> BaseCompositeReader computes the number of documents in a long variable and
>> ensures it is less than 2^31, so you cannot have indexes that contain more
>> than 2^31 documents.
>>
>> Larger collections should be written to multiple shards and use
>> TopDocs.merge to merge results.
>
> But hang on:
> * TopDocs#merge still returns a TopDocs.
> * TopDocs still uses an array of ScoreDoc.
> * ScoreDoc still uses an int doc ID.
>
> Looks like you're still screwed.
>
> I wish IndexReader would use long IDs too, because one IndexReader can
> be across multiple shards too - it doesn't make much sense to me that
> this is restricted, although "it's hard to fix in a
> backwards-compatible way" is certainly a good reason. :D
>
> TX
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
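
On the ScoreDoc point quoted above: as far as I know, ScoreDoc also has a
shardIndex field that TopDocs.merge fills in, so a merged hit can be read as a
(shard, per-shard doc ID) pair even though each doc ID stays an int. The sketch
below shows one way an application might pack such a pair into a long. It is
plain Java with no Lucene dependency, and ShardDocKey with its methods is a
hypothetical helper, not part of Lucene's API.

```java
public class ShardDocKey {

    // Pack a shard index and a per-shard doc ID into one 64-bit key.
    // Hypothetical helper for illustration; Lucene does not provide this.
    static long pack(int shardIndex, int docId) {
        if (shardIndex < 0 || docId < 0) {
            throw new IllegalArgumentException("shardIndex and docId must be non-negative");
        }
        return ((long) shardIndex << 32) | (docId & 0xFFFFFFFFL);
    }

    static int shardIndex(long key) {
        return (int) (key >>> 32);
    }

    static int docId(long key) {
        return (int) key;
    }

    // Sum per-shard document counts in a long so the total cannot overflow an int.
    static long totalDocs(int[] perShardDocCounts) {
        long total = 0;
        for (int count : perShardDocCounts) {
            total += count;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] counts = {2_000_000_000, 2_000_000_000}; // two shards, each under the int limit
        System.out.println(totalDocs(counts) > Integer.MAX_VALUE); // prints true

        long key = pack(1, 1_999_999_999); // last doc of the second shard
        System.out.println(shardIndex(key) + ":" + docId(key)); // prints 1:1999999999
    }
}
```

Each shard stays under the per-index limit while the application-level key
covers the whole collection. Anything that expects a single global int doc ID
would still need reworking.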
