lucene-java-user mailing list archives

From "Zhang, Lisheng" <Lisheng.Zh...@BroadVision.com>
Subject RE: How to handle more than Integer.MAX_VALUE documents?
Date Wed, 03 Nov 2010 06:09:22 GMT
Hi,

Thanks very much for your help! 

Your point is well taken and it may cover most use cases, but it seems 
to me that in principle the limit is not just a per-segment one: suppose
one index has 3 segments, each holding close to 2^31-1 docs. If I then
need to loop through most docs across all three segments, wouldn't we
still have a problem?
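(To make the concern concrete, here is a minimal, self-contained Java sketch by way of illustration; it is not Lucene code, just the arithmetic showing why int math breaks down once per-segment doc counts approach 2^31-1:)

```java
// Editor's illustration (not Lucene code): summing three per-segment doc
// counts that each approach 2^31-1 wraps around in int arithmetic, so any
// global doc id or total computed with ints is wrong at that scale.
public class DocCountOverflow {
    public static void main(String[] args) {
        int perSegment = Integer.MAX_VALUE;     // ~2^31-1 docs per segment
        int intTotal = perSegment * 3;          // wraps: 2147483645, not 6442450941
        long longTotal = (long) perSegment * 3; // correct with 64-bit math
        System.out.println(intTotal);           // prints 2147483645
        System.out.println(longTotal);          // prints 6442450941
    }
}
```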

The use case is a rare one: a user searches for a word that appears in 
most docs, we paginate the results, and the user somehow wants just the 
last few pages (the lowest-ranked hits). Then we have to call search 
with a very large nDocs, which may go beyond Integer.MAX_VALUE.
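(A hedged sketch of the arithmetic behind that pagination case; the variable names here are illustrative, not Lucene API. The requested hit count is pageSize * page, and even with modest factors the product can pass Integer.MAX_VALUE, so it has to be computed as a long:)

```java
// Illustrative only: deep pagination can demand more hits than an int holds.
public class PaginationLimit {
    public static void main(String[] args) {
        int pageSize = 100;
        int lastPage = 30000000;                 // user jumps to the last pages
        long nDocs = (long) pageSize * lastPage; // 3,000,000,000 requested hits
        // More hits than an int nDocs argument to search() could express:
        System.out.println(nDocs > Integer.MAX_VALUE); // prints true
    }
}
```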

Best regards, Lisheng

-----Original Message-----
From: Lance Norskog [mailto:goksron@gmail.com]
Sent: Tuesday, November 02, 2010 7:00 PM
To: java-user@lucene.apache.org; simon.willnauer@gmail.com
Subject: Re: How to handle more than Integer.MAX_VALUE documents?


You would have to control your MergePolicy so it doesn't collapse
everything back to one segment.
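(For reference, a hedged configuration fragment against the Lucene 2.x/3.0-era API; the method name is from memory and may differ in your version. Capping the merged segment's document count keeps the merge policy from ever collapsing the index into one giant segment:)

```java
// Sketch, assuming a Lucene 2.x/3.0-era IndexWriter named "writer" already
// exists: never merge segments into one holding more than ~1B documents,
// so the index stays split across several segments below the 2^31-1 limit.
writer.setMaxMergeDocs(1000000000);
```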

On Tue, Nov 2, 2010 at 12:03 PM, Simon Willnauer
<simon.willnauer@googlemail.com> wrote:
> On Tue, Nov 2, 2010 at 1:58 AM, Lance Norskog <goksron@gmail.com> wrote:
>> 2 billion is a hard limit. Usually people split an index into multiple
>> indexes long before this, and use a parallel multi-reader (I think) to
>> read from all of the sub-indexes.
>>
>> On Mon, Nov 1, 2010 at 2:16 PM, Zhang, Lisheng
>> <Lisheng.Zhang@broadvision.com> wrote:
>>>
>>> Hi,
>>>
>>> Lucene uses an integer as the document id, so does that mean we cannot
>>> have more than 2^31-1 documents within one collection? Even with
>>> MultiSearcher the document id is still an integer, so it seems this is
>>> still a problem?
>
> This is really a per-segment limit. I think you can write your own
> collector and collect documents whose (absolute) doc ids are higher
> than Integer.MAX_VALUE. Still, if you reach Integer.MAX_VALUE documents
> you should really rethink the way your search works and apply some
> sharding techniques. I haven't worked with that many docs in a single
> index myself, but I think it should work to have multiple segments of
> Integer.MAX_VALUE documents each, since we search at the segment level,
> provided your collector supports it.
>
> simon
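(A minimal, self-contained sketch of the idea Simon describes. The collector below is hypothetical and not Lucene's actual Collector API: each segment reports its hits with int-sized segment-local doc ids, and the collector adds a long per-segment base, so the combined id space can exceed Integer.MAX_VALUE:)

```java
// Hypothetical sketch (not Lucene's Collector API): addresses more than
// Integer.MAX_VALUE documents by combining an int segment-local doc id
// with a long per-segment base offset.
import java.util.ArrayList;
import java.util.List;

public class LongDocIdCollector {
    private long docBase;                         // sum of doc counts of earlier segments
    private final List<Long> hits = new ArrayList<Long>();

    // Called once per segment, analogous to Collector.setNextReader(reader, docBase),
    // except the base is tracked as a long instead of an int.
    public void setNextSegment(long base) {
        this.docBase = base;
    }

    // Called for each matching doc with its segment-local (int) doc id.
    public void collect(int segmentDocId) {
        hits.add(docBase + segmentDocId);         // long arithmetic, no overflow
    }

    public List<Long> hits() { return hits; }

    public static void main(String[] args) {
        LongDocIdCollector c = new LongDocIdCollector();
        // Two segments, each holding Integer.MAX_VALUE documents.
        c.setNextSegment(0L);
        c.collect(Integer.MAX_VALUE - 1);           // last doc of segment 1
        c.setNextSegment((long) Integer.MAX_VALUE); // base of segment 2
        c.collect(0);                               // first doc of segment 2
        System.out.println(c.hits());               // prints [2147483646, 2147483647]
    }
}
```

The base-offset bookkeeping is analogous to what per-segment search already does with int bases; carrying the base as a long is the part a custom collector would have to supply.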
>>>
>>> We have been using Lucene for some time and our document count is growing
>>> rather rapidly. Maybe this is a much-discussed issue already, but I did
>>> not find the thread; any pointer would be really appreciated.
>>>
>>> Thanks very much for your help, Lisheng
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>>
>>
>
>
>



-- 
Lance Norskog
goksron@gmail.com

