lucene-java-user mailing list archives

From "Jack Krupansky"
Subject Re: maxDoc/numDocs int fields
Date Fri, 21 Mar 2014 17:29:09 GMT
Every word occurrence, or every unique word? Integer.MAX_VALUE is about 2.1
billion (2,147,483,647), and even the OED only has 600,000 words defined. The
former doesn't sound like a good use-case match for Lucene as it exists today.
Lucene indexes "documents", not "words".

I'm sure some day Lucene will switch from int to long, but not in the very 
near future (maybe Lucene 6.0??), especially since it probably isn't a good 
match for existing hardware. Maybe when Lucene moves a lot more stuff off 
heap, then it might make more sense.

Sure, you could maintain your own Lucene branch that literally makes that
switch now, but otherwise, that's the limit for now.

-- Jack Krupansky

-----Original Message----- 
From: Artem Gayardo-Matrosov
Sent: Friday, March 21, 2014 12:41 PM
Subject: Re: maxDoc/numDocs int fields

Hi Oli,

Thanks for your reply,

I thought about this, but it feels like making a crude, inefficient
implementation of what's already in Lucene -- CompositeReader, isn't it? It
would involve writing my own CompositeCompositeReader, which would forward the
requests to the underlying CompositeReader...

Is there a better way?
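
For illustration, the bookkeeping that splitting the corpus would need can be sketched in plain Java. This is only a minimal sketch, not anything Lucene provides: the class ShardedDocIds and all of its methods are hypothetical names, and it assumes each shard is a separate index holding at most Integer.MAX_VALUE documents. It maps a global long ID to a (shard, local int docID) pair and back.

```java
import java.util.Arrays;

/**
 * Hypothetical helper, not a Lucene class: maps a global long ID onto a
 * (shard, local int docID) pair, given each shard's document count.
 */
class ShardedDocIds {
    // starts[i] is the global ID of the first document in shard i;
    // starts[shardCount] is the total number of documents.
    private final long[] starts;

    ShardedDocIds(int[] shardDocCounts) {
        starts = new long[shardDocCounts.length + 1];
        for (int i = 0; i < shardDocCounts.length; i++) {
            if (shardDocCounts[i] <= 0) {
                throw new IllegalArgumentException("shard " + i + " must be non-empty");
            }
            starts[i + 1] = starts[i] + shardDocCounts[i];
        }
    }

    long totalDocs() {
        return starts[starts.length - 1];
    }

    int shardOf(long globalId) {
        if (globalId < 0 || globalId >= totalDocs()) {
            throw new IllegalArgumentException("global ID out of range: " + globalId);
        }
        int pos = Arrays.binarySearch(starts, globalId);
        // Exact hit: globalId is the first document of shard pos.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the owning
        // shard is the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    int localDocId(long globalId) {
        // The difference is smaller than the shard's int document count,
        // so the narrowing cast cannot overflow.
        return (int) (globalId - starts[shardOf(globalId)]);
    }

    long toGlobalId(int shard, int localDocId) {
        if (shard < 0 || shard >= starts.length - 1) {
            throw new IllegalArgumentException("no such shard: " + shard);
        }
        if (localDocId < 0 || localDocId >= starts[shard + 1] - starts[shard]) {
            throw new IllegalArgumentException("local docID out of range: " + localDocId);
        }
        return starts[shard] + localDocId;
    }
}
```

Each shard keeps its own int docIDs, and only the layer on top works with long values, so the int-based APIs are never asked to hold more than they can.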


On Fri, Mar 21, 2014 at 6:33 PM, Oliver Christ wrote:

> Can you split your corpus across multiple Lucene instances?
> Cheers, Oli
> -----Original Message-----
> From: Artem Gayardo-Matrosov
> Sent: Friday, March 21, 2014 12:29 PM
> Subject: maxDoc/numDocs int fields
> Hi all,
> I am using lucene to index a large corpus of text, with every word being a
> separate document (this is something I cannot change), and I am hitting a
> limitation of the CompositeReader only supporting Integer.MAX_VALUE
> documents.
> Is there any way to work around this limitation? For the moment I have
> implemented my own DirectoryReader and BaseCompositeReader to at least make
> them support documents from Integer.MIN_VALUE to -1 (for twice as many
> documents supported). The problem is that all the APIs are restricted to
> the int type, and after the docID value wraps back to 0, I have no way
> to restore the original docID.
> --
> Thanks in advance,
> Artem.
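
The wrap-around described in the original question can be shown with plain Java arithmetic. This is a sketch under stated assumptions: UnsignedDocIds, encode and decode are hypothetical names, and nothing here says whether Lucene's own APIs tolerate negative docIDs. Reading an int as unsigned gives 2^32 distinct values, which is where the "twice as many" comes from; a 33rd bit has nowhere to go, so larger ordinals cannot be recovered from the int alone.

```java
/** Hypothetical helper, not a Lucene class: treats an int docID as an unsigned 32-bit value. */
class UnsignedDocIds {
    // Largest ordinal that still fits in 32 bits: 2^32 - 1.
    static final long MAX_ORDINAL = 0xFFFF_FFFFL;

    /** Packs an ordinal in [0, 2^32) into an int; values above Integer.MAX_VALUE become negative. */
    static int encode(long ordinal) {
        if (ordinal < 0 || ordinal > MAX_ORDINAL) {
            throw new IllegalArgumentException("ordinal does not fit in 32 bits: " + ordinal);
        }
        return (int) ordinal;
    }

    /** Recovers the ordinal by reading the int's bits as unsigned. */
    static long decode(int docId) {
        return Integer.toUnsignedLong(docId);
    }
}
```

For example, encode(3_000_000_000L) yields a negative int, and decode of that int returns 3,000,000,000 again. An ordinal of 2^32 or more is rejected rather than silently wrapping to a small value.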


