lucene-java-user mailing list archives

From jian chen <>
Subject Re: maximum number of documents
Date Wed, 12 Oct 2005 16:50:44 GMT
Hi, Koji,

I think you are right: the maximum number of documents should be Integer.MAX_VALUE.

Some more points below:

1) I double-checked the Lucene documentation. The file format section says
that SegSize is a UInt32. I don't think this is accurate: the UInt32 maximum
is around 4 billion, but Integer.MAX_VALUE is half of that, around 2 billion.

Java has no unsigned integer type, so since Lucene uses int to store doc
ids, the maximum you can get is about 2 billion.

Maybe the documentation could cover this in more detail? Specifically, it
could state the actual maximum document id, 2147483647.
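
To make the arithmetic concrete, here is a minimal sketch in plain Java (no Lucene classes involved). The class name DocIdLimits and the helper nextDocId are made up for illustration. It compares the signed int limit with what a true UInt32 could hold, and shows what happens when an int counter goes past the limit.

```java
class DocIdLimits {
    // Largest value of a signed 32-bit Java int: 2^31 - 1.
    static final long MAX_SIGNED = Integer.MAX_VALUE;
    // Largest value an unsigned 32-bit field (UInt32) could hold: 2^32 - 1.
    static final long MAX_UNSIGNED = (1L << 32) - 1;

    // Hypothetical helper: the next doc id if ids are plain ints.
    // Past Integer.MAX_VALUE the result silently wraps to a negative number.
    static int nextDocId(int current) {
        return current + 1;
    }

    public static void main(String[] args) {
        System.out.println("Integer.MAX_VALUE = " + MAX_SIGNED);   // 2147483647
        System.out.println("UInt32 max        = " + MAX_UNSIGNED); // 4294967295
        System.out.println("after overflow    = " + nextDocId(Integer.MAX_VALUE)); // -2147483648
    }
}
```

So even if the file format reserves 32 unsigned bits, an int on the Java side can only use the lower half of that range.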

2) I think in theory, if you need to index 8 billion docs, you can use 4
indexes. When you search, just search all 4 indexes and combine the results.
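
A rough sketch of the "combine" step, in plain Java rather than Lucene's own API. The names ShardMerge, Hit, globalId and mergeTopN are all hypothetical. It assumes each index returns hits with a local int doc id and a score, and that scores from different indexes are comparable, which may not hold unless term statistics are shared across the indexes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ShardMerge {
    /** One hit: the index (shard) it came from, its doc id inside that index, and its score. */
    static final class Hit {
        final int shard;
        final int localDocId;
        final float score;

        Hit(int shard, int localDocId, float score) {
            this.shard = shard;
            this.localDocId = localDocId;
            this.score = score;
        }

        /** Hypothetical global id: the shard number in the high bits, the local id in the low 31 bits. */
        long globalId() {
            return ((long) shard << 31) | localDocId;
        }
    }

    /** Pools the hits of every shard and keeps the n highest-scoring ones. */
    static List<Hit> mergeTopN(List<List<Hit>> perShard, int n) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perShard) {
            all.addAll(hits);
        }
        // Highest score first; the sort is stable, so ties keep shard order.
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(n, all.size())));
    }
}
```

Each index keeps its own int doc ids below the 2 billion limit; only the merged view needs a wider id.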

3) Looking at the Lucene source code, it does not seem that difficult to
change the doc id to use long instead. I noticed that OutputStream's
writeVInt and writeVLong use exactly the same code, so there should be no
performance penalty from switching to long.
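
To illustrate the point about the two methods, here is a small standalone sketch of a variable-length encoding of this kind: 7 data bits per byte, with the high bit set when more bytes follow. It is written from the description in the file format documentation, not copied from Lucene's source, and the names VarIntSketch, encodeVInt and encodeVLong are made up.

```java
import java.io.ByteArrayOutputStream;

class VarIntSketch {
    /** Encodes a non-negative int, 7 bits per byte, low-order bits first. */
    static byte[] encodeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit set: more bytes follow
            value >>>= 7;
        }
        out.write(value); // last byte, high bit clear
        return out.toByteArray();
    }

    /** Same loop, but over a long. */
    static byte[] encodeVLong(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7FL) | 0x80L));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Any value that fits in an int produces the same bytes either way.
        System.out.println(java.util.Arrays.equals(encodeVInt(300), encodeVLong(300L)));
        // A value near 8 billion still fits in 5 bytes, like Integer.MAX_VALUE.
        System.out.println(encodeVLong(8_000_000_000L).length);
    }
}
```

With this scheme the number of bytes depends only on the magnitude of the value, not on the declared type. This says nothing about in-memory structures, where a long takes twice the space of an int.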

4) However, if you have 8 billion docs to index, just changing the doc id to
use long is not enough, I guess. You may also need to adjust other
parameters, such as the IndexInterval (used for storing the term info
index). Because the term info index (tii) is loaded entirely into memory,
instead of leaving it at 128 you may have to change it to 256 or bigger to
avoid running out of memory.
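
As a back-of-the-envelope illustration: if the tii holds every IndexInterval-th term, the number of entries kept in memory is roughly the number of unique terms divided by the interval. The sketch below is plain Java; the class and method names are made up, and the figure of 1 billion unique terms is purely hypothetical. It does not estimate bytes per entry.

```java
class TermIndexEstimate {
    /** Entries held in memory if every indexInterval-th term is kept (rounded up). */
    static long indexedTermEntries(long totalTerms, int indexInterval) {
        return (totalTerms + indexInterval - 1) / indexInterval;
    }

    public static void main(String[] args) {
        long totalTerms = 1_000_000_000L; // hypothetical number of unique terms
        System.out.println(indexedTermEntries(totalTerms, 128)); // 7812500
        System.out.println(indexedTermEntries(totalTerms, 256)); // 3906250
    }
}
```

Doubling the interval halves the number of in-memory entries, at the cost of scanning more terms on disk per lookup.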



On 10/12/05, Koji Sekiguchi <> wrote:
> Hello,
> Is the maximum number of documents in an index Integer.MAX_VALUE? (approx
> 2 billion)
> If so, if I want to have 8 billion docs indexed, like Google,
> can I do it with having four indices, theoretically?
> Koji