lucene-dev mailing list archives

From Walter Underwood <wun...@wunderwood.org>
Subject Re: Documenting document limits for Lucene and Solr
Date Thu, 31 May 2012 17:30:20 GMT
Deleted documents still consume doc IDs until they are merged away, so you may run out of doc IDs with far fewer than 2^31 searchable documents.
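
To see how much of the ID space is in use, compare maxDoc (IDs consumed, including
deleted-but-unmerged documents) with numDocs (searchable documents only). A minimal
sketch against the plain Lucene API, assuming the 3.x-style IndexReader:

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.FSDirectory;

    public class DocIdUsage {
        public static void main(String[] args) throws Exception {
            // args[0] is the index directory; open it read-only.
            IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
            try {
                int maxDoc = reader.maxDoc();    // doc IDs consumed: live + deleted
                int numDocs = reader.numDocs();  // searchable (live) documents only
                System.out.println("IDs used:   " + maxDoc);
                System.out.println("searchable: " + numDocs);
                System.out.println("deleted, not yet merged: " + (maxDoc - numDocs));
            } finally {
                reader.close();
            }
        }
    }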

I recommend designing with a lot of slack: plan on using only about 75% of the ID space. Solr
could alert when 90% of the space is used.
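
A sketch of what such a check could look like, assuming you poll maxDoc periodically
and treat Integer.MAX_VALUE as the hard ceiling (the exact limit is not formally
documented, which is the point of the message below):

    import org.apache.lucene.index.IndexReader;

    // Hypothetical headroom check; the 75% / 90% thresholds are the ones
    // suggested above, not anything built into Lucene or Solr.
    public class DocIdHeadroom {
        public static void check(IndexReader reader) {
            final long ceiling = Integer.MAX_VALUE;   // assumed hard limit, ~2^31
            long used = reader.maxDoc();              // counts deleted, unmerged docs too
            double fraction = (double) used / ceiling;
            if (fraction >= 0.90) {
                System.err.println("ALERT: doc ID space " + (int) (fraction * 100) + "% used");
            } else if (fraction >= 0.75) {
                System.err.println("WARNING: past the 75% design target for doc IDs");
            }
        }
    }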

If you delete everything and then re-add everything without an intervening commit, you will use
2X the doc IDs. And that isn't even the worst case.
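
Rough arithmetic for a million-document index, assuming no merge runs in between:

    1,000,000 live docs            ->  maxDoc = 1,000,000
    delete all (IDs still held)    ->  maxDoc = 1,000,000, numDocs = 0
    re-add 1,000,000               ->  maxDoc = 2,000,000, numDocs = 1,000,000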

If you reduce or black out merging, you can end up with serious doc ID consumption, since merging
is what reclaims the IDs of deleted documents.

With no merges, if you encounter lots of near-duplicates and routinely replace documents with a
better version, you can have many deleted documents for each searchable one. This can happen
with web spidering: if you find five mirrors of a million-document site, and find the best copy
last, you use five million doc IDs for those million searchable documents.
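
The arithmetic for that case, with no merges reclaiming IDs along the way:

    5 copies x 1,000,000 docs  =  5,000,000 doc IDs consumed
    searchable (best copy)     =  1,000,000
    deleted but unmerged       =  4,000,000 (80% of the IDs used)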

wunder

On May 30, 2012, at 8:52 AM, Jack Krupansky wrote:

> AFAICT, there is no clear documentation of the maximum number of documents that can be
> stored in a Lucene or Solr index (single core/shard). It appears to be 2^31, since a Lucene
> document number, and the value returned from IW.maxDoc, is a Java “int”. Lucene users have
> that “hint” to guide them, but that hint is never surfaced for Solr users, AFAICT. A few
> years ago nobody in their right mind would have imagined indexing 2 billion documents on a
> single machine/core, but now people are at least tempted to try. So it is now more important
> for people to know about the limit up front, rather than having it hidden down in the fine
> print of the Lucene file formats documentation.
>  
> I want to file a Jira on this, but first I wanted to check whether anybody knows of an
> existing Jira that was perhaps worded in a way that escaped my semi-diligent searches.
>  
> I was also thinking of filing it as two Jiras, one for Lucene and one for Solr, since the
> documentation would live in different places. Or should there be one combined “Lucene/Solr
> Capacity Limits/Planning” wiki? Unless somebody objects, I’ll file two separate (but linked)
> issues.
>  
> And I was also thinking of filing two more Jiras, one each for Lucene and Solr, to add a
> robust check for exceeding the underlying Lucene limit and to report it in a well-defined
> manner, rather than letting “numFound” or “maxDoc” go negative. But that is separate from
> the documentation issue, I think. Unless somebody objects, I’ll file these as two separate
> issues.
>  
> Any objection to me filing these four issues?
> 
> -- Jack Krupansky