lucene-solr-user mailing list archives

From Phillip Farber <pfar...@umich.edu>
Subject Index size vs. number of documents
Date Wed, 13 Aug 2008 17:45:22 GMT

We're indexing the OCR for a large number of books.  Our experimental 
schema is simple: an id field and an OCR text field (not stored).

Currently we just have two data points:

3005 documents = 723 MB index
174237 documents = 51460 MB index

These indexes are not optimized.

If the index size were a linear function of the number of documents, 
then based on just these two data points you'd expect the index for 
174237 docs to be approximately 57.98 times larger than 723 MB, or 
about 41921 MB.  Actually it's 51460 MB, or about 23% bigger.
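
The arithmetic above can be checked with a short Python sketch. The 
variable names are illustrative only; the four numbers are the two data 
points quoted in this message.

```python
# Two measured data points: (number of documents, index size in MB).
small_docs, small_mb = 3005, 723
large_docs, large_mb = 174237, 51460

# If size grew linearly with document count, the larger index would be
# the smaller one scaled by the ratio of document counts.
ratio = large_docs / small_docs
expected_mb = ratio * small_mb

# Fraction by which the measured size exceeds the linear estimate.
excess = large_mb / expected_mb - 1

print(f"document ratio:          {ratio:.2f}")
print(f"expected size if linear: {expected_mb:.0f} MB")
print(f"excess over linear:      {excess:.1%}")
```

With only two data points this shows that growth is faster than linear, 
but not what shape the curve has.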

I suspect the non-linear increase is due to dirty OCR, which continually 
adds new misrecognized "words" to the set of unique terms to be indexed.

Another possibility is that the larger index has a higher proportion of 
documents containing characters from non-Latin alphabets, thereby 
increasing the number of unique words.  I can't verify that at this point.
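
One way to test either hypothesis would be to track how the number of 
unique terms grows as documents are added.  The following is a minimal 
sketch, not a measurement of the real index: the function name 
vocabulary_growth, the crude regex tokenizer, and the sample strings 
are all assumptions made for illustration, and the tokenizer will not 
match whatever analyzer the schema actually uses.

```python
import re

def vocabulary_growth(docs):
    """Return the cumulative count of unique terms after each document."""
    seen = set()
    growth = []
    for text in docs:
        # Assumed tokenizer: lowercase runs of word characters.
        # A real Lucene/Solr analyzer would tokenize differently.
        seen.update(re.findall(r"\w+", text.lower()))
        growth.append(len(seen))
    return growth

# Made-up sample: the second "page" simulates OCR errors.
sample = [
    "the quick brown fox",
    "the qu1ck brovvn fox",
    "the quick brown dog",
]
print(vocabulary_growth(sample))
```

If the cumulative count keeps climbing steeply on clean-language text, 
OCR noise is a plausible cause; running the same count per language or 
script would help separate out the non-Latin case.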

Are these reasonable assumptions or am I missing other factors that 
could contribute to the non-linear growth in index size?

Regards,

Phil

