lucene-java-user mailing list archives

From Gaurav gupta <gupta.gaurav0...@gmail.com>
Subject Re: how to reasonably estimate the disk size for Lucene 4.x
Date Tue, 24 Mar 2015 12:37:17 GMT
Erick,
While further testing the index sizes using the Lucene APIs (I am using
Lucene directly, not through Solr), I found that the index sizes are quite
large compared to the formula (I have attached the Excel sheet). One thing
I did observe is that the index size increases linearly with the number of
input records/documents, so can I advise the customer to build indexes of
1M, 5M and 10M records and then extrapolate to 250M records? Basically the
customer wants to do capacity planning for disk etc., and that is why he is
looking for a way to reasonably predict the Lucene index size.
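
The measurement I have in mind is roughly the sketch below (the index path
and document counts are placeholders): it just sums the files in the sample
index directory and scales the per-document size linearly.

import java.io.File;

// Rough capacity check: sum the on-disk size of a sample index and scale
// it linearly to the target document count.
public class IndexSizeEstimator {

    // Sums the sizes of the files directly inside the index directory.
    static long indexSizeBytes(File indexDir) {
        long total = 0;
        File[] files = indexDir.listFiles();
        if (files != null) {
            for (File f : files) {
                if (f.isFile()) {
                    total += f.length();
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Placeholders: an index built from a known sample, e.g. 1M records.
        File sampleIndex = new File("/data/lucene/sample-index");
        long sampleDocs = 1000000L;
        long targetDocs = 250000000L;

        long sampleBytes = indexSizeBytes(sampleIndex);
        double bytesPerDoc = (double) sampleBytes / sampleDocs;
        double projectedGB = (bytesPerDoc * targetDocs) / (1024.0 * 1024.0 * 1024.0);

        System.out.printf("Sample: %.1f MB for %d docs (%.1f bytes/doc)%n",
                sampleBytes / (1024.0 * 1024.0), sampleDocs, bytesPerDoc);
        System.out.printf("Projected size for %d docs: %.1f GB%n",
                targetDocs, projectedGB);
    }
}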



Lucene index size calculation

# of Indexed Fields: 11
# of Stored Fields: 11

Note: I am using the StandardAnalyzer for all fields. I am indexing the
records from a CSV file, and each record is about 0.2 KB (size of each doc
= total file size / no. of records).

Records (million)   Actual index size (MB)   Size per formula (optimized, MB)
  1                       255                       31.1897507
  5                      1361.92                    75.9487534
 10                      2703.36                   131.897507
 25                      7239.68                   299.743767
 50                     15257.6                    579.487534
 75                     26009.6                    859.2313
100                     32256                     1138.97507
125                     39526.4                   1418.71883

Thanks

On Tue, Mar 10, 2015 at 9:08 PM, Erick Erickson <erickerickson@gmail.com>
wrote:

> In a word... no. There are simply too many variables here to give any
> decent estimate.
>
> The spreadsheet is, at best, an estimate. It hasn't been put through
> any rigorous QA so the fact that it's off in your situation is not
> surprising. I wish we had a better answer.
>
> And the disk size isn't particularly interesting anyway. The *.fdt and
> *.fdx files contain compressed copies of the raw data in _stored_
> fields. If I index the same data once with all fields set stored="true"
> and once with stored="false", my disk size may vary by a large factor.
> And the stored data has very little memory cost, memory usually being
> the limiting factor in your Solr installation.
>
> Are you storing position information? Term vectors? Are you ngramming
> your fields? and on and on. Each and every one of these changes the
> memory requirements...
>
> Sorry we can't be more help
> Erick
>
> On Mon, Mar 9, 2015 at 12:20 PM, Gaurav gupta
> <gupta.gaurav0125@gmail.com> wrote:
> > Could you please guide me on how to reasonably estimate the disk size
> > for Lucene 4.x (precisely version 4.8.1), including the worst-case
> > scenario.
> >
> > I have referred the formula and excel sheet shared @
> >
> https://lucidworks.com/blog/estimating-memory-and-storage-for-lucenesolr/
> >
> > I think it was devised for Lucene 2.9; I am not sure whether it holds
> > true for the 4.x versions.
> > In my case, the actual index size either comes close to the worst case
> > or exceeds it. One of our enterprise customers has even observed an
> > index three times larger than the size estimated from the Excel sheet.
> >
> > Alternatively, can I find out the average document size in a Lucene
> > index (built from a reasonable amount of data) so that I can
> > extrapolate it to the full 250 million documents?
> >
> > Thanks
> > Gaurav
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
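
P.S. To make sure I follow your point about stored fields: in Lucene 4.8
terms I read it roughly as the sketch below (placeholder field names, purely
an illustration rather than my actual schema):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;

// Illustration only: per-field options in Lucene 4.8. Whether a field is
// stored decides what goes into the .fdt/.fdx stored-fields files; term
// vectors are an optional extra that grows the index further.
public class FieldSetupSketch {

    static Document buildDoc(String title, String body) {
        Document doc = new Document();

        // Indexed AND stored: a compressed copy of the raw value is kept
        // in the .fdt/.fdx files.
        doc.add(new Field("title", title, TextField.TYPE_STORED));

        // Indexed but NOT stored: searchable (positions are indexed by
        // default for TextField), but the raw text is not kept, so the
        // stored-fields files stay smaller.
        FieldType bodyType = new FieldType(TextField.TYPE_NOT_STORED);
        bodyType.setStoreTermVectors(true);         // optional, written to
        bodyType.setStoreTermVectorPositions(true); // separate term-vector files
        bodyType.freeze();
        doc.add(new Field("body", body, bodyType));

        return doc;
    }
}

So in my case, where all 11 fields are stored, the .fdt/.fdx files would
carry a compressed copy of every record.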
