lucene-java-user mailing list archives

From Jack Krupansky <jack.krupan...@gmail.com>
Subject Re: how to reasonably estimate the disk size for Lucene 4.x
Date Tue, 24 Mar 2015 15:22:21 GMT
Indexing a fraction of the data, such as 10% or 5%, is probably the best
way to do size estimation.
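As a rough illustration of that approach, here is a minimal Lucene 4.8 sketch that indexes
a sample, measures the on-disk size, and scales up. The path, the trivial document layout,
and the 1M-sample / 250M-total counts are assumptions for illustration, not a real schema:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class SampleSizeEstimate {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(new File("/tmp/sample-index"));
            IndexWriterConfig cfg = new IndexWriterConfig(
                    Version.LUCENE_4_8, new StandardAnalyzer(Version.LUCENE_4_8));
            IndexWriter writer = new IndexWriter(dir, cfg);

            long sampleDocs = 1_000_000L;   // hypothetical sample: 1M of the 250M records
            for (long i = 0; i < sampleDocs; i++) {
                Document doc = new Document();
                // Stand-in for the real CSV-derived fields.
                doc.add(new TextField("body", "record " + i, Field.Store.YES));
                writer.addDocument(doc);
            }
            writer.forceMerge(1);           // measure the fully merged ("optimized") size
            writer.close();

            long bytes = 0;
            for (String name : dir.listAll()) {
                bytes += dir.fileLength(name);   // sum the actual index files on disk
            }
            long totalDocs = 250_000_000L;
            System.out.printf("sample: %d MB, extrapolated: %d MB%n",
                    bytes / (1024 * 1024),
                    bytes / (1024 * 1024) * (totalDocs / sampleDocs));
            dir.close();
        }
    }

With real documents you would feed the actual records through your own fields and analyzer
chain; the point is simply that the number comes from files IndexWriter actually wrote,
not from a formula.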

The only real caveat is that you need to look at RAM as well. Most modern
hardware has huge mass-storage capacity relative to what Lucene needs to
process that data, while IT staffs tend to be very, very stingy with RAM (or
they give you big, fat nodes, but way too few of them). So even though most
hardware easily has the disk space for 32 or 64 or 128 GB of index, getting
that much RAM can be problematic, especially when the IT staff has drunk
heavily of the "hey, everything runs great on commodity hardware!" Kool-Aid.
In other words, running a 32 GB index on a 16 GB box is probably not a great
idea if you need low latency.

-- Jack Krupansky

On Tue, Mar 24, 2015 at 8:37 AM, Gaurav gupta <gupta.gaurav0125@gmail.com>
wrote:

> Erick,
> While further testing the index sizes using the Lucene APIs (I am using
> Lucene directly, not through Solr), I found that the index sizes are quite
> large compared to the formula (I have attached the Excel sheet). One thing
> I do observe is that the index size increases linearly with the number of
> input records/documents, so can I ask the customer to build indexes of 1M,
> 5M and 10M records and then extrapolate to 250M records? Basically the
> customer wants to do capacity planning for disk etc., and that is why he is
> looking for a way to reasonably predict the Lucene index size.
>
>
>
>    Lucene index size calculation
>
>    # of indexed fields: 11
>    # of stored fields: 11
>
>    *Note:* I am using the standard Analyzer for all fields. I am indexing
>    the records from a CSV file, and each record is about 0.2 KB
>    (size of each doc = total file size / no. of records).
>
>    Records (million)   Actual index size (MB)   Size per formula, optimized (MB)
>                    1                   255                          31.1897507
>                    5                  1361.92                       75.9487534
>                   10                  2703.36                      131.897507
>                   25                  7239.68                      299.743767
>                   50                 15257.6                       579.487534
>                   75                 26009.6                       859.2313
>                  100                 32256                        1138.97507
>                  125                 39526.4                      1418.71883
>
> Thanks
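For what it's worth, a small sketch of that extrapolation idea: fit a straight line to the
measured points quoted above and read it off at 250M records. The linear model is an
assumption; only the measured values come from the table:

    public class ExtrapolateIndexSize {
        public static void main(String[] args) {
            // Measured points from the table above (records in millions, index size in MB).
            double[] m  = {1, 5, 10, 25, 50, 75, 100, 125};
            double[] mb = {255, 1361.92, 2703.36, 7239.68, 15257.6, 26009.6, 32256, 39526.4};

            // Ordinary least-squares fit of size ~= a + b * millions.
            double n = m.length, sx = 0, sy = 0, sxx = 0, sxy = 0;
            for (int i = 0; i < m.length; i++) {
                sx += m[i]; sy += mb[i]; sxx += m[i] * m[i]; sxy += m[i] * mb[i];
            }
            double b = (n * sxy - sx * sy) / (n * sxx - sx * sx); // MB per million docs
            double a = (sy - b * sx) / n;                         // fixed overhead in MB

            double estimate250 = a + b * 250;                     // hypothetical 250M-doc index
            System.out.printf("slope ~= %.1f MB per 1M docs, 250M docs ~= %.1f GB%n",
                    b, estimate250 / 1024.0);
        }
    }

Treat the result as a rough planning number; the measured points themselves, especially the
largest samples, remain the most reliable guide.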
>
> On Tue, Mar 10, 2015 at 9:08 PM, Erick Erickson <erickerickson@gmail.com>
> wrote:
>
>> In a word... no. There are simply too many variables here to give any
>> decent estimate.
>>
>> The spreadsheet is, at best, an estimate. It hasn't been put through
>> any rigorous QA, so the fact that it's off in your situation is not
>> surprising. I wish we had a better answer.
>>
>> And the disk size isn't particularly interesting anyway. The *.fdt and
>> *.fdx files contain compressed copies of the raw data in _stored_
>> fields. If I index the same data once with all fields set stored="true"
>> and again with stored="false", the disk size can vary by a large factor.
>> And the stored data has very little memory cost, memory usually being the
>> limiting factor in your Solr installation.
>>
>> Are you storing position information? Term vectors? Are you ngramming
>> your fields? And so on. Each and every one of these changes the
>> memory requirements...
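For illustration, a minimal Lucene 4.x sketch of the kind of per-field options Erick is
referring to; the field name and the two option combinations are assumptions, but each
setting changes which index files are written and how large they grow:

    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.index.FieldInfo.IndexOptions;

    public class FieldSizeKnobs {
        public static void main(String[] args) {
            // A "heavy" field: stored, with term vectors and positions.
            FieldType heavy = new FieldType();
            heavy.setIndexed(true);
            heavy.setStored(true);            // raw text kept in *.fdt/*.fdx
            heavy.setStoreTermVectors(true);  // per-document term vectors (*.tvd/*.tvx)
            heavy.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
            heavy.freeze();

            // A "light" field: indexed only, docs-only postings, nothing stored.
            FieldType light = new FieldType();
            light.setIndexed(true);
            light.setStored(false);
            light.setIndexOptions(IndexOptions.DOCS_ONLY);
            light.freeze();

            Field big   = new Field("body", "some example text", heavy);
            Field small = new Field("body", "some example text", light);
            System.out.println(big + " vs " + small);
        }
    }

Indexing the same data once with the "heavy" configuration and once with the "light" one,
then comparing the directory sizes, makes the spread Erick describes concrete.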
>>
>> Sorry we can't be of more help,
>> Erick
>>
>> On Mon, Mar 9, 2015 at 12:20 PM, Gaurav gupta
>> <gupta.gaurav0125@gmail.com> wrote:
>> > Could you please guide me on how to reasonably estimate the disk size
>> > for Lucene 4.x (precisely version 4.8.1), including the worst-case scenario?
>> >
>> > I have referred to the formula and Excel sheet shared at
>> > https://lucidworks.com/blog/estimating-memory-and-storage-for-lucenesolr/
>> >
>> > It seems to have been devised for Lucene 2.9, and I am not sure whether
>> > it holds true for the 4.x versions.
>> > In my case, the actual index size either comes close to the worst case
>> > or exceeds it. One of our enterprise customers has even observed an
>> > index size 3 times higher than the estimated index size (based on the
>> > Excel sheet).
>> >
>> > Alternatively, can I determine the average doc size in a Lucene index
>> > (built over a reasonable amount of data) so that I can extrapolate it to
>> > the complete 250 million documents?
>> >
>> > Thanks
>> > Gaurav
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
