lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Cost of enabling doc values
Date Thu, 14 Jun 2018 16:05:16 GMT
My claim is that it simply doesn't matter. You have to have those
bytes lying around either on disk in the DV case, using OS memory, or
in the cumulative Java heap in the non-DV case.

If you're doing one of the three operations, I know of no situation
where I would _not_ enable docValues.

The Lucene people put a lot of effort into making things compact, so
what you're coming up with is probably an upper bound. Frankly I'd just
enable the DV fields, index a bunch of docs and look at the cumulative
sizes of your .dvd and .dvm files.

I'd probably index, say, 10M docs and measure the two extensions, then
index 10M more and use the delta between 10M and 20M to extrapolate.
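
A minimal sketch of that measurement in Python, assuming the index
directory of one core is readable from where the script runs. The
function names and the example path are hypothetical, and the
extrapolation is a plain linear fit through the two measurements, so
treat the result as a rough estimate (segment merges will move the
numbers around):

```python
from pathlib import Path

# Lucene writes docValues into .dvd (data) and .dvm (metadata) files.
DOC_VALUES_EXTENSIONS = {".dvd", ".dvm"}


def doc_values_bytes(index_dir):
    """Sum the on-disk size of all .dvd and .dvm files under index_dir."""
    return sum(
        p.stat().st_size
        for p in Path(index_dir).rglob("*")
        if p.is_file() and p.suffix in DOC_VALUES_EXTENSIONS
    )


def extrapolate_bytes(docs_first, bytes_first, docs_second, bytes_second, target_docs):
    """Linearly extrapolate docValues size from two measurements.

    The per-doc cost is taken from the delta between the two
    measurements, which leaves out any fixed overhead present in the
    first batch.
    """
    bytes_per_doc = (bytes_second - bytes_first) / (docs_second - docs_first)
    return bytes_second + bytes_per_doc * (target_docs - docs_second)
```

Usage would be to call doc_values_bytes("/var/solr/data/mycore/data/index")
(a made-up path) once after 10M docs and once after 20M, then pass both
results to extrapolate_bytes together with the number of docs you
expect that core to hold.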

I also use the size of those files to get something of a sense of how
much OS memory I need for those operations (searching not included
yet). Gives me a sense of whether what I want to do is possible or
not.

Long blog on the topic of sizing, but it sums up as "try it and see":

https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Best,
Erick

On Thu, Jun 14, 2018 at 8:34 AM, root23 <s.manuj545@gmail.com> wrote:
> Thanks for the detailed explanation erick.
> I did a little math as you suggested. Just wanted to see if i am doing it
> right.
> So we have around 4 billion docs in production and around 70 nodes.
>
> To support the business use case we have around 18 fields on which we have
> to enable docvalues for sorting.
>
> FieldType       totalFields   Size of field
> TrieIntField    2             4 bytes
> StrField        7             20 bytes
> IntField        1             4 bytes
> Bool            1             1 byte
> TrieDateField   2             10 bytes
> TextField       5             10 bytes
>
>
> For some of them I approximated the bytes, as for StrField and TextField,
> based on the number of characters we usually have in those fields. I am not
> sure how much the TrieDateField will take. Please feel free to correct me
> if I am way off.
>
> So according to the above, the total size for a doc is
> 2*4 + 20*7 + 4 + 1 + 20 + 50 = 223 bytes.
>
> So for 4 billion docs it comes to approximately 892000000000 bytes, or
> 892 GB.
>
> Does that math sound right, or am I way off?
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
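
For reference, the back-of-the-envelope estimate in the quoted message
can be reproduced with a few lines of Python. The field counts and
per-field byte sizes are the approximations from the quoted table, not
measured values, and the per-node figure assumes the docs are spread
evenly over the 70 nodes with replicas not counted separately:

```python
# (field type, number of fields, approximate bytes per field), as quoted.
FIELDS = [
    ("TrieIntField", 2, 4),
    ("StrField", 7, 20),
    ("IntField", 1, 4),
    ("Bool", 1, 1),
    ("TrieDateField", 2, 10),
    ("TextField", 5, 10),
]

TOTAL_DOCS = 4_000_000_000
NODES = 70

# Uncompressed per-document cost: sum of count * size over all field types.
bytes_per_doc = sum(count * size for _, count, size in FIELDS)
total_bytes = bytes_per_doc * TOTAL_DOCS
# Assumption: even distribution across nodes.
bytes_per_node = total_bytes / NODES

print(bytes_per_doc)   # 223
print(total_bytes)     # 892000000000
print(round(bytes_per_node / 1e9, 1), "GB per node")
```

This only confirms the arithmetic. The actual on-disk size is likely
smaller, which is why measuring the .dvd and .dvm files is the better
guide.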
