lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: In Place Updates: Can we filter on fields with only docValues="true"
Date Sun, 15 Sep 2019 16:06:32 GMT
Filtering is really searching. As Shawn says, you _might_ get away with it in some circumstances,
but it’s not something I’d recommend.

Here’s the problem: For most searches, you’re trying to ask “for term X, what docs contain
it?”. That’s exactly what the inverted index is for: it’s an ordered list of terms, and
each term carries the list of documents it appears in.

DocValues is the exact opposite. It answers “For doc X, what is the value of field Y?”.
When _searching_ on a DV only field, think “table scan” in DB terms.
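
To make the contrast concrete, here is a toy Python sketch. It is not how Lucene stores either structure; the corpus, the field values, and every function name are made up for illustration. It only shows which question each layout answers cheaply.

```python
from collections import defaultdict

# Hypothetical toy corpus: doc id -> value of one field.
docs = {0: "red", 1: "blue", 2: "red", 3: "green"}

# Inverted index: term -> ordered list of doc ids that contain it.
inverted = defaultdict(list)
for doc_id in sorted(docs):
    inverted[docs[doc_id]].append(doc_id)

# docValues-style column: position is the doc id, entry is that doc's value.
column = [docs[doc_id] for doc_id in sorted(docs)]


def search_inverted(term):
    """'For term X, what docs contain it?' is a single lookup."""
    return inverted.get(term, [])


def search_column(term):
    """Same question against the column: visit every doc (the 'table scan')."""
    return [doc_id for doc_id, value in enumerate(column) if value == term]


def value_for_doc(doc_id):
    """'For doc X, what is the value of field Y?' is a single array access."""
    return column[doc_id]
```

Both search functions return the same doc ids; the difference is that `search_column` touches every document no matter how few match, so its cost grows with the size of the index.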

Pick a field with high cardinality (worst case, every doc has a unique value) and try searching
on that. If it’s fast, then I need to go into the code and understand why it’s not doing
what I expect ;).

I’ll add parenthetically that 100M docs with 100 shards seems excessively sharded. Perhaps
you have so many fields that that’s warranted, but it seems high. My rule-of-thumb starting
place is 50M docs/shard. Admittedly that can be low or high: I’ve seen 300M docs fit in
12G and 10M docs strain 31G. You might try testing a node to destruction, see: https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
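
The arithmetic behind that remark, as a small Python sketch. The helper name is hypothetical, and the 50M figure is only the rule-of-thumb starting point mentioned above, not a limit; real sizing still has to come from testing.

```python
import math


def shards_needed(total_docs, docs_per_shard):
    """Smallest shard count that keeps each shard at or under docs_per_shard."""
    return math.ceil(total_docs / docs_per_shard)


total_docs = 100_000_000      # index size from the thread
current_shards = 100          # shard count from the thread
rule_of_thumb = 50_000_000    # starting-point docs/shard, to be tuned by testing

docs_per_shard_now = total_docs // current_shards
suggested_shards = shards_needed(total_docs, rule_of_thumb)

print(f"current: {docs_per_shard_now:,} docs/shard across {current_shards} shards")
print(f"rule-of-thumb start: {suggested_shards} shards")
```

At 100 shards each shard holds about 1M docs, which is 50 times smaller than the starting point.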

Best,
Erick

> On Sep 14, 2019, at 7:54 PM, Shawn Heisey <elyograg@elyograg.org> wrote:
> 
> On 9/14/2019 4:29 PM, Mikhail Khludnev wrote:
>> Shawn, would you mind to provide some numbers?
>> I'm experimenting with lucene 8.0.0.
>> I have 100 shard index of 100M docs with 2000 docVals only updateable
>> fields. Searching for such field turns to be blazingly fast
>> $ curl 'localhost:39200/books/_search?pretty&size=20' -d '
> 
> I have no idea how to read the json you've pasted.  Neither that or the URLs look like
> Solr.
> 
>> I've just updated this field in this particular doc. Other 245K of 100M
>> docs has 1 in it
>> $ curl -H 'Content-Type:application/json'
> 
> <snip>
> 
>> It's dv field without index
>> $ curl -s
>> 'localhost:39200/books/_mapping/field/subscription_0x1?pretty&include_defaults=true'
> 
> What's the cardinality of the field you're searching on?  If it's small, then even an
> inefficient search will be fast.  Try on a field with millions or billions of possible values.
> 
> Thanks,
> Shawn

