lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Loading data to FieldValueCache
Date Fri, 26 Dec 2014 17:26:06 GMT
Manohar:

Please approach this cautiously. You state that you have "hundreds of
states". Every 100 states will use roughly 1.2G of your filter cache,
just for this field. Plus, those entries will fill up the cache and may
soon be aged out anyway. Can you really afford the space? Is it really a
problem that needs to be solved at this point? This _really_ sounds like
premature optimization to me, as you haven't demonstrated that there's
an actual problem you're solving.
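
For reference, the 1.2G figure follows from the maxDoc/8 bytes-per-entry
estimate quoted further down the thread. A minimal sketch of the
arithmetic, assuming all 100 million documents sit in a single shard
(with several shards, maxDoc and the per-entry size shrink accordingly).
The function names are illustrative only, not anything from Solr:

```python
def bitset_bytes(max_doc: int) -> int:
    """Bytes needed to store one bit per document, rounded up."""
    return (max_doc + 7) // 8


def cache_bytes(max_doc: int, entries: int) -> int:
    """Rough size of `entries` cached bitsets, ignoring object overhead."""
    return bitset_bytes(max_doc) * entries


# Assumption: one shard holding the full 100 million documents.
max_doc = 100_000_000
per_entry = bitset_bytes(max_doc)
total = cache_bytes(max_doc, 100)

print(f"{per_entry / 1e6:.1f} MB per entry")    # 12.5 MB per entry
print(f"{total / 1e9:.2f} GB for 100 entries")  # 1.25 GB for 100 entries
```

This counts only the raw bits, so real usage would be somewhat higher.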

OTOH, of course, if you're experimenting to better understand all the
ins and outs of the process, that's another thing entirely ;)....

Toke:

I don't know the complete algorithm, but if the number of docs that
satisfy the fq is "small enough", then just the internal Lucene doc IDs
are stored rather than a bitset. What exactly "small enough" is, I don't
know off the top of my head. And I've got to assume that looking stuff up
in a list is slower than indexing into a bitset, so I suspect "small
enough" is very small....
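
To illustrate the trade-off (not Solr's actual cutoff, which is not
stated here), the sketch below compares the memory of a plain list of
doc IDs against a bitset. It assumes 4 bytes per stored doc ID and picks
whichever is smaller. The function name and the 4-byte figure are
assumptions for the example:

```python
def smaller_representation(hits: int, max_doc: int, bytes_per_id: int = 4):
    """Return (name, bytes) for the more compact of the two layouts.

    Assumption: an ID list costs bytes_per_id per matching document and
    a bitset costs one bit per document in the index.
    """
    list_bytes = hits * bytes_per_id
    bitset_bytes = (max_doc + 7) // 8
    if list_bytes < bitset_bytes:
        return ("id_list", list_bytes)
    return ("bitset", bitset_bytes)


max_doc = 100_000_000
print(smaller_representation(1_000, max_doc))       # ('id_list', 4000)
print(smaller_representation(10_000_000, max_doc))  # ('bitset', 12500000)
```

On memory alone, the list wins below maxDoc/32 hits under these
assumptions. A real implementation may well choose a much lower
threshold because of the lookup cost mentioned above.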

On Fri, Dec 26, 2014 at 3:00 AM, Manohar Sripada <manohar211@gmail.com> wrote:
> Thanks Toke for the explanation, I will experiment with
> f.state.facet.method=enum
>
> Thanks,
> Manohar
>
> On Fri, Dec 26, 2014 at 4:09 PM, Toke Eskildsen <te@statsbiblioteket.dk>
> wrote:
>
>> Manohar Sripada [manohar211@gmail.com] wrote:
>> > I have 100 million documents in my index. The maxDoc here is the maximum
>> > number of documents in each shard, right? How is it determined that each
>> > entry will occupy approximately maxDoc/8 bytes?
>>
>> Assuming that it is random whether a document is part of the result set or
>> not, the most efficient representation is 1 bit/doc (this is often called a
>> bitmap or bitset). So the total number of bits will be maxDoc, which is the
>> same as maxDoc/8 bytes.
>>
>> Of course, result sets are rarely random, so it is possible to have other
>> and more compact representations. I do not know how that plays out in
>> Lucene. Hopefully somebody else can help here.
>>
>> > If I have to add facet.method=enum to every query, how should I
>> > specify it for each field separately?
>>
>> f.state.facet.method=enum
>>
>> See https://wiki.apache.org/solr/SimpleFacetParameters#Parameters
>>
>> - Toke Eskildsen
>>
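
A minimal sketch of how the per-field override might be passed on a
request, built with Python's standard library. The host, port, and
collection name in the base URL are hypothetical placeholders, as is the
helper name; only the facet parameters come from the thread:

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute your own host and collection.
BASE_URL = "http://localhost:8983/solr/collection1/select"


def build_facet_query(field: str, method: str = "enum") -> str:
    """Build a select URL that facets on `field` using a per-field
    facet.method override of the form f.<field>.facet.method."""
    params = [
        ("q", "*:*"),
        ("facet", "true"),
        ("facet.field", field),
        (f"f.{field}.facet.method", method),
    ]
    return BASE_URL + "?" + urlencode(params)


print(build_facet_query("state"))
```

The override applies only to the named field, so other facet fields in
the same request keep the default method.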
