lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Solr facets implementation question
Date Tue, 22 Sep 2015 18:56:05 GMT
FWIW, there is work being done for "high cardinality faceting" with
some of the recent Streaming Aggregation code.

So it's at least on the way if not already there.

Erick

On Tue, Sep 22, 2015 at 11:44 AM, Toke Eskildsen <te@statsbiblioteket.dk> wrote:
> adfel70 <adfel70@gmail.com> wrote:
>> Hi Toke, Thank you for the detailed explanation, that's exactly what I was
>> looking for, except this algorithm fits a single index only. Could you please
>> elaborate on what adjustments are needed for a distributed index?
>
> Vanilla Solr requests top-X terms from each shard, with over-provisioning. I do not
> remember the exact formula (and I think it is adjustable in Solr 5), but something like
> X*1.5+10? Yes, that means that correctness is not guaranteed for distributed faceting.
> It would be possible to make some sort of streaming faceting implementation, but the
> pathological case is that all shards must deliver all terms to derive the correct top-X.
>
> The results from the shards are merged and the top-X terms are fine-counted where needed:
> If we have 3 shards and asked for top-1, they might answer
> shard1: [foo(3), zoo(1)]
> shard2: [foo(1), zoo(1)]
> shard3: [bar(2), aar(2)]
> (remember the over-provisioning). We derive that foo is the top-1 term, but since shard3
> did not provide a count for foo, we need to ask shard3 for the count for that specific
> term to get the correct overall count.
>
> The fine-counting is performed differently from standard faceting. It is basically
> 'original_query AND facet_field:fine_count_term'. Quite fast for a few terms, but if there
> is a need for resolving tens or hundreds of terms for a non-trivial index, the fine-counting
> phase can take longer than the initial faceting phase.
>
> - Toke Eskildsen
> (sorry for the delayed answer - my email reader hid your response)
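
The two-phase process described in the quoted message (over-provisioned top-X per shard, merge, then fine-counting of missing terms) can be sketched roughly as below. This is an illustrative simulation, not Solr code: shards are modelled as plain dicts of term to count, all function names are hypothetical, and the X*1.5+10 formula is taken from the recollection in the thread rather than verified against any Solr version. A foo count of 1 on shard3 is an assumed value added so that the fine-counting step has something to find.

```python
from collections import Counter

def overrequest_limit(limit, ratio=1.5, count=10):
    # Assumed over-provisioning formula ("something like X*1.5+10").
    return int(limit * ratio) + count

def refinement_query(original_query, field, term):
    # Shape of the fine-count query as described in the thread.
    return f"({original_query}) AND {field}:{term}"

def shard_top_terms(shard_counts, n):
    # Phase 1 on one shard: return only its n most frequent terms.
    return dict(Counter(shard_counts).most_common(n))

def distributed_top(shards, limit, per_shard):
    # Phase 1: gather the (over-provisioned) top terms from every shard.
    responses = [shard_top_terms(s, per_shard) for s in shards]
    merged = Counter()
    for response in responses:
        merged.update(response)  # adds counts term by term
    # Pick the top `limit` candidates from the merged partial counts.
    ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    candidates = [term for term, _ in ranked[:limit]]
    # Phase 2: fine-count each candidate on shards that did not report it.
    refined = {}
    for term in candidates:
        total = 0
        for shard, response in zip(shards, responses):
            if term in response:
                total += response[term]
            else:
                # Stands in for sending refinement_query(...) to the shard.
                total += shard.get(term, 0)
        refined[term] = total
    return refined

# The example from the thread, with each shard returning 2 terms.
shards = [
    {"foo": 3, "zoo": 1},
    {"foo": 1, "zoo": 1},
    {"bar": 2, "aar": 2, "foo": 1},  # foo=1 is an assumed hidden count
]
result = distributed_top(shards, limit=1, per_shard=2)
print(result)
```

Note that the sketch only fine-counts the terms that already made it into the merged top-X. A term that fell below the per-shard cutoff on every shard is never considered, which is why correctness is not guaranteed without all shards delivering all terms.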
