Reply-To: solr-user@lucene.apache.org
Message-ID: <1448526745.9730.250.camel@te-prime>
Subject: Re: [Faceting] Exact Value Faceting VS ID Faceting
From: Toke Eskildsen <te@statsbiblioteket.dk>
Reply-To: <te@statsbiblioteket.dk>
To: <solr-user@lucene.apache.org>
Date: Thu, 26 Nov 2015 09:32:25 +0100
Organization: State and University Library, Denmark

On Wed, 2015-11-25 at 15:56 +0000, Alessandro Benedetti wrote:
> I would like to have docValues because facets are going to be heavy on
> those fields.

> *Faceting approach *
> *1) *Indexing the human readable field value

Technically this will be a SORTED or SORTED_SET DocValues field, which
means that a pool of terms is maintained for each segment. The mapping
from documents to terms is done using ordinals, which are not comparable
across segments.

> Facets will be returned readable, out of the box.
> I can not see any cons in this approach, I would say it is the standard one.

With multiple segments, the terms from each segment must somehow be
aligned to avoid duplicate entries in the result. This can be done
either by creating a segment_ordinal->global_ordinal map upon the first
faceting call (facet.method=fc) or by on-the-fly comparison of the top-X
terms from each segment (facet.method=fcs). Either way, there is a
performance penalty.
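
A minimal Python sketch of the segment_ordinal->global_ordinal idea.
The data layout and the function names are made up for illustration;
this is not how Solr implements facet.method=fc, only the principle:

```python
from typing import Dict, List, Tuple


def build_global_ordinals(
    segment_terms: List[List[str]],
) -> Tuple[List[str], List[List[int]]]:
    """segment_terms[s] is the sorted term pool of segment s.

    The index of a term in its pool is its segment ordinal.
    Returns the merged term pool and one segment->global map per segment.
    """
    global_terms = sorted({t for terms in segment_terms for t in terms})
    global_ord = {term: i for i, term in enumerate(global_terms)}
    seg_to_global = [[global_ord[t] for t in terms] for terms in segment_terms]
    return global_terms, seg_to_global


def facet_counts(
    segment_terms: List[List[str]],
    segment_docs: List[List[List[int]]],
) -> Dict[str, int]:
    """segment_docs[s] holds the documents of segment s, each a list of
    segment ordinals. Counting only touches integers; the terms are
    resolved once, after all counts are done."""
    global_terms, seg_to_global = build_global_ordinals(segment_terms)
    counts = [0] * len(global_terms)
    for s, docs in enumerate(segment_docs):
        mapping = seg_to_global[s]
        for doc_ords in docs:
            for seg_ord in doc_ords:
                counts[mapping[seg_ord]] += 1
    return {global_terms[g]: c for g, c in enumerate(counts) if c}
```

Note that ordinal 1 means "cherry" in both example segments below only
by coincidence; ordinal 0 is "apple" in one and "banana" in the other,
which is why the map is needed:

    facet_counts([["apple", "cherry"], ["banana", "cherry"]],
                 [[[0], [1], [0, 1]], [[1], [0]]])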

>    - When calculating faceting, in memory it is used the ordinal for each
>    term, which means in memory we don't waste space for the actual term, or
>    waste the time looking up for the value until the very end of the process,
>    after the counts are done .

The segment_ordinal->global_ordinal map requires memory linear in the
number of unique values in the field. If fcs is used, there will be more
term lookups.
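
A rough Python sketch of the per-segment alternative, to show where the
extra term lookups come from. It is simplified: every term with a
non-zero count is merged instead of only the top-X candidates per
segment, and the names are hypothetical, not Solr's facet.method=fcs
code:

```python
from collections import Counter
from typing import List, Tuple


def count_segment(num_terms: int, docs: List[List[int]]) -> List[int]:
    """Count occurrences per segment ordinal within a single segment."""
    counts = [0] * num_terms
    for doc_ords in docs:
        for seg_ord in doc_ords:
            counts[seg_ord] += 1
    return counts


def merge_by_term(
    segment_terms: List[List[str]],
    segment_docs: List[List[List[int]]],
    top_x: int,
) -> Tuple[List[Tuple[str, int]], int]:
    """Merge per-segment counts by comparing the actual terms.

    No global ordinal map is kept, but every counted ordinal in every
    segment has to be resolved to its term before it can be merged.
    Returns the top-X (term, count) pairs and the number of lookups.
    """
    merged: Counter = Counter()
    lookups = 0
    for terms, docs in zip(segment_terms, segment_docs):
        for seg_ord, count in enumerate(count_segment(len(terms), docs)):
            if count:
                merged[terms[seg_ord]] += count  # ordinal -> term lookup
                lookups += 1
    top = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:top_x]
    return top, lookups
```

A term that occurs in several segments is looked up once per segment,
where a global map would resolve it only once at the end.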

> *2)* Correlate outside the search system each term to a custom ID. Index
> the custom ID. After facets are calculated resolve the ID and show the
> human readable labels.

Assuming the ID is an integer (about the only thing that makes sense),
this ensures that the IDs are comparable across segments, so no
segment->global mapping is needed. This removes the performance penalty
described above and is (as far as I understand) the principle behind
Lucene faceting.

On the other hand, this approach is indeed more complicated and it
introduces another hotspot both for indexing (as document construction
requires a lookup in the term provider) and searching (for resolving the
final terms).
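
A small Python sketch of the ID approach. TermProvider is a
hypothetical stand-in for whatever external system maps labels to IDs;
nothing here is a Solr or Lucene API:

```python
from collections import Counter
from typing import Dict, List, Tuple


class TermProvider:
    """Hypothetical external registry mapping labels to integer IDs."""

    def __init__(self) -> None:
        self._id_by_label: Dict[str, int] = {}
        self._label_by_id: List[str] = []

    def id_for(self, label: str) -> int:
        """Index-time lookup: return the ID, assigning one if needed."""
        if label not in self._id_by_label:
            self._id_by_label[label] = len(self._label_by_id)
            self._label_by_id.append(label)
        return self._id_by_label[label]

    def label_for(self, term_id: int) -> str:
        """Search-time lookup: resolve an ID back to its label."""
        return self._label_by_id[term_id]


def index_document(provider: TermProvider, labels: List[str]) -> List[int]:
    """Document construction needs one provider lookup per label."""
    return [provider.id_for(label) for label in labels]


def facet_by_id(
    segments: List[List[List[int]]], provider: TermProvider, top_x: int
) -> List[Tuple[str, int]]:
    """IDs mean the same thing in every segment, so the counts can be
    summed directly. Only the final top-X IDs are resolved to labels."""
    counts: Counter = Counter()
    for docs in segments:
        for doc_ids in docs:
            counts.update(doc_ids)
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_x]
    return [(provider.label_for(term_id), c) for term_id, c in top]
```

Both hotspots are visible: id_for is called for every value of every
indexed document, and label_for for every returned facet entry.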


If we had a hashing method String->long and guaranteed that there would
be no collisions (or we accepted the occasional faulty result), then we
could avoid the segment->global map as well as the centralized term
server. To my knowledge, this has not yet been attempted.
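
To make the idea concrete, a Python sketch under the assumption that a
64-bit BLAKE2b digest serves as the String->long function. The choice
of hash and all names are illustrative only, and how a hash is turned
back into a readable label is left out:

```python
import hashlib
from collections import Counter
from typing import Iterable, List, Tuple


def term_hash(term: str, digest_bytes: int = 8) -> int:
    """Map a term to a signed integer of digest_bytes bytes (8 = long)."""
    digest = hashlib.blake2b(
        term.encode("utf-8"), digest_size=digest_bytes
    ).digest()
    return int.from_bytes(digest, "big", signed=True)


def facet_by_hash(
    segments: List[List[List[str]]], digest_bytes: int = 8
) -> Counter:
    """The hash is computed locally from the term itself, so it is
    comparable across segments without any shared lookup. Two terms
    that collide are silently counted as one."""
    counts: Counter = Counter()
    for docs in segments:
        for doc_terms in docs:
            counts.update(term_hash(t, digest_bytes) for t in doc_terms)
    return counts


def find_collisions(
    terms: Iterable[str], digest_bytes: int = 8
) -> List[Tuple[str, str]]:
    """Return pairs of distinct terms that share a hash value."""
    seen = {}
    collisions = []
    for term in sorted(set(terms)):
        h = term_hash(term, digest_bytes)
        if h in seen:
            collisions.append((seen[h], term))
        else:
            seen[h] = term
    return collisions
```

With 8 bytes, collisions are unlikely but not ruled out. Shrinking
digest_bytes to 1 forces them for any set of more than 256 terms,
which is a quick way to see what a faulty result would look like.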

- Toke Eskildsen