Subject: Re: [Faceting] Exact Value Faceting VS ID Faceting
From: Toke Eskildsen
Organization: State and University Library, Denmark
Date: Thu, 26 Nov 2015 09:32:25 +0100

On Wed, 2015-11-25 at 15:56 +0000, Alessandro Benedetti wrote:
> I would like to have docValues because facets are going to be heavy on
> those fields.

> *Faceting approach *
> *1) *Indexing the human readable field value

Technically this will be a SORTED or SORTED_SET docValues field, which again means that a pool of terms is maintained for each segment. The mapping from documents to terms is done using ordinals, which are not comparable across segments.

> Facets will be returned readable, out of the box.
> I can not see any cons in this approach, I would say it is the standard one.

With multiple segments, the terms from each segment must somehow be aligned to avoid duplicate entries in the result. This can be done either by creating a segment_ordinal->global_ordinal map upon the first faceting call (facet.method=fc) or by on-the-fly comparison of the top-X terms from each segment (facet.method=fcs). Either way, there is a performance penalty.

> - When calculating faceting, in memory it is used the ordinal for each
> term, which means in memory we don't waste space for the actual term, or
> waste the time looking up for the value until the very end of the process,
> after the counts are done .

The segment_ordinal->global_ordinal map requires memory linear in the number of unique values in the field. If fcs is used, there will be more term lookups.
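To make the segment_ordinal->global_ordinal idea concrete, here is a toy sketch in Python. It is not Lucene's or Solr's actual code; the function names and the two-segment sample data are made up for illustration, and it assumes each segment exposes a sorted list of its unique terms plus one segment ordinal per document.

```python
def build_global_ordinals(segment_terms):
    """segment_terms: one sorted list of unique terms per segment.
    Returns (global_terms, maps) where maps[s][seg_ord] == global_ord."""
    # The sorted union of all terms defines the global ordinals.
    global_terms = sorted(set().union(*segment_terms))
    global_ord = {term: i for i, term in enumerate(global_terms)}
    # One list of ints per segment; this is the memory overhead of the map.
    maps = [[global_ord[t] for t in terms] for terms in segment_terms]
    return global_terms, maps


def facet_counts(segment_docs, maps, global_terms):
    """segment_docs: per segment, one segment ordinal per document."""
    counts = [0] * len(global_terms)
    for seg, doc_ords in enumerate(segment_docs):
        seg_map = maps[seg]
        for seg_ord in doc_ords:
            counts[seg_map[seg_ord]] += 1  # integer work only, no term lookup
    # Terms are resolved once, after counting is finished.
    return {global_terms[g]: c for g, c in enumerate(counts) if c}


# Hypothetical index with two segments that share the term "cherry".
segment_terms = [["apple", "cherry"], ["banana", "cherry"]]
segment_docs = [[0, 1, 1], [1, 0]]

global_terms, maps = build_global_ordinals(segment_terms)
result = facet_counts(segment_docs, maps, global_terms)
```

Note that ordinal 1 means "cherry" in both segments here, but ordinal 0 means "apple" in the first and "banana" in the second, which is why the raw segment ordinals cannot be summed directly.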

> *2)* Correlate outside the search system each term to a custom ID. Index
> the custom ID. After facets are calculated resolve the ID and show the
> human readable labels.

Assuming the ID is an integer (about the only thing that makes sense), this ensures that the IDs are comparable across segments, so no segment->global mapping is needed. This removes the performance penalty described above and is (as far as I understand) the principle behind Lucene faceting.

On the other hand, this approach is indeed more complicated, and it introduces another hotspot both for indexing (as document construction requires a lookup in the term provider) and for searching (for resolving the final terms).

If we had a hashing method String->long and guaranteed that there would be no collisions (or we accepted the occasional faulty result), then we could avoid the segment->global map as well as the centralized term server. To my knowledge, this has not yet been attempted.

- Toke Eskildsen