From: Chitra R
Date: Wed, 30 Nov 2016 14:40:53 +0530
Subject: Re: Faceting : what
are the limitations of Taxonomy (Separate index and hierarchical facets) and SortedSetDocValuesFacetField (flat facets and no sidecar index)?
To: Lucene Users, serera@gmail.com

Thank you so much, Shai...

Chitra

On Wed, Nov 30, 2016 at 2:17 PM, Shai Erera wrote:

> This feature is not available in Lucene currently, but it shouldn't be hard
> to add it. See Mike's comment here:
> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html?showComment=1412777154420#c363162440067733144
>
> One more tricky (yet nicer) feature would be to have it all in one go, i.e.
> you'd say something like "facet on field price" and you'd get "interesting"
> buckets, per the variance in the results.
>
> But before that, we could have a StatsFacets in Lucene which provides some
> statistics about a numeric field (min/max/avg etc.).
>
> On Wed, Nov 30, 2016 at 7:50 AM Chitra R wrote:
>
> > Thank you so much, Mike... I learned a lot about doc values faceting
> > and cleared up all my doubts. Thanks..!!
> >
> > *Another use case:*
> >
> > After getting the matching documents for a given query, is there any way to
> > calculate min and max values on a NumericDocValuesField (say, a date field)?
> >
> > I would like to implement numeric range faceting by splitting the
> > numeric values (taken from the resulting documents) into ranges.
> >
> > Chitra
> >
> > On Wed, Nov 30, 2016 at 3:51 AM, Michael McCandless <lucene@mikemccandless.com> wrote:
> >
> > > Doc values fields are never loaded into memory; at most some small
> > > index structures are.
> > >
> > > When you use those fields, the bytes (for just the one doc values
> > > field you are using) are pulled from disk, and the OS will cache them
> > > in memory if available.
> > >
> > > Mike McCandless
> > >
> > > http://blog.mikemccandless.com
> > >
> > > On Mon, Nov 28, 2016 at 6:01 AM, Chitra R wrote:
> > > > Hi,
> > > >     When opening SortedSetDocValuesReaderState at search time, is the
> > > > information from the whole doc values files (.dvd & .dvm) loaded into memory,
> > > > or only the information for the specified field (say, the $facets field)?
> > > >
> > > > Any help is much appreciated.
> > > >
> > > > Regards,
> > > > Chitra
> > > >
> > > > On Tue, Nov 22, 2016 at 5:47 PM, Chitra R wrote:
> > > >>
> > > >> Kindly post your suggestions.
> > > >>
> > > >> Regards,
> > > >> Chitra
> > > >>
> > > >> On Sat, Nov 19, 2016 at 1:38 PM, Chitra R wrote:
> > > >>>
> > > >>> Hey, I got it clearly. Thank you so much. Could you please help us to
> > > >>> implement it in our use case?
> > > >>>
> > > >>> In our case, we have a dynamic index and it is variable depth too. So
> > > >>> flat facets are enough. No need for hierarchical facets.
> > > >>>
> > > >>> What I think is,
> > > >>>
> > > >>> Index my facet field as a normal doc values field, so that no special
> > > >>> operation (like taxonomy and sorted set doc values facet field) will be done
> > > >>> at index time and only the doc values field stores its ordinals in their
> > > >>> respective field.
> > > >>> At search time, I will pass the query (user search query) and filter (path
> > > >>> traversed list) and collect the matching documents in FacetsCollector.
> > > >>> To compute the facet count for the specific field, I will gather those
> > > >>> resulting docs, then move through each segment, collecting the matching
> > > >>> ordinals using AtomicReader.
> > > >>>
> > > >>> Note that with this approach I can't calculate facet counts for more than
> > > >>> one field (facet) in a search.
> > > >>>
> > > >>> Instead of loading all the dimensions in DocValuesReaderState (which will take
> > > >>> more time and memory) at search time, loading specific fields will take less
> > > >>> time and memory, I hope. Kindly help to solve this.
> > > >>>
> > > >>> It will do it with minimal index and search cost, I think. And I hope this
> > > >>> won't add overhead at index time; at search time it should also be better.
> > > >>>
> > > >>> Kindly post your suggestions.
> > > >>>
> > > >>> Regards,
> > > >>> Chitra
> > > >>>
> > > >>> On Fri, Nov 18, 2016 at 7:15 PM, Michael McCandless wrote:
> > > >>>>
> > > >>>> I think you've summed up exactly the differences!
> > > >>>>
> > > >>>> And, yes, it would be possible to emulate hierarchical facets on top
> > > >>>> of flat facets, if the hierarchy is fixed depth like year/month/day.
> > > >>>>
> > > >>>> But if it's variable depth, it's trickier (but I think still
> > > >>>> possible). See e.g. the Committed Paths drill-down on the left, on
> > > >>>> our dog-food server
> > > >>>> http://jirasearch.mikemccandless.com/search.py?index=jira
> > > >>>>
> > > >>>> Mike McCandless
> > > >>>>
> > > >>>> http://blog.mikemccandless.com
> > > >>>>
> > > >>>> On Fri, Nov 18, 2016 at 1:43 AM, Chitra R wrote:
> > > >>>> > case 1:
> > > >>>> >     In taxonomy, each indexed document's facet labels are examined,
> > > >>>> > their ordinals and mappings are computed, and these are stored in the
> > > >>>> > sidecar index at index time.
> > > >>>> >
> > > >>>> > case 2:
> > > >>>> >     In doc values, these (ordinals) are computed at search time, so
> > > >>>> > there will be a time and memory trade-off between the two cases, I think.
> > > >>>> >
> > > >>>> > In taxonomy, building hierarchical facets at index time makes the faceting
> > > >>>> > cost at search time lower than for flat facets in doc values.
> > > >>>> >
> > > >>>> > Apart from memory, time and NRT latency, is there any other difference between
> > > >>>> > hierarchical and flat facets at search time?
> > > >>>> >
> > > >>>> > Kindly post your suggestions...
> > > >>>> >
> > > >>>> > Regards,
> > > >>>> > Chitra
> > > >>>> >
> > > >>>> > On Thu, Nov 17, 2016 at 6:40 PM, Chitra R <chithu.r111@gmail.com> wrote:
> > > >>>> >>
> > > >>>> >> Okay. I agree with you, Taxonomy maintains and supports hierarchical
> > > >>>> >> facets during indexing. By hierarchical I mean that we might index the
> > > >>>> >> field Publish date: 2010/10/15 as Publish date: 2010, Publish date:
> > > >>>> >> 2010/10 and Publish date: 2010/10/15; their facet ordinals are maintained
> > > >>>> >> in the sidecar index and mapped to the main index.
> > > >>>> >>
> > > >>>> >> For example:
> > > >>>> >>
> > > >>>> >> In search-lucene.com, I enter a term (say facet); the top
> > > >>>> >> documents and their categories are displayed after performing the search.
> > > >>>> >> Say I drill down through Publish date/2010 to collect its child counts, and
> > > >>>> >> afterwards I pass through Publish date/2010/10 to collect its child counts.
> > > >>>> >> For each drill-down, a search is performed to collect its top
> > > >>>> >> docs and categories.
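
The Publish date example above (indexing 2010/10/15 as 2010, 2010/10 and 2010/10/15) can be sketched in plain Java, without any Lucene API. This is only an illustration of the idea of emulating a hierarchy with flat labels; the class and method names (PathPrefixFacets, expand, childCounts) are hypothetical and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not a Lucene class: emulates hierarchical facets
// by expanding each path into all of its prefixes (flat labels).
public class PathPrefixFacets {

    // "2010/10/15" -> ["2010", "2010/10", "2010/10/15"]
    static List<String> expand(String path) {
        List<String> labels = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        for (String part : path.split("/")) {
            if (sb.length() > 0) {
                sb.append('/');
            }
            sb.append(part);
            labels.add(sb.toString());
        }
        return labels;
    }

    // Counts the direct children of `parent` ("" means the top level)
    // over the paths of the matching documents.
    static Map<String, Integer> childCounts(List<String> docPaths, String parent) {
        String prefix = parent.isEmpty() ? "" : parent + "/";
        Map<String, Integer> counts = new TreeMap<>();
        for (String path : docPaths) {
            for (String label : expand(path)) {
                boolean underParent = label.startsWith(prefix) && label.length() > prefix.length();
                // A direct child has no further '/' after the parent prefix.
                if (underParent && label.indexOf('/', prefix.length()) < 0) {
                    counts.merge(label, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("2010/10/15", "2010/10/20", "2010/11/01", "2011/01/05");
        System.out.println(childCounts(docs, ""));     // top-level years
        System.out.println(childCounts(docs, "2010")); // months under 2010
    }
}
```

Each drill-down then only changes the parent prefix used for counting, which is the "changing the drill down query" idea mentioned in the next paragraph of the thread.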
> > > >>>> >>
> > > >>>> >> I can achieve even this in flat facets by changing the
> > > >>>> >> drill-down query.
> > > >>>> >>
> > > >>>> >> Am I right, or have I missed anything? I don't yet know if I missed
> > > >>>> >> anything...
> > > >>>> >>
> > > >>>> >> So what is the need for hierarchical facets? Could you please explain
> > > >>>> >> them (hierarchical facets) in a real-world use case?
> > > >>>> >>
> > > >>>> >> Regards,
> > > >>>> >> Chitra
> > > >>>> >>
> > > >>>> >> On Wed, Nov 16, 2016 at 7:36 PM, Michael McCandless wrote:
> > > >>>> >>>
> > > >>>> >>> You store dimension + string (a single value path, since it's not
> > > >>>> >>> hierarchical) into SSDVFF so that you can compute facet counts, either
> > > >>>> >>> ordinary drill down counts or the drill sideways counts.
> > > >>>> >>>
> > > >>>> >>> You can see examples of drill sideways at
> > > >>>> >>> http://jirasearch.mikemccandless.com, e.g. drill down on any of those
> > > >>>> >>> fields on the left and you don't lose the previous facet counts for
> > > >>>> >>> that field.
> > > >>>> >>>
> > > >>>> >>> Mike McCandless
> > > >>>> >>>
> > > >>>> >>> http://blog.mikemccandless.com
> > > >>>> >>>
> > > >>>> >>> On Wed, Nov 16, 2016 at 8:51 AM, Chitra R <chithu.r111@gmail.com> wrote:
> > > >>>> >>> > Hi,
> > > >>>> >>> >
> > > >>>> >>> > Lucene-Drill sideways
> > > >>>> >>> >
> > > >>>> >>> > jira_issue:LUCENE-4748
> > > >>>> >>> >
> > > >>>> >>> > Is this the reason (i.e. drill sideways makes
> > > >>>> >>> > a very nice faceted search UI because we
> > > >>>> >>> > don't "lose" the facet counts after drilling in) behind storing the path
> > > >>>> >>> > and dimension for the given SSDVF field? Or is there something else?
> > > >>>> >>> >
> > > >>>> >>> > Regards,
> > > >>>> >>> > Chitra
> > > >>>> >>> >
> > > >>>> >>> > Hey, thank you so much for the fast response. I agree that NRT refresh is
> > > >>>> >>> > a somewhat costly operation and that this is the major pitfall if we
> > > >>>> >>> > use doc values faceting.
> > > >>>> >>> >
> > > >>>> >>> > When indexing a SortedSetDocValuesFacetField, it stores
> > > >>>> >>> > the path and dimension of the given field internally. So can we achieve
> > > >>>> >>> > hierarchical facets using DrillDownQuery? I assume the purpose of storing path
> > > >>>> >>> > and dimension is to achieve hierarchical facets. If yes (i.e. we can achieve
> > > >>>> >>> > hierarchy in SSDVFF), what is the need to move over to taxonomy?
> > > >>>> >>> > Or have I missed anything?
> > > >>>> >>> >
> > > >>>> >>> > What is the real purpose of storing path and dimension in an
> > > >>>> >>> > SSDVF field?
> > > >>>> >>> >
> > > >>>> >>> > Kindly post your suggestions.
> > > >>>> >>> >
> > > >>>> >>> > Regards,
> > > >>>> >>> > Chitra
> > > >>>> >>> >
> > > >>>> >>> > On Sat, Nov 12, 2016 at 4:03 AM, Michael McCandless wrote:
> > > >>>> >>> >>
> > > >>>> >>> >> On Fri, Nov 11, 2016 at 5:21 AM, Chitra R <chithu.r111@gmail.com> wrote:
> > > >>>> >>> >>
> > > >>>> >>> >> > i) I assume that when opening SortedSetDocValuesReaderState, we are
> > > >>>> >>> >> > calculating ordinals (these will be used to calculate facet counts) for the
> > > >>>> >>> >> > doc values field, and that this alone is what makes the state instance somewhat
> > > >>>> >>> >> > costly.
> > > >>>> >>> >> > Am I right, or is there any other reason behind
> > > >>>> >>> >> > that?
> > > >>>> >>> >>
> > > >>>> >>> >> That's correct. It adds some latency to an NRT refresh, and some heap
> > > >>>> >>> >> used to hold the ordinal mappings.
> > > >>>> >>> >>
> > > >>>> >>> >> > ii) During indexing, we are providing facet ordinals in each doc,
> > > >>>> >>> >> > and I think this will be useful on the search side, to calculate facet counts
> > > >>>> >>> >> > only for matching docs. Otherwise, does it carry any other benefits?
> > > >>>> >>> >>
> > > >>>> >>> >> Well, compared to the taxonomy facets, SSDV facets don't require a
> > > >>>> >>> >> separate index.
> > > >>>> >>> >>
> > > >>>> >>> >> But they add latency/heap usage, and they cannot do hierarchical
> > > >>>> >>> >> facets yet (though this could be fixed if someone just built it).
> > > >>>> >>> >>
> > > >>>> >>> >> > iii) Is SortedSetDocValuesReaderState thread-safe (i.e. can
> > > >>>> >>> >> > multiple threads call this method concurrently)?
> > > >>>> >>> >>
> > > >>>> >>> >> Yes.
> > > >>>> >>> >>
> > > >>>> >>> >> Mike McCandless
> > > >>>> >>> >>
> > > >>>> >>> >> http://blog.mikemccandless.com
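
The min/max use case raised in this thread (compute min and max of a numeric field over the matching documents, then split that span into ranges) can be sketched in plain Java as follows. This is a minimal sketch, not Lucene code: it assumes the numeric values of the matching documents have already been read into a long[] by whatever means, and the names RangeBuckets, minMax and bucketCounts are hypothetical.

```java
// Hypothetical helper, not a Lucene class. Assumes the values of the
// matching documents are already available as a non-empty long[].
public class RangeBuckets {

    // Returns {min, max} over the given values.
    static long[] minMax(long[] values) {
        long min = values[0];
        long max = values[0];
        for (long v : values) {
            if (v < min) min = v;
            if (v > max) max = v;
        }
        return new long[] {min, max};
    }

    // Splits [min, max] into n equal-width ranges and counts values per range.
    static int[] bucketCounts(long[] values, int n) {
        long[] mm = minMax(values);
        long min = mm[0];
        long max = mm[1];
        int[] counts = new int[n];
        // A double width avoids zero-width ranges when (max - min) < n.
        double width = (double) (max - min) / n;
        for (long v : values) {
            int bucket = (width == 0) ? 0 : (int) ((v - min) / width);
            if (bucket >= n) {
                bucket = n - 1; // the max value belongs to the last range
            }
            counts[bucket]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] values = {10, 20, 30, 40, 100};
        long[] mm = minMax(values);
        System.out.println("min=" + mm[0] + " max=" + mm[1]);
        System.out.println(java.util.Arrays.toString(bucketCounts(values, 3)));
    }
}
```

Equal-width ranges are the simplest choice; the "interesting" variance-based buckets Shai describes would need a different splitting rule.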