From: "ASF subversion and git services (JIRA)"
To: dev@lucene.apache.org
Date: Wed, 31 Aug 2016 20:57:21 +0000 (UTC)
Subject: [jira] [Commented] (SOLR-9142) JSON Facet, add hash table method for terms


    [ https://issues.apache.org/jira/browse/SOLR-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15453337#comment-15453337 ]

ASF subversion and git services commented on SOLR-9142:
-------------------------------------------------------

Commit
7b5df8a10391f5b824e8ea1793917ff60b64b8a8 in lucene-solr's branch refs/heads/master from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7b5df8a ]

SOLR-9142: json.facet: new method=dvhash which works on terms. Also:
(1) method=stream now requires you set sort=index asc to work
(2) faceting on numerics with prefix or mincount=0 will give you an error
(3) refactored similar findTopSlots into one common one in FacetFieldProcessor
(4) new DocSet.collectSortedDocSet utility


> JSON Facet, add hash table method for terms
> -------------------------------------------
>
>                 Key: SOLR-9142
>                 URL: https://issues.apache.org/jira/browse/SOLR-9142
>             Project: Solr
>          Issue Type: Improvement
>          Components: Facet Module
>            Reporter: Varun Thacker
>            Assignee: David Smiley
>             Fix For: 6.3
>
>         Attachments: SOLR_9412_FacetFieldProcessorByHashDV.patch, SOLR_9412_FacetFieldProcessorByHashDV.patch, SOLR_9412_FacetFieldProcessorByHashDV.patch, SOLR_9412_FacetFieldProcessorByHashDV.patch, SOLR_9412_FacetFieldProcessorByHashDV.patch
>
>
> I indexed a dataset of 2M docs
> {{top_facet_s}} has a cardinality of 1000 which is the top level facet.
> For nested facets it has two fields {{sub_facet_unique_s}} and {{sub_facet_unique_td}} which are string and double and have cardinality 2M
> The nested query for the double field returns in the 1s mark always. The nested query for the string field takes roughly 10s to execute.
> {code:title=nested string facet|borderStyle=solid}
> q=*:*&rows=0&json.facet=
> 	{
> 	  "top_facet_s": {
> 	    "type": "terms",
> 	    "limit": -1,
> 	    "field": "top_facet_s",
> 	    "mincount": 1,
> 	    "excludeTags": "ANY",
> 	    "facet": {
> 	      "sub_facet_unique_s": {
> 	        "type": "terms",
> 	        "limit": 1,
> 	        "field": "sub_facet_unique_s",
> 	        "mincount": 1
> 	      }
> 	    }
> 	  }
> 	}
> {code}
> {code:title=nested double facet|borderStyle=solid}
> q=*:*&rows=0&json.facet=
> 	{
> 	  "top_facet_s": {
> 	    "type": "terms",
> 	    "limit": -1,
> 	    "field": "top_facet_s",
> 	    "mincount": 1,
> 	    "excludeTags": "ANY",
> 	    "facet": {
> 	      "sub_facet_unique_s": {
> 	        "type": "terms",
> 	        "limit": 1,
> 	        "field": "sub_facet_unique_td",
> 	        "mincount": 1
> 	      }
> 	    }
> 	  }
> 	}
> {code}
> I tried to dig deeper to understand why nested faceting on a string field is so much slower than on a numeric field.
> Since the top facet has a cardinality of 1000, we have to calculate sub-facets for each of those 1000 buckets. The key difference is in the implementation of the two.
> For the string field, in {{FacetField#getFieldCacheCounts}} we call {{createCollectAcc}} with nDocs=0 and numSlots=2M. This then initializes an array of 2M. So we create a 2M array 1000 times for this one query, which from what I understand makes this query slow.
> For numeric fields, {{FacetFieldProcessorNumeric#calcFacets}} uses a CountSlotAcc which doesn't allocate a huge array. In this query it calls {{createCollectAcc}} with numDocs=2k and numSlots=1024.
> In string faceting, we create the 2M array because the cardinality is 2M and we use the array position as the ordinal and the value as the count. If we could improve on this, would it speed things up significantly? For sub-facets we know the cardinality can be at most the top-level bucket count.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
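
The contrast drawn in the quoted analysis (an array sized to the field's cardinality versus a structure sized to the bucket) can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not Solr's FacetFieldProcessor code: the class SubFacetCountSketch and the methods countDense and countSparse are hypothetical names, term ordinals are passed in as plain ints, and java.util.HashMap stands in for whatever hash table the dvhash method actually uses.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; names are hypothetical and this is not Solr code. */
final class SubFacetCountSketch {

    /**
     * Dense strategy: one counter per term ordinal in the whole field.
     * The array is allocated (and zeroed) once per parent bucket,
     * no matter how few documents that bucket holds.
     */
    static int[] countDense(int[] docOrds, int fieldCardinality) {
        int[] counts = new int[fieldCardinality];
        for (int ord : docOrds) {
            counts[ord]++; // array position is the ordinal, value is the count
        }
        return counts;
    }

    /**
     * Sparse strategy: a hash table keyed by ordinal.
     * It holds at most docOrds.length entries, so its size follows
     * the bucket rather than the field.
     */
    static Map<Integer, Integer> countSparse(int[] docOrds) {
        Map<Integer, Integer> counts = new HashMap<>(Math.max(16, docOrds.length * 2));
        for (int ord : docOrds) {
            counts.merge(ord, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        int fieldCardinality = 2_000_000;          // as in the 2M-term field above
        int[] bucketOrds = {5, 1_999_999, 5, 42};  // ordinals seen in one small parent bucket

        int[] dense = countDense(bucketOrds, fieldCardinality);
        Map<Integer, Integer> sparse = countSparse(bucketOrds);

        System.out.println("dense slots allocated: " + dense.length);
        System.out.println("sparse entries stored: " + sparse.size());
        System.out.println("count for ordinal 5: " + dense[5] + " / " + sparse.get(5));
    }
}
```

With 1000 parent buckets, the dense variant would repeat the 2M-slot allocation 1000 times, while the sparse variant's total work stays proportional to the number of values actually visited. The trade-off is per-entry overhead (boxing and hashing in this sketch), which is why a dense array can still win when a bucket covers most of the field's terms.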