lucene-dev mailing list archives

From "David Smiley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-9142) JSON Facet, add hash table method for terms
Date Thu, 01 Sep 2016 03:56:20 GMT

    [ https://issues.apache.org/jira/browse/SOLR-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15454221#comment-15454221 ]

David Smiley commented on SOLR-9142:
------------------------------------

bq. Perhaps we could use the "Bits" interface instead when we want/require a fast random access
set.

Do you mean this? Code that needs a fast set would be changed to work on the Bits interface,
and we'd change HashDocSet into a hypothetical HashBits that implements Bits. Meanwhile, any
DocSet that is already fast for random access could be enhanced to either implement Bits or
expose a Bits? +1 to that.
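
To make the idea concrete, here is a minimal sketch of what such a HashBits might look like. Everything in it is an assumption for illustration: the {{Bits}} interface below is a local stand-in that mirrors the two methods of Lucene's {{org.apache.lucene.util.Bits}}, and {{HashBits}} is the hypothetical class named above, not existing Solr code.

```java
import java.util.Arrays;

/** Local stand-in mirroring the two methods of Lucene's org.apache.lucene.util.Bits. */
interface Bits {
  boolean get(int index);
  int length();
}

/** Hypothetical HashBits: an open-addressing hash set of doc IDs exposed through Bits. */
final class HashBits implements Bits {
  private static final int EMPTY = -1; // doc IDs are non-negative, so -1 marks a free slot
  private final int[] table;
  private final int mask;
  private final int maxDoc;

  HashBits(int[] docs, int maxDoc) {
    // Power-of-two capacity of at least 2x the number of docs (load factor <= 0.5),
    // so linear probing always finds an empty slot and terminates.
    int cap = 2;
    while (cap < 2L * docs.length) {
      cap <<= 1;
    }
    this.table = new int[cap];
    this.mask = cap - 1;
    this.maxDoc = maxDoc;
    Arrays.fill(table, EMPTY);
    for (int doc : docs) {
      if (doc < 0 || doc >= maxDoc) {
        throw new IllegalArgumentException("doc out of range: " + doc);
      }
      int slot = slotFor(doc);
      while (table[slot] != EMPTY && table[slot] != doc) {
        slot = (slot + 1) & mask; // linear probing
      }
      table[slot] = doc;
    }
  }

  private int slotFor(int doc) {
    int h = doc * 0x9E3779B9; // cheap multiplicative scramble
    return (h ^ (h >>> 16)) & mask;
  }

  @Override
  public boolean get(int index) {
    int slot = slotFor(index);
    while (table[slot] != EMPTY) {
      if (table[slot] == index) {
        return true;
      }
      slot = (slot + 1) & mask;
    }
    return false;
  }

  @Override
  public int length() {
    return maxDoc;
  }
}
```

Callers that only need membership tests would then depend on {{Bits}} alone, so a bitset-backed DocSet and a hash-backed one become interchangeable at that call site.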

bq. I was surprised this adds a method (dvhash). 

Even if we had a heuristic to auto-pick this, sometimes the user knows best.
OK; I could imagine the heuristic's number itself being tunable to get the intended effect.
So if we can add auto-tuning before v6.3, then we don't need a method=dvhash.

bq. Although perhaps convenient for testing things out, it would be tedious in production
since the best method will depend on the domain size, which will often not be known ahead
of time by the user. For the normal "dv" method, we should definitely make it pick hashing
when the domain is much smaller than the number of unique terms in the field. We already do
stuff like this in the DV faceting to pick whether we accumulate global ords, or accumulate
local (per-seg) ords first and then do a mapping at the end to global ords.

Certainly that would be nice; it's a TODO in the method-selection code to auto-pick dvhash.
If it were trivial I would have added it, but the method-selection code doesn't conveniently
have access to the Terms/DocValues to know the stats. Furthermore, we might want to try to
get stats from Terms and/or DocValues, depending on which is available. There are other TODOs
as well, like supporting multi-valued fields and prefix.
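
For illustration only, the auto-pick discussed above could be sketched roughly as follows. The names {{FacetMethod}}, {{FacetMethodChooser}} and the {{ratio}} parameter are all hypothetical and do not exist in the patch; the sketch also assumes the domain size and the field's unique term count are already at hand, which is exactly the part that is not convenient today.

```java
/** Hypothetical stand-in for the two methods under discussion. */
enum FacetMethod { DV, DVHASH }

final class FacetMethodChooser {
  private FacetMethodChooser() {}

  /**
   * Picks hashing when the domain is much smaller than the number of unique terms.
   *
   * @param domainSize  number of docs in the facet domain (assumed known here)
   * @param uniqueTerms number of unique terms in the field (assumed known here)
   * @param ratio       hypothetical tunable: how many times smaller the domain must be
   */
  static FacetMethod choose(long domainSize, long uniqueTerms, long ratio) {
    if (ratio < 1) {
      throw new IllegalArgumentException("ratio must be >= 1");
    }
    // Divide rather than multiply so large inputs cannot overflow.
    return domainSize < uniqueTerms / ratio ? FacetMethod.DVHASH : FacetMethod.DV;
  }
}
```

With the numbers from this issue (sub-facet domains of about 2k docs against 2M unique terms) any moderate ratio would select hashing, while a domain close to the full index would stay on the array-based path.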



> JSON Facet, add hash table method for terms
> -------------------------------------------
>
>                 Key: SOLR-9142
>                 URL: https://issues.apache.org/jira/browse/SOLR-9142
>             Project: Solr
>          Issue Type: Improvement
>          Components: Facet Module
>            Reporter: Varun Thacker
>            Assignee: David Smiley
>             Fix For: 6.3
>
>         Attachments: SOLR_9412_FacetFieldProcessorByHashDV.patch, SOLR_9412_FacetFieldProcessorByHashDV.patch,
SOLR_9412_FacetFieldProcessorByHashDV.patch, SOLR_9412_FacetFieldProcessorByHashDV.patch,
SOLR_9412_FacetFieldProcessorByHashDV.patch
>
>
> I indexed a dataset of 2M docs.
> {{top_facet_s}}, the top-level facet, has a cardinality of 1000.
> For nested facets there are two fields, {{sub_facet_unique_s}} and {{sub_facet_unique_td}}, which are string and double respectively and each have a cardinality of 2M.
> The nested query for the double field always returns in about 1s. The nested query for the string field takes roughly 10s to execute.
> {code:title=nested string facet|borderStyle=solid}
> q=*:*&rows=0&json.facet=
> 	{
> 		"top_facet_s": {
> 			"type": "terms",
> 			"limit": -1,
> 			"field": "top_facet_s",
> 			"mincount": 1,
> 			"excludeTags": "ANY",
> 			"facet": {
> 				"sub_facet_unique_s": {
> 					"type": "terms",
> 					"limit": 1,
> 					"field": "sub_facet_unique_s",
> 					"mincount": 1
> 				}
> 			}
> 		}
> 	}
> {code}
> {code:title=nested double facet|borderStyle=solid}
> q=*:*&rows=0&json.facet=
> 	{
> 		"top_facet_s": {
> 			"type": "terms",
> 			"limit": -1,
> 			"field": "top_facet_s",
> 			"mincount": 1,
> 			"excludeTags": "ANY",
> 			"facet": {
> 				"sub_facet_unique_s": {
> 					"type": "terms",
> 					"limit": 1,
> 					"field": "sub_facet_unique_td",
> 					"mincount": 1
> 				}
> 			}
> 		}
> 	}
> {code}
> I tried to dig deeper to understand why nested faceting on a string field is so much slower than on a numeric field.
> Since the top facet has a cardinality of 1000, we have to calculate sub-facets for each of its buckets. The key difference is in the implementation of the two.
> For the string field, in {{FacetField#getFieldCacheCounts}} we call {{createCollectAcc}} with nDocs=0 and numSlots=2M. This then initializes an array of 2M. So we create a 2M array 1000 times for this one query, which, from what I understand, makes this query slow.
> For numeric fields, {{FacetFieldProcessorNumeric#calcFacets}} uses a CountSlotAcc which doesn't allocate a huge array. In this query it calls {{createCollectAcc}} with numDocs=2k and numSlots=1024.
> In string faceting, we create the 2M array because the cardinality is 2M and we use the array position as the ordinal and the value as the count. If we could improve on this, it would speed things up significantly. For sub-facets we know the cardinality can be at most the top-level bucket count.
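
The contrast described above (an array with one slot per ordinal versus a structure sized by the bucket) can be sketched as follows. This is an illustration under assumptions, not the code from the patch: {{OrdCounting}} and both method names are hypothetical, and a boxed {{HashMap}} is used only for brevity where a real implementation would likely prefer a primitive hash table.

```java
import java.util.HashMap;
import java.util.Map;

final class OrdCounting {
  private OrdCounting() {}

  /**
   * Dense counting: array position is the ordinal, value is the count.
   * Allocates one slot per unique term in the field, however few docs the bucket has.
   */
  static int[] countDense(int[] ordsInBucket, int numUniqueTerms) {
    int[] counts = new int[numUniqueTerms];
    for (int ord : ordsInBucket) {
      counts[ord]++;
    }
    return counts;
  }

  /**
   * Sparse counting: memory grows with the distinct ordinals actually seen,
   * which is bounded by the number of docs in the bucket.
   */
  static Map<Integer, Integer> countSparse(int[] ordsInBucket) {
    Map<Integer, Integer> counts = new HashMap<>();
    for (int ord : ordsInBucket) {
      counts.merge(ord, 1, Integer::sum);
    }
    return counts;
  }
}
```

For a bucket of about 2k docs over a field with 2M unique terms, the dense version allocates 2M slots per bucket, while the sparse version holds at most about 2k entries.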



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

