lucene-dev mailing list archives

From "Rob Audenaerde (JIRA)" <>
Subject [jira] [Commented] (LUCENE-5476) Facet sampling
Date Fri, 28 Feb 2014 13:50:20 GMT


Rob Audenaerde commented on LUCENE-5476:

Thanks guys for the feedback (also on my language skills, I need to improve my English ;))

> It might be good to allow passing the random seed, for repeatable results?
Yes! This is very sensible for testing and for more 'stable' on-screen results. I will add this.
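
A minimal sketch of what a caller-supplied seed buys, assuming plain java.util.Random. SeededSampler and its method names are hypothetical and are not part of the Lucene facet module; the point is only that two samplers built with the same seed keep the same documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical helper (not a Lucene class): keeps each hit with probability samplingRatio. */
final class SeededSampler {
    private final Random random;
    private final double samplingRatio;

    SeededSampler(long seed, double samplingRatio) {
        // The same seed always produces the same sequence of keep/drop decisions.
        this.random = new Random(seed);
        this.samplingRatio = samplingRatio;
    }

    /** Returns the doc ids that survive sampling, in their original order. */
    List<Integer> sample(int[] docIds) {
        List<Integer> kept = new ArrayList<>();
        for (int docId : docIds) {
            if (random.nextDouble() < samplingRatio) {
                kept.add(docId);
            }
        }
        return kept;
    }
}
```

In a test, the seed could be fixed so that the sampled counts are identical from run to run.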

> Another option, which would save the 2nd pass, would be to do the sampling during Docs.addDoc.
I considered sampling in 'addDoc', but I figured it would be more expensive, as we would then
need a random() call for each hit.

> I think SamplingFC.createDocs should return a declared SampledDocs (see later) instead of
> anonymous class
I also considered this. It is far better for clarity's sake, but it also costs a copy of the
original. I will try some approaches and make sure the sampling is only done once.

> I like that this impl samples per-segment as it allows to tune the sample on a per-segment
> basis. E.g. small segments (as in NRT) probably don't need to be sampled at all. If we allow
> passing different parameters such as sampleRatio, min/maxSampleSize, we could tune sampling
This was more or less by accident, but it indeed seems useful. All segments need the same
sampling ratio, though; otherwise it would be really hard to correct the counts afterwards.
(Or am I missing something here?)
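
To make the correction step concrete: with a single ratio for all segments, a count observed on the sample can be scaled back with one division. The sketch below is illustrative only; SampledCountCorrection and estimateTotal are hypothetical names, not the patch's API.

```java
/** Hypothetical helper (not a Lucene class): scales a sampled facet count back up. */
final class SampledCountCorrection {
    private SampledCountCorrection() {}

    /**
     * Estimates the count over all hits from the count seen in the sample,
     * assuming every segment was sampled with the same samplingRatio.
     */
    static long estimateTotal(long sampledCount, double samplingRatio) {
        if (samplingRatio <= 0.0 || samplingRatio > 1.0) {
            throw new IllegalArgumentException("samplingRatio must be in (0, 1]");
        }
        return Math.round(sampledCount / samplingRatio);
    }
}
```

For example, a facet value counted 120 times in a 1% sample would be reported as roughly 12000.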

> Maybe wrap all the parameters in a SamplingConfig?
Yes. That would be very useful, and it makes the API more stable.

> The old implementation let you specify different parameters such as sample size, minimum number
> of documents to evaluate, maximum number of documents to evaluate etc

The old-style sampling indeed had a fixed sample size, which I found very useful. However,
I have not yet found a way to implement this, as I do not know the total number of results
when I start faceting, so I cannot determine the samplingRatio. I could of course first
count all results, but that would also hurt performance, as I would need two passes. I will give
it some more thought, but maybe you have an idea on how to accomplish this in a better way?
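
One standard technique for drawing a fixed-size sample from a stream of unknown length is reservoir sampling (Algorithm R). It is not what the patch does; the sketch below only shows the idea, and ReservoirSampler with all of its method names is hypothetical. The effective ratio needed to correct the counts is known once the last hit has been seen.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical helper (not a Lucene class): fixed-size uniform sample of a stream of doc ids. */
final class ReservoirSampler {
    private final int[] reservoir;
    private final Random random;
    private int seen = 0; // number of hits offered so far

    ReservoirSampler(int sampleSize, long seed) {
        this.reservoir = new int[sampleSize];
        this.random = new Random(seed);
    }

    /** Offers one hit; each hit seen so far stays in the sample with equal probability. */
    void add(int docId) {
        if (seen < reservoir.length) {
            reservoir[seen] = docId;          // fill phase: keep the first sampleSize hits
        } else {
            int j = random.nextInt(seen + 1); // uniform in [0, seen]
            if (j < reservoir.length) {
                reservoir[j] = docId;         // replace a random slot
            }
        }
        seen++;
    }

    /** Returns a copy of the current sample (not sorted by doc id). */
    int[] sample() {
        return Arrays.copyOf(reservoir, Math.min(seen, reservoir.length));
    }

    /** Ratio of sampled hits to seen hits, usable for correcting counts afterwards. */
    double effectiveRatio() {
        return seen == 0 ? 1.0 : (double) Math.min(seen, reservoir.length) / seen;
    }
}
```

This avoids the counting pass, but it still needs one random call per hit once the reservoir is full, and the resulting sample is not in doc id order.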

> Facet sampling
> --------------
>                 Key: LUCENE-5476
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments:
> With LUCENE-5339 facet sampling disappeared. 
> When trying to display facet counts on large datasets (>10M documents), counting facets
> is rather expensive, as all the hits are collected and processed.
> Sampling greatly reduced this and thus provided a nice speedup. Could it be brought back?

This message was sent by Atlassian JIRA
