lucene-java-user mailing list archives

From Tricia Williams <pgwil...@student.cs.uwaterloo.ca>
Subject Re: BitSet in a HitCollector
Date Thu, 06 Jul 2006 15:59:33 GMT
Hi James,

    A paper mentioned on this list in the last couple of months presents
a solution to your sampling problem that does not require knowing the
total result size in advance.  The paper
(http://www2005.org/cdrom/docs/p245.pdf) presents two solutions, both of
which use a random variable.  In the first, you traverse the result set
and select each document with probability p, where p is fixed in advance.
In the second, the paper describes an algorithm (bottom of page 248) for
determining a skip value.  It is similar to the traversal, but lets you
jump over documents and so saves the per-document probability
computation that the first solution requires.
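
    To make the two approaches concrete, here is a rough sketch in plain
Java.  It is not code from the paper: the names SamplingSketch,
sampleEach, sampleWithSkips and nextSkip are made up for illustration,
an int[] stands in for the doc ids that would be passed to collect, and
the skip is drawn with the standard geometric-skip formula, which may
differ in detail from the algorithm on page 248.  It assumes 0 < p < 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch: sampling a result stream without knowing its size. */
final class SamplingSketch {

    /** Approach 1: one random draw per document, keep each with probability p. */
    static List<Integer> sampleEach(int[] docIds, double p, Random rnd) {
        List<Integer> sample = new ArrayList<>();
        for (int doc : docIds) {
            if (rnd.nextDouble() < p) {
                sample.add(doc);
            }
        }
        return sample;
    }

    /**
     * Number of documents to pass over before the next selected one.
     * This is a geometric random variable: P(skip = k) = (1 - p)^k * p.
     * Assumes 0 < p < 1.
     */
    static int nextSkip(double p, Random rnd) {
        double u = 1.0 - rnd.nextDouble(); // u is in (0, 1], so log(u) is finite
        return (int) Math.floor(Math.log(u) / Math.log(1.0 - p));
    }

    /** Approach 2: one random draw per selected document, jumping over the rest. */
    static List<Integer> sampleWithSkips(int[] docIds, double p, Random rnd) {
        List<Integer> sample = new ArrayList<>();
        long i = nextSkip(p, rnd); // long, so a very large skip cannot overflow
        while (i < docIds.length) {
            sample.add(docIds[(int) i]);
            i += 1L + nextSkip(p, rnd);
        }
        return sample;
    }
}
```

    Inside a HitCollector the same idea would presumably become a
countdown field: decrement it on every collect call, and only when it
reaches zero call IndexSearcher.doc(id) and draw the next skip.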

    I hope this helps!

Tricia

On Thu, 6 Jul 2006, James Pine wrote:

> Hey,
>
> Sorry, I will explain a bit more about my collect
> method. Currently my collect method is executing
> IndexSearcher.doc(id) and storing some stuff in a Map
> which I can then retrieve from the HitCollector (much
> like the example in the Lucene In Action book). Of
> course that's somewhat expensive, so I'd like to do
> some statistical sampling based on the result set size
> to try and speed things up.
>
> The way I was thinking about doing this was, during
> the collect method only executing
> IndexSearcher.doc(id) on every Nth document, where N
> is calculated dynamically based on a minimum number X.
> The rule would be:
>
> N = Max(1,(numResults / X))
>
> In order to do this in the collect method, I need to
> know the total number of results before ever invoking
> the collect method right? That seemed to make a case
> for the BitSet/QueryFilter in the constructor.
>
> In addition, someone else on the list mentioned that
> one of the reasons calling IndexSearcher.doc(id) in
> the collect method is expensive is that it causes the
> disk to do a lot of seeking. Maybe that's a moot point
> if one is using a RAMDirectory or an FSDirectory small
> enough that it gets cached by the OS anyway, but if
> it's not, then I thought it might be faster to have
> the HitCollector set the bits in the collect method
> and then do another pass to do the statistical
> sampling.
>
> Either way it seems that to do the statistical
> sampling that I envision I either need to calculate
> the total result count/document id set in the
> constructor, before calling the collect method, or
> calculate the total result count/document id set in
> the collect method and then execute some sort of
> post-collect method, right? So I was just wondering
> which method was better/faster. Thanx.
>
> JAMES
>
> --- Chris Hostetter <hossman_lucene@fucit.org> wrote:
>
>>
>> : I'm using a HitCollector and would like to know the
>> : total number of results that matched a given query.
>> : Based on the JavaDoc, I think this will do the trick:
>>
>> You don't need a BitSet in that case; you could find that out just
>> using an int...
>>
>>     // Counts matching documents without storing anything else.
>>     public class CountingCollector extends HitCollector {
>>       public int count = 0;
>>       public void collect(int doc, float score) {
>>         count++;
>>       }
>>     }
>>     CountingCollector c = new CountingCollector();
>>     searcher.search(query, c);
>>     int numResults = c.count;
>>
>> : If I want to know the total number of results inside
>> : of the HitCollector, i.e. before the collect method
>> : has ever been called, I think I could pass the Query
>> : and Searcher objects into the HitCollector and do this
>> : in its constructor:
>> :
>> : BitSet bits = (new QueryFilter(query)).bits(searcher.getIndexReader());
>> : int numResults = bits.cardinality();
>>
>> This question doesn't make a lot of sense to me. Why do you need to
>> know the total number of results before the collect method is called?
>> What you are suggesting here (using QueryFilter in this way) is
>> perfectly legal, but it's going to do just as much work as using a
>> HitCollector will (possibly more, I can't remember).
>>
>> : Is Lucene executing another pass over the index in
>> : order to populate the BitSet and then doing another
>> : pass while calling the collect method? Thanx.
>>
>> In your last example, you never use your HitCollector, so I'm not
>> sure what you mean, but assuming you are asking about combining
>> those examples into something like this...
>>
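>>   // Declared as IndexSearcher so that getIndexReader() is available.
>>   IndexSearcher searcher = new IndexSearcher(indexReader);
>>   // First pass: run the query once to fill the BitSet.
>>   BitSet bits = (new QueryFilter(query)).bits(searcher.getIndexReader());
>>   final int numResults = bits.cardinality();
>>   // Second pass: run the query again, this time calling collect.
>>   searcher.search(query, new HitCollector() {
>>        public void collect(int doc, float score) {
>>           /* do something with numResults and doc and score */
>>        }
>>   });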
>>
>> ...then yes, you are most definitely making two passes to do that.
>>
>>
>>
>> -Hoss
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
>


