lucene-java-user mailing list archives

From "Karsten F." <>
Subject RE: Faceting, Sort and DocIDSet
Date Wed, 22 Apr 2009 10:06:09 GMT

Hi Dave,

In your case a solution with one int array fits. For each document number
you can store an integer which represents the facet value.
This is what org.apache.solr.request.UnInvertedField will store in your
(*John* : is there something similar in com.browseengine ? )

But UnInvertedField is designed for fields with more than one value per doc.
Implementing the int[IndexReader.maxDoc()] solution directly is probably
easier.
The implementation with int[IndexReader.maxDoc()] should be 150 times faster
than your current solution (and use only 16:300 of your main memory).
But I still wonder why your solution is slow. Did you ever use a profiler?
Enough Xmx? Swapping? Is your implementation of htFacetResults.get possibly
slow? Possibly some waiting because of synchronized code?
By the way, your implementation is not thread-safe: think about two
htFacetResults.get calls before one htFacetResults.put.

For INDEXORDER the MultiSearcher and ParallelMultiSearcher use the docNumber
for each index as score.
So the result is 
1. Doc with docNum 1 from first Index
2. Doc with docNum 1 from second Index
n. Doc with docNum m from first Index
n+1. Doc with docNum m from second Index

I did not know this before, and I was surprised, because the docNum uses the
"starts" array but the score does not.
So you can use a BitSet to collect the hits. The bits themselves are in correct
order (and you could index and search without frequencies).
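
A rough sketch of collecting hits into a BitSet, again in plain Java. The
class IndexOrderCollector and its methods are hypothetical; the starts array
is built here as the running sum of each sub-index's maxDoc, which is an
assumption about how the offsets are formed. Setting bit
starts[subIndex] + localDoc and then walking the set bits yields the hits in
global document order, regardless of the order in which they were collected.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class IndexOrderCollector {
    private final int[] starts; // starts[i] = first global doc number of sub-index i
    private final BitSet hits;

    public IndexOrderCollector(int[] maxDocs) {
        starts = new int[maxDocs.length + 1];
        for (int i = 0; i < maxDocs.length; i++) {
            starts[i + 1] = starts[i] + maxDocs[i];
        }
        hits = new BitSet(starts[maxDocs.length]);
    }

    // Record a hit given the sub-index and the doc number inside it.
    public void collect(int subIndex, int localDoc) {
        hits.set(starts[subIndex] + localDoc);
    }

    // Walk the set bits in ascending order, i.e. in global index order.
    public List<Integer> docsInIndexOrder() {
        List<Integer> docs = new ArrayList<>();
        for (int d = hits.nextSetBit(0); d >= 0; d = hits.nextSetBit(d + 1)) {
            docs.add(d);
        }
        return docs;
    }

    public static void main(String[] args) {
        // Two sub-indexes with five documents each.
        IndexOrderCollector collector = new IndexOrderCollector(new int[] {5, 5});
        collector.collect(1, 1); // global doc 6
        collector.collect(0, 3); // global doc 3
        collector.collect(0, 1); // global doc 1
        collector.collect(1, 0); // global doc 5
        System.out.println(collector.docsInIndexOrder());
    }
}
```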

Best regards

David Seltzer wrote:
> Karsten,
> You're right, 300 facets would be a lot. Hehe. I have one facet with
> about three hundred potential values. What I've done is create a
> FacetManager which, in another thread, sets up a map of ~300 OpenBitSets.
> One bitset for each possible value of the facet.
> Then, rather than using an iterative cardinality comparison, I use a
> HitCollector to create a set of counters.
> public void collect(int doc, float score) {
>    //we don't care about score, all we care about is docID;
>    //we need to find out if this document is in any of our facets...
>    //if it is, increment a counter
>    for(SearchFacet sfTemp : arrayOfSearchFacetsValues) {
>       if(sfTemp.getBitSet().fastGet(doc)) {
>          //this is a hit!
>          long lCount = htFacetResults.get(sfTemp.getTerm().text());
>          htFacetResults.put(sfTemp.getTerm().text(), lCount+1);
>          //this code is designed for mutually exclusive 
>          //facet values... in that scenario, a hit here means
>          //that we can't have a hit anywhere else, so we should
>          //break.
>          break;
>       }
>    }
> }
> Here I seem to be running into a performance issue. It seems that when a
> resultset is small (~10,000) this method greatly outperforms the
> iterating cardinality check. However, when the resultset is large
> (300,000) the HitCollector takes twice as long to process the resultset
> as the other solution.
> Our total index typically contains about 100M documents. This is broken
> up into four monthly indexes each containing 250K documents. And a
> typical search returns < 120,000 results. Lousy searches return more
> results (IE "obama" returns nearly 800,000 documents).
> At the moment we're using ParallelMultiSearcher. When I do a search,
> across four monthly indexes, ordered by INDEXORDER what I get is all of
> the hits that happened on the first of any month, then all the hits that
> happened on the second of any month. Does 'starts' behave the same way
> in ParallelMultiSearcher?
> Thanks for all your input!
> -Dave
