lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ray Tsang <>
Subject Re: Search Results Clustering
Date Wed, 31 Aug 2005 05:10:36 GMT
I had similar requirements of "count" and "group by" on over 130mil
records, it's really a pain.  It's currently usable but not

Currently it's grouping at run-time by iterating through ungrouped
items.  It collects matching documents into BitSet, so subsequent
queries can use BitSet to retrieve the results of original query. 
Moreover, it can mark off documents that are already being grouped
from the BitSet.

In a page that shows 10 records/page, it will only group 10 records at
a time. Consequently, there is no way to know the total number grouped
records in the beginning.

In addition, it feels like reading the field values from the document
in order to look for group-by results is most time consuming.

How does RDBMS do it?


On 8/31/05, kapilChhabra (sent by <> wrote:
> thanks a lot for your suggestion.
> I'll try it and get back if need be.
> Meanwhile, I gave it a thought and concluded that the best time to do the categorization/clustering
should be lucene calculates Hits/in the Scrorer.
> I am not sure if I am right.
> In addition to the current functionality can we modify the Scorer class add the following
> The class generates a 2 dimentional array for the clustered field, the first dimention
contains the distinct values of the field and the second dimention contains the count of results
under this field. This value is incremented for an acceptible hit.
> Does it make sense?
> If it is possible, i'll dig deeper into the code of the Hits/Scorer classes.
> Thanks in advance,
> kapilChhabra
> --
> Sent from the Lucene - Java Users forum at
View raw message