lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Efficient count of documents by type?
Date Thu, 22 Feb 2007 20:29:06 GMT
You might have some luck searching the mailing list for "faceted search", as
I remember there's been quite a discussion on that topic and I *think* it
applies...

Even if you use a HitCollector, you still have to categorize your document,
and all you have is the doc id to work with. But I think you'll be able to
combine a HitCollector with TermDocs and be pretty quick about getting your
results.

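Something like this in your HitCollector (reader is your open IndexReader;
TermDocs is an interface, so you get one from the reader)....
TermDocs td = reader.termDocs(new Term("objecttype", "coupons"));
if (td.skipTo(docId) && td.doc() == docId) {
   // increment coupons counter
}
td.close();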

And repeat ad nauseam.

But what I'd really do is just collect an ordered list of all the doc IDs in
my hitcollector and *then* do something like...

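// reader is your open IndexReader
TermDocs tdCoupons = reader.termDocs(new Term("objecttype", "coupons"));
TermDocs tdMovies = reader.termDocs(new Term("objecttype", "movies"));

// As I read the TermDocs javadoc, skipTo() always advances at least one
// entry, so remember where each enumeration stands and only skip when it
// is still behind docId. Integer.MAX_VALUE marks an exhausted enumeration.
int couponDoc = -1, movieDoc = -1;
int couponCount = 0, movieCount = 0;

for (int docId : setOfAllDocIds) {   // must be in ascending order
    if (couponDoc < docId) {
        couponDoc = tdCoupons.skipTo(docId) ? tdCoupons.doc() : Integer.MAX_VALUE;
    }
    if (couponDoc == docId) {
        couponCount++;
    }
    if (movieDoc < docId) {
        movieDoc = tdMovies.skipTo(docId) ? tdMovies.doc() : Integer.MAX_VALUE;
    }
    if (movieDoc == docId) {
        movieCount++;
    }
}
tdCoupons.close();
tdMovies.close();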

That way, you're progressing through all of the TermDocs in order and not
skipping around so much. I really have no clue how much more efficient this
would be......
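
To make the in-order idea concrete, here is a minimal, self-contained sketch in plain Java. It does not use the Lucene API: sorted int arrays stand in for the doc IDs each TermDocs would enumerate, and the names TypeCounter and countByType, along with the sample doc IDs, are made up for illustration. Each type keeps one cursor that only ever moves forward, which is the whole point of sorting the hits first.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TypeCounter {

    /**
     * Counts how many hits appear in each type's postings.
     * sortedHits and every postings array must be in ascending doc ID order.
     * (Hypothetical helper; arrays stand in for Lucene's TermDocs.)
     */
    static Map<String, Integer> countByType(int[] sortedHits,
                                            Map<String, int[]> postingsByType) {
        String[] types = postingsByType.keySet().toArray(new String[0]);
        int[] cursors = new int[types.length]; // one forward-only cursor per type
        int[] counts = new int[types.length];

        for (int docId : sortedHits) {
            for (int t = 0; t < types.length; t++) {
                int[] postings = postingsByType.get(types[t]);
                // Advance this type's cursor until it reaches or passes docId.
                while (cursors[t] < postings.length && postings[cursors[t]] < docId) {
                    cursors[t]++;
                }
                // A hit belongs to this type only if the cursor landed exactly on it.
                if (cursors[t] < postings.length && postings[cursors[t]] == docId) {
                    counts[t]++;
                }
            }
        }

        Map<String, Integer> result = new LinkedHashMap<>();
        for (int t = 0; t < types.length; t++) {
            result.put(types[t], counts[t]);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new LinkedHashMap<>();
        postings.put("coupons", new int[] {0, 1, 4, 8});
        postings.put("movies", new int[] {3, 5, 7, 9});
        int[] hits = {1, 3, 4, 7, 9}; // doc IDs gathered by the collector, sorted
        System.out.println(countByType(hits, postings)); // prints {coupons=2, movies=3}
    }
}
```

Because no cursor ever moves backwards, the total work is roughly the number of hits times the number of types, plus one pass over each postings list.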

Best
Erick

On 2/22/07, Phillip Rhodes <spamsucks@rhoderunner.com> wrote:
>
> I have a query that can return documents that represent different types of
> things (e.g. books, movies, coupons, etc)
>
> There is an "object_type" keyword on each document, so I can tell that a
> document is a coupon or a book etc...
>
> The problem is that I need to display a count of each item type that was
> found.
> For example, your search returned: 67 coupons, 54 movies, 28 books...
>
> While I can loop through each document and increment some sort of counter
> by document type, sometimes I have over 2000 documents, and this would
> mean that the query would be executed internally by lucene 20 times (for
> every 100 records).
>
> I am looking at the HitCollector, but since I would need to still get each
> and every document (to figure out if it's a coupon vs. a book), I am not
> sure if there would be any benefits.  Would using a HitCollector cause the
> query to be run only 1x vs. 20 for 2000 documents?  Would that be the only
> benefit?
>
> I would be interested in hearing what others think about this problem and
> how to best implement this functionality with lucene.
>
> Thank you.
> Phillip
