lucene-java-user mailing list archives

From "Kainth, Sachin" <>
Subject RE: Counting and Categorisation
Date Thu, 08 Feb 2007 13:55:26 GMT
Hi Erik, 

Thanks for the reply.  Since writing this I have in fact implemented
the BitSet version and it works quite successfully.  However, I have now
found out that we will be dealing with millions of records, and for
this reason we cannot use such a solution.  Can you tell me what Solr
is, as it seems to be better suited to my application?



-----Original Message-----
From: Erik Hatcher [] 
Sent: 08 February 2007 13:48
Subject: Re: Counting and Categorisation

On Feb 8, 2007, at 8:28 AM, Kainth, Sachin wrote:

> This email is meant for Chris Hostetter and of course anyone else who
> may know about this.
> I wonder if I can ask you a question.  I have been reading about how you
> at CNET have implemented categorisation and counting so that if I type
> "Kodak EasyShare" in the reviews section you not only get a big list
> of all documents about this but you also get a list of categories in
> which "Kodak EasyShare" appears.  So for example it will say that
> there are documents in the "Digital cameras" category which contain
> "Kodak EasyShare" and also documents in "Peripherals" with that same
> query.
> I'd like to do the same thing as this and I'm not sure I've fully
> understood the explanations I've read so far.  I know you have
> described using lots of bitsets to do this but I'm not too clear on
> the details.
> Let me explain what I want to do.  It is very simple.  I have a Lucene
> index containing just 3 fields (I mean field in the sense that you can
> use the fieldName:searchTerm query syntax to search for the value
> searchTerm in fieldName).  The fields are "artist", "track" and
> "album".
> What I want to do is, if the user searches for the text "love" in the
> track field, they get a list of all the artists who have a track with
> "love" in the title plus a list of all the albums with the word "love"
> in the title.  Along with these album and artist names I want a count
> of the number of songs in each category.  If the user clicks on one of
> these categories then that result subset is returned.
> At the moment I just return the full list of artists, albums and
> tracks, which I want as well.  What I've described above will be in a
> top bar which will allow the user to refine their search.
> What I'm asking then is for some specific information about how I can
> perform the categorisation and counts.

There are two ways to go about this:

   1) Use Solr.

   2) If the number of unique artists and albums is reasonable enough,
build BitSets in memory and keep them in a Map.  When someone searches
for "love" (and who doesn't?), get a BitSet of the matching documents
(using a HitCollector or QueryFilter) and intersect it with each of the
ones in the Map.  I use this same scheme on Collex at <http://> for all
the facets on the right, though I am working on phasing things into a
better fit with Solr and scalability.  I do leverage some of Solr's
goodies: my original implementation successfully used the
BitSet-Map-in-memory scheme, and now I keep TermQuerys in memory but
leverage Solr's DocSet caching instead of BitSets.  It's very fast!
The con of doing all this yourself is that when things get bigger you
gotta change how it works to scale, and Solr already has a lot of
infrastructure in place for this eventuality.
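
For readers following the thread, option 2 can be sketched roughly as below. This is only an illustration under stated assumptions, not the Collex or Solr code: it uses plain java.util.BitSet, the class name FacetCounts and the helpers countFacets and bits are made up for the example, and the per-artist BitSets are hard-coded where a real application would build them from the index (one bit per document number). The BitSet for the "love" query would likewise come from a HitCollector or QueryFilter.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class; none of these names come from Lucene or Solr.
final class FacetCounts {

    /**
     * For each facet value (e.g. an artist name), counts how many of the
     * matching documents also belong to that facet value. Values with a
     * count of zero are left out of the result.
     */
    static Map<String, Integer> countFacets(BitSet matches, Map<String, BitSet> facets) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> entry : facets.entrySet()) {
            // Work on a copy so the query's BitSet is not modified.
            BitSet both = (BitSet) matches.clone();
            both.and(entry.getValue());      // intersection of query and facet
            int n = both.cardinality();      // number of documents in both
            if (n > 0) {
                counts.put(entry.getKey(), n);
            }
        }
        return counts;
    }

    /** Builds a BitSet with the given document numbers set (illustration only). */
    static BitSet bits(int... docIds) {
        BitSet set = new BitSet();
        for (int id : docIds) {
            set.set(id);
        }
        return set;
    }

    public static void main(String[] args) {
        // Assumed data: which documents (tracks) belong to which artist.
        Map<String, BitSet> artists = new LinkedHashMap<>();
        artists.put("Artist A", bits(0, 1, 2));
        artists.put("Artist B", bits(3, 4));
        artists.put("Artist C", bits(5));

        // Assumed result of the query track:love, as document numbers.
        BitSet love = bits(1, 2, 4);

        System.out.println(countFacets(love, artists)); // {Artist A=2, Artist B=1}
    }
}
```

The same Map-and-intersect loop would be repeated for the album field. Each query costs one clone-and-AND per unique artist or album, and the Map holds one BitSet per unique value, which is why the approach only fits when that number is "reasonable enough".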


To unsubscribe, e-mail:
For additional commands, e-mail:
