lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <gsing...@apache.org>
Subject Re: cluster documents based on fields' values
Date Tue, 17 Aug 2010 11:56:57 GMT
Hi Nik,

Inline below.
On Aug 15, 2010, at 5:01 PM, Nik Kolev wrote:

> Hi,
> 
> I am researching the possibility of using Lucene for discovering
> clusters of documents and since I am new to Lucene I decided to
> ask the community for advice before I poke the APIs and the internals.
> Your input will be invaluable!
> 
> Here's the use case. Documents arrive from different feeds. Each feed
> produces millions and millions of documents. Documents are structured
> and share certain "interesting" fields. For the purpose of illustration
> here's a trivial example. Let's say the documents on each feed represent
> shoes. A shoe has an ID (uniquely identifying it within its feed) and a
> SIZE [1]. I want to be able to ask Lucene (assuming the shoe documents
> are indexed of course) for all of the clusters of shoes that sport the same
> size. A cluster of shoes is just the IDs of the shoes that got grouped
> together due to having the same value of the SIZE field.
> 
> I don't think that doing this brute force will perform. Here's what I
> mean when I say brute force. I looked at IndexReader, and I saw that
> I can get the distinct values for an indexed field. So assuming that
> each feed will have its own index (I don't think I can get away with
> a single index for all feeds [2]), I can get the union of all distinct
> values for the interesting field across all indexes (one per each
> document feed). Then for each distinct value I can do a MultiSearcher
> search across these indexes getting the IDs of the documents .
> 
> My gut tells me that the brute force approach won't perform [3]. And
> here's where you guys come in - is it possible to ask lucene to give
> me the groups of records (across indixes) that share the value for a
> given field? Is there something else in API that I can take advantage
> of and get what I need faster? If I am to extend Lucene to allow for
> this sort of thing, where should I start, what documents should I
> read,...?

This feature, called faceting -- amongst other things, comes out of the box in Solr (http://lucene.apache.org/solr)

> 
> Thanks very much again for your advise,
> -nik
> 
> [1] Shoe document layout:
>    SHOE
>      |--ID
>      |--SIZE
> 
> [2] The volume of documents I get from each feed is huge. Also, even
>    though a lot of fields are shared the format across feeds varies
>    and a feed may not have some of the "interesting" fields of
>    another feed.
> 
> [3] The dataset I am talking about is rich and the brute force approach
>    means that I need to be doing tens of millions of searches just to
>    group on one field. Also I most likely will blow my heap up if I try to
>    load all of the values in memory all at once.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message