lucene-dev mailing list archives

From: Grant Ingersoll
Subject: Re: Multi-field distinct query
Date: Wed, 16 May 2007 01:07:39 GMT
I suggest you ask on the user mailing list, as you are likely to get a
lot more interest from others. java-dev is for discussing the internals
of how Lucene works.


On May 15, 2007, at 7:13 PM, dontspamterry wrote:

> Hi all,
> I know this whole distinct query has been discussed a bunch of times for
> various scenarios because I've been scouring the forums trying to find a
> clue as to how I could solve my problem. I'm indexing a large set of
> parent-child term relations (~1 million). The number of unique terms is
> about 570,000. Each relation is a document. Each term in a relation
> contains all of the term's attributes. Effectively, a term's attributes will
> be duplicated "x" number of times for the "x" number of relations it
> participates in. For example, say I have the following term tree:
> A
> |--B
>     |--E
>         |--H
>     |--F
> |--C
>     |--G
> |--D
> I would then have documents for:
> A->B, A->C, A->D, B->E, (and so forth...)
> For all relations involving A, A's attributes will be duplicated in 3
> separate documents.
> For all relations involving B, B's attributes will be duplicated in 3
> separate documents.
> (you get the picture...)
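
To make the duplication concrete, here is a minimal sketch in plain Java (no Lucene calls) that flattens the example tree into one relation per edge and counts how many relation documents would carry a copy of each term's attributes. The class and method names (`RelationDocs`, `relations`, `copiesPerTerm`, `exampleTree`) are hypothetical and only illustrate the structure described above; they are not part of the original poster's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RelationDocs {
    /** One parent->child relation; in the index described above, each would be one document. */
    static final class Relation {
        final String parent;
        final String child;
        Relation(String parent, String child) { this.parent = parent; this.child = child; }
        @Override public String toString() { return parent + "->" + child; }
    }

    /** Flattens a parent -> children map into one Relation per edge. */
    static List<Relation> relations(Map<String, List<String>> tree) {
        List<Relation> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : tree.entrySet()) {
            for (String child : e.getValue()) {
                out.add(new Relation(e.getKey(), child));
            }
        }
        return out;
    }

    /** Counts how many relation documents hold a copy of each term's attributes. */
    static Map<String, Integer> copiesPerTerm(List<Relation> rels) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Relation r : rels) {
            counts.merge(r.parent, 1, Integer::sum);
            counts.merge(r.child, 1, Integer::sum);
        }
        return counts;
    }

    /** The example tree from the message: A has B, C, D; B has E, F; C has G; E has H. */
    static Map<String, List<String>> exampleTree() {
        Map<String, List<String>> tree = new LinkedHashMap<>();
        tree.put("A", Arrays.asList("B", "C", "D"));
        tree.put("B", Arrays.asList("E", "F"));
        tree.put("C", Arrays.asList("G"));
        tree.put("E", Arrays.asList("H"));
        return tree;
    }

    public static void main(String[] args) {
        List<Relation> rels = relations(exampleTree());
        System.out.println(rels);                 // the 7 relation documents
        System.out.println(copiesPerTerm(rels));  // A and B each appear in 3 documents
    }
}
```
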
> This index structure works great for queries which traverse up and down the
> tree. However, I have a requirement where I would also like to do a distinct
> query which returns the data for each unique term satisfying the query. For
> example, say I have a query which returns all relations where A or B is the
> parent (that would be 5 documents in total), but do a distinct on the parent
> such that I get 2 documents back, one for A as the parent (any 1 of the 3
> matching docs) and the other where B is the parent (any 1 of the 2 matching
> docs). For this query, I don't care about the child information since I'm
> only interested in retrieving the distinct parent terms. This query is
> analogous to a 'select distinct <set of parent term attributes>'.
>
> I played around with caching BitSets for the fields which I'd like to do a
> distinct on, but given the amount of data, I run out of memory. I also took
> the approach where I retrieve the bitset using a queryfilter and then process
> each document id, hashing the field values on which I'm doing a distinct to
> construct my distinct set. Problem with this is that I have tree structures
> where a parent has over 100K children. Retrieving each doc for this size is
> too time- and memory-consuming. Since I don't really want to return that much
> data, I thought that I could use paging. The problem I faced is that I do not
> know if a distinct value in the current query was actually returned in some
> previous query for a previous page.
>
> Sorry for the long description, but wanted to make sure I explained it as
> clearly as I could.
> -Terry
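
One possible way around the paging problem described above, sketched here in plain Java rather than against the Lucene API: if the matching documents can be visited in order of the field being made distinct, duplicates become adjacent and each page only needs to remember the last key it returned, not the full set of values seen so far. The class `DistinctPaging`, the method `nextPage`, and the assumption that hits arrive sorted by parent key are all illustrative; nothing in the thread confirms that this is how the index is queried.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DistinctPaging {
    /**
     * Returns up to pageSize distinct keys that sort strictly after afterKey
     * (pass null for the first page). ASSUMPTION: sortedParentKeys holds one
     * entry per matching relation document, already sorted ascending, so that
     * duplicate parents are adjacent.
     */
    static List<String> nextPage(List<String> sortedParentKeys, String afterKey, int pageSize) {
        List<String> page = new ArrayList<>();
        String last = afterKey;
        // This sketch scans from the start; a real implementation would seek to afterKey.
        for (String key : sortedParentKeys) {
            if (last != null && key.compareTo(last) <= 0) {
                continue; // duplicate of the previous key, or already returned on an earlier page
            }
            page.add(key);
            last = key;
            if (page.size() == pageSize) {
                break;
            }
        }
        return page;
    }

    public static void main(String[] args) {
        // Parent keys of the 5 documents matching "parent is A or B", in sorted order.
        List<String> hits = Arrays.asList("A", "A", "A", "B", "B");
        List<String> first = nextPage(hits, null, 1);
        List<String> second = nextPage(hits, first.get(first.size() - 1), 1);
        System.out.println(first + " " + second); // prints "[A] [B]"
    }
}
```

The design choice is that the only state carried between pages is the last key returned, so memory use does not grow with the number of children per parent. The cost is that the hits must be produced in key order, which may or may not be cheap for a given index.
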

Grant Ingersoll
Center for Natural Language Processing
