lucene-dev mailing list archives

From dontspamterry
Subject Re: Multi-field distinct query
Date Wed, 16 May 2007 16:31:53 GMT

Hi Steve,

We originally had documents which were term-centric, i.e. what you
described: one document for the parent and all of its children. We changed
the model so that each document is a single parent-child relation, partly
due to requirements and partly because we were having memory issues in
cases where a parent had an extremely large number of children (~200,000).
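
For illustration, here is a minimal Python sketch of the two layouts, using the example tree from the quoted message below. The plain dicts are hypothetical stand-ins for index documents (this is not Lucene API), and the field names "parent", "child" and "children" are assumptions made for the sketch.

```python
# Example tree from the quoted message: parent -> list of children.
TREE = {"A": ["B", "C", "D"], "B": ["E", "F"], "C": ["G"], "E": ["H"]}


def parent_centric_docs(tree):
    """Earlier layout: one document per parent, holding all of its children."""
    return [{"parent": p, "children": list(cs)} for p, cs in tree.items()]


def relation_docs(tree):
    """Current layout: one document per parent-child relation."""
    return [{"parent": p, "child": c} for p, cs in tree.items() for c in cs]


parent_docs = parent_centric_docs(TREE)
rel_docs = relation_docs(TREE)

# Relations whose parent is A or B (the query discussed in the quoted message).
a_or_b = [d for d in rel_docs if d["parent"] in ("A", "B")]
```

In the first layout a parent with ~200,000 children becomes one very large document; in the second it becomes ~200,000 small ones, which is why a distinct on the parent is then needed.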


Steven Rowe wrote:
> Hi Terry,
> Why not have another index in which a document has one field for the
> parent and another field containing all of its children?  An OR query
> over the "children" field would return exactly what you want - one
> document for each distinct parent.
> Steve
> dontspamterry wrote:
>> Hi all,
>> I know distinct queries have been discussed a bunch of times for various
>> scenarios, because I've been scouring the forums for a clue as to how I
>> could solve my problem. I'm indexing a large set of parent-child term
>> relations (~1 million). The number of unique terms is ~570,000. Each
>> relation is a document. Each term in a relation carries all of the term's
>> attributes. Effectively, a term's attributes will be duplicated "x" times
>> for the "x" relations it participates in. For example, say I have the
>> following term tree:
>> A
>> |--B
>>     |--E
>>         |--H
>>     |--F
>> |--C
>>     |--G
>> |--D
>> I would then have documents for:
>> A->B, A->C, A->D, B->E, (and so forth...)
>> For all relations involving A, A's attributes will be duplicated in 3
>> separate documents.
>> For all relations involving B, B's attributes will be duplicated in 3
>> separate documents.
>> (you get the picture...)
>> This index structure works great for queries which traverse up and down
>> the tree. However, I also have a requirement to do a distinct query which
>> returns the data for each unique term satisfying the query. For example,
>> say I have a query which returns all relations where A or B is the parent
>> (5 documents in total), but I do a distinct on the parent so that I get 2
>> documents back: one where A is the parent (any 1 of the 3 matching docs)
>> and one where B is the parent (any 1 of the 2 matching docs). For this
>> query, I don't care about the child information since I'm only interested
>> in retrieving the distinct parent terms. This query is analogous to a
>> 'select distinct <set of parent term attributes>'.
>> I played around with caching BitSets for the fields which I'd like to do
>> a distinct on, but given the amount of data, I run out of memory. I also
>> tried retrieving the BitSet using a QueryFilter and then processing each
>> document id, hashing the field values on which I'm doing a distinct to
>> construct my distinct set. The problem with this is that I have tree
>> structures where a parent has over 100K children. Retrieving each doc for
>> a set this size is too time- and memory-consuming. Since I don't really
>> want to return that much data, I thought that I could use paging. The
>> problem I faced is that I do not know whether a distinct value in the
>> current query was already returned in some previous query for a previous
>> page.
>> Sorry for the long description, but I wanted to make sure I explained it
>> as clearly as I could.
>> -Terry
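
The hashing approach and the paging problem described above can be sketched in Python as follows. This is only an illustration under stated assumptions: plain dicts stand in for index documents (not Lucene API), and `distinct_page` assumes the hits can be delivered sorted by the distinct field, which the thread does not establish. With that ordering, a page only needs the last value returned by the previous page instead of a record of every value seen so far.

```python
from itertools import groupby

# Relation documents for the example tree (hypothetical stand-ins for index docs).
DOCS = [
    {"parent": "A", "child": "B"},
    {"parent": "A", "child": "C"},
    {"parent": "A", "child": "D"},
    {"parent": "B", "child": "E"},
    {"parent": "B", "child": "F"},
    {"parent": "C", "child": "G"},
    {"parent": "E", "child": "H"},
]


def distinct_by_hashing(hits, field="parent"):
    """Keep the first hit for each distinct field value (visits every hit)."""
    first_seen = {}
    for doc in hits:
        first_seen.setdefault(doc[field], doc)
    return list(first_seen.values())


def distinct_page(sorted_hits, page_size, after=None, field="parent"):
    """Return up to page_size distinct values that sort after `after`.

    Assumes sorted_hits is ordered by `field`, so equal values are adjacent
    and nothing from earlier pages has to be remembered except `after`.
    """
    page = []
    for value, _group in groupby(sorted_hits, key=lambda d: d[field]):
        if after is not None and value <= after:
            continue  # already returned on an earlier page
        page.append(value)
        if len(page) == page_size:
            break
    return page


hits = [d for d in DOCS if d["parent"] in ("A", "B")]
distinct_hits = distinct_by_hashing(hits)

sorted_docs = sorted(DOCS, key=lambda d: d["parent"])
page1 = distinct_page(sorted_docs, page_size=2)
page2 = distinct_page(sorted_docs, page_size=2, after=page1[-1])
```

Note that this sketch still walks the sorted hits from the start for each page; it removes the need to track previously returned values, not the cost of scanning.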
