Hi Terry,
Why not have another index in which each document has one field for the
parent and another field containing all of its children? An OR query
over the "children" field would return exactly what you want: one
document for each distinct parent.
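
Something along these lines, as a rough Python sketch of the idea rather than real index API code. The helper names (build_parent_index, search_children) and the dict-based "documents" are made up for illustration, and the relations are taken from your example tree:

```python
from collections import defaultdict

# One (parent, child) pair per relation document, from the example tree.
RELATIONS = [
    ("A", "B"), ("A", "C"), ("A", "D"),
    ("B", "E"), ("B", "F"),
    ("C", "G"), ("E", "H"),
]


def build_parent_index(relations):
    """Build the second index: exactly one 'document' per distinct parent."""
    children_of = defaultdict(set)
    for parent, child in relations:
        children_of[parent].add(child)
    # Each document has a "parent" field and a multi-valued "children" field.
    return [
        {"parent": parent, "children": sorted(children)}
        for parent, children in sorted(children_of.items())
    ]


def search_children(parent_index, wanted_children):
    """OR query over "children": a doc matches if it holds any wanted child."""
    wanted = set(wanted_children)
    return [doc for doc in parent_index if wanted.intersection(doc["children"])]


parent_index = build_parent_index(RELATIONS)
hits = search_children(parent_index, ["C", "E", "F"])
distinct_parents = [doc["parent"] for doc in hits]
```

Because each parent appears in only one document, the hits are already distinct, so paging over them needs no memory of earlier pages. The parent's attributes would also be stored once per parent instead of once per relation.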
Steve
dontspamterry wrote:
> Hi all,
>
> I know this whole distinct query has been discussed a bunch of times for
> various scenarios because I've been scouring the forums trying to find a
> clue as to how I could solve my problem. I'm indexing a large set of
> parent-child term relations (~1 million). The number of unique terms is
> about 570,000. Each relation is a document. Each term in a relation
> contains all of the term's attributes. Effectively, a term's attributes will
> be duplicated "x" number of times for the "x" number of relations it
> participates in. For example, say I have the following term tree:
>
> A
>   B
>     E
>       H
>     F
>   C
>     G
>   D
>
> I would then have documents for:
> A->B, A->C, A->D, B->E, (and so forth...)
>
> For all relations involving A, A's attributes will be duplicated in 3
> separate documents.
> For all relations involving B, B's attributes will be duplicated in 3
> separate documents.
> (you get the picture...)
>
> This index structure works great for queries which traverse up and down the
> tree. However, I have a requirement where I would also like to do a distinct
> query which returns the data for each unique term satisfying the query. For
> example, say I have a query which returns all relations where A or B is the
> parent (that would be 5 documents in total). I would like to apply a
> distinct on the parent so that I get 2 documents back, one for A
> as the parent (any 1 of the 3 matching docs) and the other where B is the
> parent (any 1 of the 2 matching docs). For this query, I don't care about
> the child information since I'm only interested in retrieving the distinct
> parent terms. This query is analogous to a 'select distinct <set of parent
> term attributes>'. I played around with caching BitSets for the fields
> which I'd like to do a distinct on, but given the amount of data, I run out
> of memory. I also took the approach where I retrieve the bitset using a
> queryfilter and then process each document id, hashing the field values on
> which I'm doing a distinct to construct my distinct set. The problem with this
> is that I have tree structures where a parent has over 100K children.
> Retrieving every doc for a set that size takes too much time and memory. Since
> I don't really want to return that much data, I thought that I could use
> paging. The problem I faced is that I do not know if a distinct value in the
> current query was actually returned in some previous query for a previous
> page.
>
> Sorry for the long description, but wanted to make sure I explained it as
> clearly as I could.
>
> Terry

