lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From dontspamterry <dontspamte...@yahoo.com>
Subject Multi-field distinct query
Date Wed, 16 May 2007 16:23:04 GMT

Hi all,

I know this whole distinct query has been discussed a bunch of times for
various scenarios because I've been scouring the forums trying to find a
clue as to how I could solve my problem. I'm indexing a large set of
parent-child term relations (~1 million). The number of unique terms is
about ~570,000. Each relation is a document. Each term in a relation
contains all of the term's attributes. Effectively, a term's attributes will
be duplicated "x" number of times for the "x" number of relations it
participates in. For example, say I have the following term tree:

A
|--B
    |--E
        |--H
    |--F
|--C
    |--G
|--D

I would then have documents for:
A->B, A->C, A->D, B->E, (and so forth...)

For all relations involving A, A's attributes will be duplicated in 3
separate documents.
For all relations involving B, B's attributes will be duplicated in 3
separate documents.
(you get the picture...)

This index structure works great for queries which traverse up and down the
tree. However, I have a requirement where I would also like to do a distinct
query which returns the data for each unique term satisfying the query. For
example, say I have a query which returns all relations where A or B is the
parent (that would be 5 documents in total),
but do a distinct on the parent such that I get 2 documents back, one for A
as the parent (any 1 of the 3 matching docs)  and the other where B is the
parent (any 1 of the 2 matching docs). For this query, I don't care about
the child information since I'm only interested in retrieving the distinct
parent terms. This query is analogous to a 'select distinct <set of parent
term attributes>' . I played around with caching BitSets for the fields
which I'd like to do a distinct on, but given the amount of data, I run out
of memory. I also took the approach where I retrieve the bitset using a
queryfilter and then process each document id, hashing the field values on
which I'm doing a distinct to construct my distinct set. Problem with this
is that I have tree structures where a parent has over 100K children.
Retrieving each doc for this size is too time- and memory- consuming. Since
I don't really want to return that much data, I thought that I could use
paging. The problem I faced is that I do not know if a distinct value in the
current query was actually returned in some previous query for a previous
page.

Sorry for the long description, but wanted to make sure I explained it as
clearly as I could.

-Terry
-- 
View this message in context: http://www.nabble.com/Multi-field-distinct-query-tf3765636.html#a10645247
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message