lucene-dev mailing list archives

From dontspamterry <dontspamte...@yahoo.com>
Subject Re: Multi-field distinct query
Date Wed, 16 May 2007 16:32:37 GMT

Thanks for the tip. I've re-posted on Lucene-Java Users.

-Terry


Grant Ingersoll-6 wrote:
> 
> I suggest you ask on the user mailing list (java-user@lucene.apache.org),
> as you are likely to get a lot more interest from others. java-dev is for
> discussing the internals of how Lucene works.
> 
> 
> 
> Thanks,
> Grant
> 
> On May 15, 2007, at 7:13 PM, dontspamterry wrote:
> 
>>
>> Hi all,
>>
>> I know distinct queries have been discussed a number of times for various
>> scenarios, because I've been scouring the forums for a clue as to how I
>> could solve my problem. I'm indexing a large set of parent-child term
>> relations (~1 million). The number of unique terms is about 570,000. Each
>> relation is a document, and each term in a relation carries all of that
>> term's attributes. Effectively, a term's attributes are duplicated x times
>> for the x relations it participates in. For example, say I have the
>> following term tree:
>>
>> A
>> |--B
>>     |--E
>>         |--H
>>     |--F
>> |--C
>>     |--G
>> |--D
>>
>> I would then have documents for:
>> A->B, A->C, A->D, B->E, (and so forth...)
>>
>> For all relations involving A, A's attributes will be duplicated in 3
>> separate documents.
>> For all relations involving B, B's attributes will be duplicated in 3
>> separate documents.
>> (you get the picture...)
>>
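
A minimal sketch of the layout described above, in plain Java with no Lucene dependency. It reads the example tree as A having children B, C and D, B having E and F, C having G, and E having H, which is the reading consistent with the document counts quoted in the message. The class and method names (`RelationFlattener`, `Relation`, `flatten`, `copiesPerTerm`) are hypothetical; in the real index each `Relation` would be a Lucene document carrying both terms' attributes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RelationFlattener {

    // Hypothetical stand-in for one indexed document: a single parent->child relation.
    static final class Relation {
        final String parent;
        final String child;

        Relation(String parent, String child) {
            this.parent = parent;
            this.child = child;
        }

        @Override
        public String toString() {
            return parent + "->" + child;
        }
    }

    // The example tree from the message, stored as parent -> direct children.
    static Map<String, List<String>> exampleTree() {
        Map<String, List<String>> tree = new LinkedHashMap<>();
        tree.put("A", Arrays.asList("B", "C", "D"));
        tree.put("B", Arrays.asList("E", "F"));
        tree.put("C", Arrays.asList("G"));
        tree.put("E", Arrays.asList("H"));
        return tree;
    }

    // One Relation (that is, one document) per parent-child edge.
    static List<Relation> flatten(Map<String, List<String>> tree) {
        List<Relation> relations = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : tree.entrySet()) {
            for (String child : entry.getValue()) {
                relations.add(new Relation(entry.getKey(), child));
            }
        }
        return relations;
    }

    // Number of documents that carry a copy of each term's attributes,
    // counting the term on either side of a relation.
    static Map<String, Integer> copiesPerTerm(List<Relation> relations) {
        Map<String, Integer> copies = new TreeMap<>();
        for (Relation r : relations) {
            copies.merge(r.parent, 1, Integer::sum);
            copies.merge(r.child, 1, Integer::sum);
        }
        return copies;
    }

    public static void main(String[] args) {
        List<Relation> relations = flatten(exampleTree());
        System.out.println(relations);
        System.out.println(copiesPerTerm(relations));
    }
}
```

For the example tree this produces seven relations, with A and B each appearing in 3 of them, matching the duplication counts given above.
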
>> This index structure works great for queries which traverse up and down the
>> tree. However, I also have a requirement to do a distinct query which
>> returns the data for each unique term satisfying the query. For example,
>> say I have a query which returns all relations where A or B is the parent
>> (that would be 5 documents in total), but I do a distinct on the parent such
>> that I get 2 documents back: one where A is the parent (any 1 of the 3
>> matching docs) and one where B is the parent (any 1 of the 2 matching docs).
>> For this query, I don't care about the child information since I'm only
>> interested in retrieving the distinct parent terms. This query is analogous
>> to a 'select distinct <set of parent term attributes>'.
>>
>> I played around with caching BitSets for the fields on which I'd like to do
>> a distinct, but given the amount of data, I run out of memory. I also tried
>> retrieving the BitSet using a QueryFilter and then processing each document
>> id, hashing the field values on which I'm doing a distinct to construct my
>> distinct set. The problem with this is that I have tree structures where a
>> parent has over 100K children. Retrieving each doc for a result of this size
>> is too time- and memory-consuming. Since I don't really want to return that
>> much data, I thought that I could use paging. The problem I faced is that I
>> do not know whether a distinct value in the current query was already
>> returned by some earlier query for a previous page.
>>
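
To illustrate the paging problem, here is a rough sketch in plain Java (no Lucene calls) of one way to make pages of distinct values deterministic: scan the matches in a fixed order, assign each distinct key to the page of its first occurrence, and stop as soon as the requested page is full. `DistinctPager` and `distinctPage` are hypothetical names, and each match is assumed to be already reduced to the list of field values being made distinct. This is only a sketch: it rescans from the start for every page, so it bounds the size of the seen-set (at most `(page + 1) * pageSize` keys) but not the cost of reading field values from the matching documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DistinctPager {

    /**
     * Returns the given 0-based page of distinct keys from matches, in order
     * of first occurrence. Each key is the list of field values to make
     * distinct (one entry per field).
     */
    static List<List<String>> distinctPage(Iterable<List<String>> matches,
                                           int page, int pageSize) {
        if (page < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("page must be >= 0 and pageSize > 0");
        }
        int offset = page * pageSize;          // distinct keys owned by earlier pages
        Set<List<String>> seen = new HashSet<>();
        List<List<String>> result = new ArrayList<>();
        for (List<String> key : matches) {
            if (!seen.add(key)) {
                continue;                      // duplicate of a key already counted
            }
            if (seen.size() > offset) {        // first occurrence falls on this page
                result.add(key);
                if (result.size() == pageSize) {
                    break;                     // page is full, stop scanning
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Parent field of the five relations where A or B is the parent.
        List<List<String>> matches = Arrays.asList(
                Arrays.asList("A"),   // A->B
                Arrays.asList("A"),   // A->C
                Arrays.asList("A"),   // A->D
                Arrays.asList("B"),   // B->E
                Arrays.asList("B"));  // B->F
        System.out.println(distinctPage(matches, 0, 1)); // [[A]]
        System.out.println(distinctPage(matches, 1, 1)); // [[B]]
        System.out.println(distinctPage(matches, 2, 1)); // []
    }
}
```

Because a key belongs to the page of its first occurrence, the same value cannot show up on two pages as long as the scan order stays the same between requests.
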
>> Sorry for the long description, but I wanted to make sure I explained it as
>> clearly as I could.
>>
>> -Terry
>> -- 
>> View this message in context: http://www.nabble.com/Multi-field-distinct-query-tf3761682.html#a10633050
>> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
> 
> --------------------------
> Grant Ingersoll
> Center for Natural Language Processing
> http://www.cnlp.org/tech/lucene.asp
> 
> Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
> 
> 
> 
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Multi-field-distinct-query-tf3761682.html#a10645430
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.



