lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Steven Rowe <sar...@syr.edu>
Subject Re: Multi-field distinct query
Date Wed, 16 May 2007 18:15:04 GMT
[Moving the discussion from java-dav to java-user]

Hi Terry,

The index I suggested would be in addition to the one you now have, so
the requirements you mentioned would not be affected (unless of course
you have a single-index requirement :) ).

Given the narrow use for this additional index, you might be able to
address your memory issues either by not storing the "children" field
terms, or by using Lazy Field Loading - see the api docs for
IndexReader.document(int, FieldSelector):

<http://lucene.apache.org/java/2_1_0/api/org/apache/lucene/index/IndexReader.html#document(int,%20org.apache.lucene.document.FieldSelector)>

Grant Ingersoll's "Advanced Lucene" talk at ApacheCon Europe earlier
this month mentioned Lazy Field Loading - see the second to last slide
in the slide presentation, available on the Lucene wiki Resources page,
under Presentations:

<http://wiki.apache.org/lucene-java/Resources#head-538bbb76157074df20f3a9a30195c74d68772f4b>

Steve

dontspamterry wrote:
> Hi Steve,
> 
> We originally had documents which were term-centric, i.e. what you described
> - document for the parent and all of its children. We changed it to model a
> single, parent-child relation as one document due to requirements and the
> fact that we were having memory issues for cases where a parent had an
> extremely large number of children (~200,000).
> 
> -Terry
> 
> 
> Steven Rowe wrote:
>> Hi Terry,
>>
>> Why not have another index in which a document has one field for the
>> parent and another field containing all of its children.  An OR query
>> over the "children" field would return you exactly what you want - one
>> document for each distinct parent.
>>
>> Steve
>>
>> dontspamterry wrote:
>>> Hi all,
>>>
>>> I know this whole distinct query has been discussed a bunch of times for
>>> various scenarios because I've been scouring the forums trying to find a
>>> clue as to how I could solve my problem. I'm indexing a large set of
>>> parent-child term relations (~1 million). The number of unique terms is
>>> about ~570,000. Each relation is a document. Each term in a relation
>>> contains all of the term's attributes. Effectively, a term's attributes
>>> will
>>> be duplicated "x" number of times for the "x" number of relations it
>>> participates in. For example, say I have the following term tree:
>>>
>>> A
>>> |--B
>>>     |--E
>>>         |--H
>>>     |--F
>>> |--C
>>>     |--G
>>> |--D
>>>
>>> I would then have documents for:
>>> A->B, A->C, A->D, B->E, (and so forth...)
>>>
>>> For all relations involving A, A's attributes will be duplicated in 3
>>> separate documents.
>>> For all relations involving B, B's attributes will be duplicated in 3
>>> separate documents.
>>> (you get the picture...)
>>>
>>> This index structure works great for queries which traverse up and down
>>> the
>>> tree. However, I have a requirement where I would also like to do a
>>> distinct
>>> query which returns the data for each unique term satisfying the query.
>>> For
>>> example, say I have a query which returns all relations where A or B is
>>> the
>>> parent (that would be 5 documents in total),
>>> but do a distinct on the parent such that I get 2 documents back, one for
>>> A
>>> as the parent (any 1 of the 3 matching docs)  and the other where B is
>>> the
>>> parent (any 1 of the 2 matching docs). For this query, I don't care about
>>> the child information since I'm only interested in retrieving the
>>> distinct
>>> parent terms. This query is analogous to a 'select distinct <set of
>>> parent
>>> term attributes>' . I played around with caching BitSets for the fields
>>> which I'd like to do a distinct on, but given the amount of data, I run
>>> out
>>> of memory. I also took the approach where I retrieve the bitset using a
>>> queryfilter and then process each document id, hashing the field values
>>> on
>>> which I'm doing a distinct to construct my distinct set. Problem with
>>> this
>>> is that I have tree structures where a parent has over 100K children.
>>> Retrieving each doc for this size is too time- and memory- consuming.
>>> Since
>>> I don't really want to return that much data, I thought that I could use
>>> paging. The problem I faced is that I do not know if a distinct value in
>>> the
>>> current query was actually returned in some previous query for a previous
>>> page.
>>>
>>> Sorry for the long description, but wanted to make sure I explained it as
>>> clearly as I could.
>>>
>>> -Terry


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message