lucene-java-user mailing list archives

From Brandon Mintern <mint...@easyesi.com>
Subject Re: Split mutable logical document into two Lucene documents
Date Thu, 08 Dec 2011 22:03:29 GMT
Thank you for the pointer. I looked into nested documents, but it
appears that the implementation relies on each parent document being
indexed immediately before all of its children. Unfortunately, this
presents two problems:

1. Any optimize operation will break nesting.
2. Deleting and reindexing a child would break the parent-child
hierarchy unless the parent were reindexed as well. Since reindexing
the parent is exactly what we're trying to avoid in the first place,
this doesn't get us where we need to be.

We also looked at ParallelReader, but that requires that each
immutable/mutable pair be added at exactly the same position in
separate indexes. This is very brittle for our use: changing a single
value would require either rebuilding the entire mutable index or
reindexing both the mutable and immutable information. Neither option
is better than just keeping the mutable and immutable data together.

There are probably some things we could do with filters, but I think
it will be easier and more flexible for us to have simple Lucene
queries return a sorted list of document IDs (our full document
identifier) and then perform set-union, set-intersection, and
set-inversion ourselves.
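
As a rough illustration of that last idea, here is a minimal sketch of
the three set operations over sorted ID lists. It assumes each simple
query has already produced a sorted, duplicate-free list of logical
document IDs; the class name SortedIdSets, the method names, and the
"universe" list of all known IDs used for inversion are hypothetical
and not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: set operations over sorted, duplicate-free ID lists. */
final class SortedIdSets {
    private SortedIdSets() {}

    /** Returns the IDs present in both a and b (set-intersection). */
    static <T extends Comparable<? super T>> List<T> intersect(List<T> a, List<T> b) {
        List<T> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int c = a.get(i).compareTo(b.get(j));
            if (c == 0) {
                out.add(a.get(i));
                i++;
                j++;
            } else if (c < 0) {
                i++; // a is behind, advance it
            } else {
                j++; // b is behind, advance it
            }
        }
        return out;
    }

    /** Returns the IDs present in a or b, still sorted (set-union). */
    static <T extends Comparable<? super T>> List<T> union(List<T> a, List<T> b) {
        List<T> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int c = a.get(i).compareTo(b.get(j));
            if (c == 0) {
                out.add(a.get(i)); // emit shared IDs only once
                i++;
                j++;
            } else if (c < 0) {
                out.add(a.get(i++));
            } else {
                out.add(b.get(j++));
            }
        }
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }

    /** Returns the IDs in universe that are not in a (set-inversion). */
    static <T extends Comparable<? super T>> List<T> invert(List<T> universe, List<T> a) {
        List<T> out = new ArrayList<>();
        int j = 0;
        for (T id : universe) {
            // Skip entries of a that sort before the current ID.
            while (j < a.size() && a.get(j).compareTo(id) < 0) j++;
            if (j == a.size() || a.get(j).compareTo(id) != 0) out.add(id);
        }
        return out;
    }
}
```

Under these assumptions, a query that needs MUST clauses on both the
immutable and the mutable fields would become intersect(immutableHits,
mutableHits), where each argument is the sorted ID list returned by one
simple Lucene query. Each operation is a single linear pass over its
inputs.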

Thanks for your time,
Brandon

On Thu, Dec 8, 2011 at 9:57 AM, Ian Lea <ian.lea@gmail.com> wrote:
> It is conceivable that nested documents might help.
> https://issues.apache.org/jira/browse/LUCENE-2454.  I don't know
> anything about that so might be way off target.
>
>
> --
> Ian.
>
>
> On Wed, Dec 7, 2011 at 8:46 PM, Brandon Mintern <mintern@easyesi.com> wrote:
>> We have a document tagging system where documents are composed of two
>> types of data:
>>
>> Rarely changed (hereafter: "immutable") data - document text and
>> metadata that we upload and almost never change. The text can be
>> hundreds of pages.
>>
>> User created (hereafter: "mutable") data - document properties that
>> are set by users of our system. In total a document's properties are
>> generally several dozen bytes at most. Even viewing a document changes
>> the data (e.g. the document's "viewed" property).
>>
>>
>> At present, all data is part of a single Lucene document. The problem
>> is that when any piece of mutable data is updated (this happens
>> relatively frequently), we have to reindex the entire document. We'd
>> like to have two separate indexed Lucene documents per logical
>> document, one containing the immutable data and the other containing
>> the much smaller and more transient mutable data. When the mutable
>> data changes, we can delete that document's mutable Lucene document
>> and index a new one very quickly.
>>
>> There are two major difficulties when actually performing a search, though:
>>
>> 1. We are providing complex queries to retrieve logical documents
>> based on information in either of their Lucene documents. It seems
>> non-trivial to fetch a logical document in a BooleanQuery with
>> Occur.MUST clauses referring to fields in both of its Lucene
>> documents.
>>
>> 2. We need to sort results (logical document IDs) based on fields in
>> either of its Lucene documents.
>>
>> Has anyone done anything like this before? Is there functionality I'm
>> overlooking that could make this easier?
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>