lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Babak Farhang <farh...@gmail.com>
Subject Re: Incremental Field Updates
Date Sun, 09 May 2010 06:39:37 GMT
Shai,

I think this is an interesting approach. I can see how I could
[incrementally] update a stored, indexed field this way, but I don't
understand how I could update an indexed-only field. Here's why: for a
stored (and indexed) field, I can always determine what terms to
remove ('-') from the postings, but for an indexed-only field I'd have
no [practical] way to know..

So under this approach,  I'm thinking at a user level, a Lucene field
would be updateable only if it's at least stored.

Is that right?

-Babak

On Wed, May 5, 2010 at 11:17 AM, Shai Erera <serera@gmail.com> wrote:
> Yes Mike - I don't know yet if two MPs will be used, one for the stacked
> segments and one for the general segments (which will include the stacked
> ones as well) .. feels like one MP should be enough, but this can be decided
> on as we progress.
>
> This approach allows you to update every term in every already indexed
> field, as well as add terms to already indexed fields ... and add totally
> new fields, with lots of text in them. So e.g. there are two neat use cases
> that come to mind:
> 1) Allow users to rate search results, favor them etc.
> 2) Or even comment them,
> I think Google offers the 2nd. Both translate into updating the search
> result's already indexed document w/ the new rating, comment etc. w/o
> needing to reindex the document.
>
> I will try to find perf results numbers. It's been long time ago, hope all
> the documents are still where they were :). Indeed, for terms like ACLs, it
> means that each query had to merge dozens of postings lists. For that I
> implemented an alternative solution, which uses a payload-like structure
> that registers for each document the list of ACLs that are associated with
> it (not as strings, it was more efficient). Then, if the query included
> dozens of such terms, I created a filter-like object which for every
> matching document by the query checked if it matches the ACLs list of the
> document. This is usually slower, because the ACLs themselves don't drive
> the query, which means more matches will be found. That was a tradeoff which
> one could configure based on the number of such terms in the query, the
> number of updated terms etc.
>
> But in essence you're right - if the solution is generic to cover any term,
> we cannot use such payload-based feature. We might need to merge the stacked
> segments more frequently. This is pending perf testing though.
>
> Shai
>
> On Wed, May 5, 2010 at 6:54 PM, Michael McCandless
> <lucene@mikemccandless.com> wrote:
>>
>> Catching up here :)
>>
>> This is great stuff Shai -- I like the notion of "negative" postings,
>> that "subtract" docs from previous generations as you iterate them.
>>
>> And I like the term "stacked segments".  This fits very well with
>> Lucene's write-once approach, ie, a writer can at any time stack a new
>> segment (writes to new files) "over" an old segment, like the layers
>> in photoshop.  A reader merges all stacks on-the-fly when reading.
>>
>> And the merge policy now picks from 2 dimensions right?  Ie it may
>> want to simply consolidate stacks on an old segment but otherwise not
>> merge that segment (eg for very large segments that have accumulated
>> updates), and normal merging would of course consolidate all stacks
>> for all segments merged.
>>
>> Wouldn't this approach conceivably allow you to alter single terms
>> within a single field (we'd have to figure out how we'd expose the API
>> for this)?  EG if I appended some text to an already-indexed field?
>> (In addition to adding a new field to an already indexed doc, or
>> replacing an indexed field on a previously indexed doc).
>>
>> Did you have any hard perf numbers?  Merge sorting N streams is
>> surprisingly costly... we may need/want to have a reader pre-merge
>> (using up RAM) any "long tail" of stacked segments as long as they are
>> small enough...
>>
>> This sounds great!!
>>
>> Mike
>>
>> On Sun, Apr 25, 2010 at 7:32 AM, Shai Erera <serera@gmail.com> wrote:
>> > Hi,
>> >
>> > WARNING: following email is a bit long, but I think is worth the reading
>> > :)
>> >
>> > I would like to describe my implementation of incremental field updates
>> > in Juru (the former search library I've worked on in IBM). I will later
>> > discuss how I think it can be implemented in Lucene.
>> >
>> > The motivation/requirement came from a doc management system which used
>> > Juru as its search component. The system included document libraries
>> > where users could create files and upload documents. A user could belong
>> > to any number of libraries and complex ACLs model was used (down to the
>> > level of the file). ACLs and Folders were modeled as categories in the
>> > index (boolean-like terms). Files and folders could be moved around and
>> > access to a library/folder/file could be granted/revoked at any given
>> > time. Therefore, such updates usually affected hundreds (and thousands)
>> > of documents. Overall, the index managed millions of documents, tens of
>> > thousands of libraries and hundreds of thousands of ACLs (large
>> > organization :)). To get a rough understanding on the number of such
>> > updates - every several minutes, tens of thousands of documents were
>> > updated due to such changes only (in addition to the regular content
>> > updates).
>> >
>> > We were asked to support requests in the following form: "update all
>> > docs
>> > that match <criteria> --> do <operation>" where:
>> > * <criteria> was "a single doc", "docs belonging to a category" and
>> > "docs
>> > belonging to a set of categories".
>> > * <operation> was "add categories NEW" (NEW might not even exist in the
>> > index yet, or already associated w/ the document), "remove categories
>> > OLD"
>> > (irregardless if the docs were already associated w/ OLD or not) and
>> > "remove all OLD and add all NEW".
>> >
>> > The solution I've implemented to support these requests turned out to
>> > actually allow you to update every term (!) in the index: suppose that
>> > you have a table, which recorded tuples like <docid, term, +/->. The
>> > record <1, "ibm", '+'> means that doc 1 is associated w/ the term "ibm",
>> > and the record <1, "hp", '-'> means that the same document is not
>> > associated w/ the word "hp". Then, you could very easily ask for all
>> > documents that are assoicated w/ "hp", and the result would not include
>> > doc 1. Note that docid=1 is not the app Doc_ID, but the internal id the
>> > document received.
>> >
>> > I've kept two types of postings in the index: regular and updates.
>> > Taking the above examples, "ibm" regular posting looked like <"ibm", 1,
>> > 3, 1, 2, 5 ...> (dgaps) and the updates posting looked like <"ibm", +2,
>> > -3, +6, +10 ...> (absolute docid value, w/ a +/- sign). Similarly for
>> > "hp".
>> >
>> > During search time, when a query with the word "ibm" was submitted, I
>> > create a virtual posting which reads from both the regular and the
>> > updates, and merges them on the fly according to the +/- signs. Since
>> > both postings are sorted in ascending order, the merge is very
>> > efficient, and query time is hardly affected.
>> >
>> > Those postings are merged from time to time in a process that is similar
>> > to how Lucene works today, which keeps the update postings relatively
>> > small and manageable.
>> >
>> > Now here comes the fun part - how I think it can be implemented in
>> > Lucene !
>> >
>> > To be honest, this sat on my TODO list for a very long time and only a
>> > couple of months ago I figured out how to implement it in Lucene. The
>> > main difficulty I had was around the difference between the write-once
>> > policy in Juru and Lucene - in Lucene, once a segment is written, it
>> > cannot be changed. BUT, I've only recently realized that this isn't
>> > exactly true, because deleted docs do change existing segments. The
>> > deletes are kept in a separate file to the segment (.del) and have their
>> > own generation. Deletes, as I understood then, and Grant helped me term
>> > them better, can be defined as "Stacked Segments" - they add data to a
>> > segment, which from time to time are integrated into the segment (unlike
>> > Photoshop Layers, but my understanding of Photoshop is limited). And the
>> > Lucene engine knows how to combine the two, giving precedence to the
>> > deletes.
>> >
>> > By introducing an "Updates Stacked Segment", we can encode postings w/
>> > the '+'/'-' signs, and when TermDocs/Positions is requested, we can
>> > create a variation which merges the two lists. When segments are merged,
>> > the updates will be merged into the regular postings (just like deletes)
>> > and thus will be gone. In addition, this plays very nicely with readers
>> > that are currently reading the index, as well as we can have generations
>> > for the updates - really like deletes !
>> >
>> > I think that Lucene's architecture allows for such a solution very
>> > cleanly and nicely (and I believe flex makes it even easier). We can
>> > (later, after you've digested the idea) discuss whether this should be
>> > built into the current IW, or an extension like UpdateableIW. The API
>> > I've been thinking about should really be like deletes, allowing to
>> > update docs based on Term or Query. I defer the API discussion for later
>> > for now.
>> >
>> > As for stored fields, this was a real challenge to support in Juru, but
>> > I think that w/ "Stacked Segments" and Lucene's architecture, this
>> > should
>> > be much easier - adding stacked stored fields ...
>> >
>> > As you've noticed, the update postings are not DGap encoded, and sign
>> > needs to be preserved. While I haven't implemented it in Juru, I think
>> > that perhaps this can be improved by keeping the '-' and '+' lists
>> > separated. We will need to register somewhere which came before which
>> > because order matters a lot here (and I'm not talking about concurrency
>> > - simple update instructions order). I have some idea how this can be
>> > achieved, but I refrain from describing it now, to not make this email
>> > even longer :).
>> >
>> > I've mentioned that this approach can be applied to any term and not
>> > just categories under some circumstances. Basically, as soon as you
>> > update a term, its DF is no longer true, unless you are able to take the
>> > updates into account. We can defer the discussion on that, but clearly
>> > for many fields, incrementally update them should not affect precision,
>> > as they're not used for that type of scoring ... Maybe, by keeping
>> > separate '+' and '-' lists we can compute statistics precisely. And I
>> > haven't given much thought yet to how this and Mike's flex scoring will
>> > be integrated.
>> >
>> > BTW, a word on Parallel Indexing - the two are completely orthogonal.
>> > Once PI is introduced, one can index all the updateable fields in a
>> > dedicated slice, for perhaps improving search performance for slices
>> > that are not updateable (not involving code which attempts to read and
>> > merge update and regular lists on the fly). Also, incremental field
>> > updates support all of PI's scenarios, even though some will be done
>> > more efficiently w/ PI. But this too is a matter for a separate
>> > discussion :).
>> >
>> > That's it ! I believe I've given you all the details I have about the
>> > approach and high level proposed solution for Lucene. Perhaps some
>> > details slipped my mind, but if you ask the right questions, I'm sure
>> > I'll be able to answer them :). I would like to emphasize that since
>> > this was already implemented (in Juru) - this is more than just a "I
>> > think this approach can work" proposal ...
>> >
>> > I would appreciate your comments on this. I would like to start
>> > implementing it soon, and so as a first step, please share your comments
>> > on the overall approach. I'll then write a more detailed description on
>> > how I think to impl it in Lucene (been spending some time on that), and
>> > we can have more detailed (and fun) discussions on the low level
>> > details.
>> >
>> > Shai
>> >
>> > On Fri, Apr 9, 2010 at 5:05 AM, Babak Farhang <farhang@gmail.com> wrote:
>> >>
>> >> Good point. I meant the model at the document level: i.e. what
>> >> milestones does a document go through in its life cycle. Today:
>> >>
>> >> created --> deleted
>> >>
>> >> With incremental updates:
>> >>
>> >> created --> update1 --> update2 --> deleted
>> >>
>> >> I think what I'm trying to say is that this second threaded sequence
>> >> of state changes seems intuitively more fragile under concurrent
>> >> scenarios.  So for example, in a lock-free design, the system would
>> >> also have to anticipate the following sequence of events:
>> >>
>> >> created --> update1 --> deleted --> update2
>> >>
>> >> and consider update2 a null op.  I'm imagining there are other cases
>> >> that I can't think of..
>> >>
>> >> -Babak
>> >>
>> >>
>> >> On Tue, Apr 6, 2010 at 3:40 AM, Michael McCandless
>> >> <lucene@mikemccandless.com> wrote:
>> >> > write once, plus the option to the app to keep multiple commit points
>> >> > around (by customizing the deletion policy).
>> >> >
>> >> > Actually order of operations / commits very much matters in Lucene
>> >> > today.
>> >> >
>> >> > Deletions are not idempotent: if you add a doc w/ term X, delete by
>> >> > term X, add a new doc with term X... that's very different than if
>> >> > you
>> >> > moved the delete op to the end.  Ie the deletion only applies to the
>> >> > docs added before it.
>> >> >
>> >> > Mike
>> >> >
>> >> > On Mon, Apr 5, 2010 at 12:45 AM, Babak Farhang <farhang@gmail.com>
>> >> > wrote:
>> >> >> Sure. Because of the write once principle.  But at some cost
>> >> >> (duplicated data). I was just agreeing that it would not be a good
>> >> >> idea to bake in version-ing by keeping the layers around forever
in
>> >> >> a
>> >> >> merged index; I wasn't keying in on transactions per se.
>> >> >>
>> >> >> Speaking of transactions: I'm not sure if we should worry about
this
>> >> >> much yet, but with "updates" the order of the transaction commits
>> >> >> seems important. I think commit order is less important today in
>> >> >> Lucene because its model supports only 2 types of events: document
>> >> >> creation--which only happens once, and document deletion, which
is
>> >> >> idempotent.  What do you think? Will commits have to be ordered
if
>> >> >> we
>> >> >> introduce updates?  Or does the onus of maintaining order fall
on
>> >> >> the
>> >> >> application?
>> >> >>
>> >> >> -Babak
>> >> >>
>> >> >> On Sat, Apr 3, 2010 at 3:28 AM, Michael McCandless
>> >> >> <lucene@mikemccandless.com> wrote:
>> >> >>> On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhang <farhang@gmail.com>
>> >> >>> wrote:
>> >> >>>>> I think they get merged in by the merger, ideally in
the
>> >> >>>>> background.
>> >> >>>>
>> >> >>>> That sounds sensible. (In other words, we wont concern
ourselves
>> >> >>>> with
>> >> >>>> roll backs--something possible while a "layer" is still
around.)
>> >> >>>
>> >> >>> Actually roll backs would still be very possible even if layers
are
>> >> >>> merged.
>> >> >>>
>> >> >>> Ie, one could keep multiple commits around, and the older commits
>> >> >>> would still be referring to the old postings + layers, keeping
them
>> >> >>> alive.
>> >> >>>
>> >> >>> Lucene would still be transactional with such an approach.
>> >> >>>
>> >> >>> Mike
>> >> >>>
>> >> >>>
>> >> >>> ---------------------------------------------------------------------
>> >> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >> >>>
>> >> >>>
>> >> >>
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >> >>
>> >> >>
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> > For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >> >
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >>
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message