lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Babak Farhang <farh...@gmail.com>
Subject Re: Incremental Field Updates
Date Sun, 09 May 2010 07:31:25 GMT
> No, actually, you can update index-only fields also. It all depends on the
> operation that you ask to do on the fields.

I love the level of control this provides, but again, I was talking at
the user level.

> If you want to e.g. remove an entire field w/ such update operation, then it
> becomes more expensive

That's the typical usage scenario, I imagine. When I update a field, I
want to update all of it, not just part of it. No?

(The lower level semantics of twiddling with the postings is poorly
understood by the typical user..)

> We didn't
> face such a scenario though, and I think it's probably a rare one?

As rare as any time you want to update an indexed-only field.  Not a
serious limitation (but perhaps worth noting?)
Perhaps at the API level, you provide an updateIndexedOnlyField that
takes the old value as well as the new value for the field.

Anyway, I think your approach would be a great addition to the
toolkit. Would love to see it even in rough cut form :))

-Babak

On Sun, May 9, 2010 at 12:49 AM, Shai Erera <serera@gmail.com> wrote:
> No, actually, you can update index-only fields also. It all depends on the
> operation that you ask to do on the fields. For example, if the query to
> execute is something like "update all documents w/ tags:ibm -> remove terms
> t1, t2, t3 and add terms t4, t5", then the result of such request would
> dissociate t1-3 from those docs that answer tags:ibm and associate them w/
> t4 and t5. Specifically, if docs 1, 4, 10 answer tags:ibm, then the
> following posting updates will be done:
> t1: -1, -4, -10
> t2: -1, -4, -10
> t3: -1, -4, -10
> t4, 1, 4, 10
> t5, 1, 4, 10
> (in addition to whatever other updates that are associated with those
> postings).
>
> At search time, if you search for "t1 OR t2", then the regular t1 and t2
> postings will be merged on-the-fly w/ the updated ones to remove docs 1, 4,
> 10.
>
> If you want to e.g. remove an entire field w/ such update operation, then it
> becomes more expensive, but in general you'd need to iterate over the
> field's terms and dissociate the documents from all the terms. We didn't
> face such a scenario though, and I think it's probably a rare one?
>
> Shai
>
> On Sun, May 9, 2010 at 9:39 AM, Babak Farhang <farhang@gmail.com> wrote:
>>
>> Shai,
>>
>> I think this is an interesting approach. I can see how I could
>> [incrementally] update a stored, indexed field this way, but I don't
>> understand how I could update an indexed-only field. Here's why: for a
>> stored (and indexed) field, I can always determine what terms to
>> remove ('-') from the postings, but for an indexed-only field I'd have
>> no [practical] way to know..
>>
>> So under this approach,  I'm thinking at a user level, a Lucene field
>> would be updateable only if it's at least stored.
>>
>> Is that right?
>>
>> -Babak
>>
>> On Wed, May 5, 2010 at 11:17 AM, Shai Erera <serera@gmail.com> wrote:
>> > Yes Mike - I don't know yet if two MPs will be used, one for the stacked
>> > segments and one for the general segments (which will include the
>> > stacked
>> > ones as well) .. feels like one MP should be enough, but this can be
>> > decided
>> > on as we progress.
>> >
>> > This approach allows you to update every term in every already indexed
>> > field, as well as add terms to already indexed fields ... and add
>> > totally
>> > new fields, with lots of text in them. So e.g. there are two neat use
>> > cases
>> > that come to mind:
>> > 1) Allow users to rate search results, favor them etc.
>> > 2) Or even comment them,
>> > I think Google offers the 2nd. Both translate into updating the search
>> > result's already indexed document w/ the new rating, comment etc. w/o
>> > needing to reindex the document.
>> >
>> > I will try to find perf results numbers. It's been long time ago, hope
>> > all
>> > the documents are still where they were :). Indeed, for terms like ACLs,
>> > it
>> > means that each query had to merge dozens of postings lists. For that I
>> > implemented an alternative solution, which uses a payload-like structure
>> > that registers for each document the list of ACLs that are associated
>> > with
>> > it (not as strings, it was more efficient). Then, if the query included
>> > dozens of such terms, I created a filter-like object which for every
>> > matching document by the query checked if it matches the ACLs list of
>> > the
>> > document. This is usually slower, because the ACLs themselves don't
>> > drive
>> > the query, which means more matches will be found. That was a tradeoff
>> > which
>> > one could configure based on the number of such terms in the query, the
>> > number of updated terms etc.
>> >
>> > But in essence you're right - if the solution is generic to cover any
>> > term,
>> > we cannot use such payload-based feature. We might need to merge the
>> > stacked
>> > segments more frequently. This is pending perf testing though.
>> >
>> > Shai
>> >
>> > On Wed, May 5, 2010 at 6:54 PM, Michael McCandless
>> > <lucene@mikemccandless.com> wrote:
>> >>
>> >> Catching up here :)
>> >>
>> >> This is great stuff Shai -- I like the notion of "negative" postings,
>> >> that "subtract" docs from previous generations as you iterate them.
>> >>
>> >> And I like the term "stacked segments".  This fits very well with
>> >> Lucene's write-once approach, ie, a writer can at any time stack a new
>> >> segment (writes to new files) "over" an old segment, like the layers
>> >> in photoshop.  A reader merges all stacks on-the-fly when reading.
>> >>
>> >> And the merge policy now picks from 2 dimensions right?  Ie it may
>> >> want to simply consolidate stacks on an old segment but otherwise not
>> >> merge that segment (eg for very large segments that have accumulated
>> >> updates), and normal merging would of course consolidate all stacks
>> >> for all segments merged.
>> >>
>> >> Wouldn't this approach conceivably allow you to alter single terms
>> >> within a single field (we'd have to figure out how we'd expose the API
>> >> for this)?  EG if I appended some text to an already-indexed field?
>> >> (In addition to adding a new field to an already indexed doc, or
>> >> replacing an indexed field on a previously indexed doc).
>> >>
>> >> Did you have any hard perf numbers?  Merge sorting N streams is
>> >> surprisingly costly... we may need/want to have a reader pre-merge
>> >> (using up RAM) any "long tail" of stacked segments as long as they are
>> >> small enough...
>> >>
>> >> This sounds great!!
>> >>
>> >> Mike
>> >>
>> >> On Sun, Apr 25, 2010 at 7:32 AM, Shai Erera <serera@gmail.com> wrote:
>> >> > Hi,
>> >> >
>> >> > WARNING: following email is a bit long, but I think is worth the
>> >> > reading
>> >> > :)
>> >> >
>> >> > I would like to describe my implementation of incremental field
>> >> > updates
>> >> > in Juru (the former search library I've worked on in IBM). I will
>> >> > later
>> >> > discuss how I think it can be implemented in Lucene.
>> >> >
>> >> > The motivation/requirement came from a doc management system which
>> >> > used
>> >> > Juru as its search component. The system included document libraries
>> >> > where users could create files and upload documents. A user could
>> >> > belong
>> >> > to any number of libraries and complex ACLs model was used (down to
>> >> > the
>> >> > level of the file). ACLs and Folders were modeled as categories in
>> >> > the
>> >> > index (boolean-like terms). Files and folders could be moved around
>> >> > and
>> >> > access to a library/folder/file could be granted/revoked at any given
>> >> > time. Therefore, such updates usually affected hundreds (and
>> >> > thousands)
>> >> > of documents. Overall, the index managed millions of documents, tens
>> >> > of
>> >> > thousands of libraries and hundreds of thousands of ACLs (large
>> >> > organization :)). To get a rough understanding on the number of such
>> >> > updates - every several minutes, tens of thousands of documents were
>> >> > updated due to such changes only (in addition to the regular content
>> >> > updates).
>> >> >
>> >> > We were asked to support requests in the following form: "update all
>> >> > docs
>> >> > that match <criteria> --> do <operation>" where:
>> >> > * <criteria> was "a single doc", "docs belonging to a category"
and
>> >> > "docs
>> >> > belonging to a set of categories".
>> >> > * <operation> was "add categories NEW" (NEW might not even exist
in
>> >> > the
>> >> > index yet, or already associated w/ the document), "remove categories
>> >> > OLD"
>> >> > (irregardless if the docs were already associated w/ OLD or not) and
>> >> > "remove all OLD and add all NEW".
>> >> >
>> >> > The solution I've implemented to support these requests turned out
to
>> >> > actually allow you to update every term (!) in the index: suppose
>> >> > that
>> >> > you have a table, which recorded tuples like <docid, term, +/->.
The
>> >> > record <1, "ibm", '+'> means that doc 1 is associated w/ the
term
>> >> > "ibm",
>> >> > and the record <1, "hp", '-'> means that the same document is
not
>> >> > associated w/ the word "hp". Then, you could very easily ask for all
>> >> > documents that are assoicated w/ "hp", and the result would not
>> >> > include
>> >> > doc 1. Note that docid=1 is not the app Doc_ID, but the internal id
>> >> > the
>> >> > document received.
>> >> >
>> >> > I've kept two types of postings in the index: regular and updates.
>> >> > Taking the above examples, "ibm" regular posting looked like <"ibm",
>> >> > 1,
>> >> > 3, 1, 2, 5 ...> (dgaps) and the updates posting looked like <"ibm",
>> >> > +2,
>> >> > -3, +6, +10 ...> (absolute docid value, w/ a +/- sign). Similarly
for
>> >> > "hp".
>> >> >
>> >> > During search time, when a query with the word "ibm" was submitted,
I
>> >> > create a virtual posting which reads from both the regular and the
>> >> > updates, and merges them on the fly according to the +/- signs. Since
>> >> > both postings are sorted in ascending order, the merge is very
>> >> > efficient, and query time is hardly affected.
>> >> >
>> >> > Those postings are merged from time to time in a process that is
>> >> > similar
>> >> > to how Lucene works today, which keeps the update postings relatively
>> >> > small and manageable.
>> >> >
>> >> > Now here comes the fun part - how I think it can be implemented in
>> >> > Lucene !
>> >> >
>> >> > To be honest, this sat on my TODO list for a very long time and only
>> >> > a
>> >> > couple of months ago I figured out how to implement it in Lucene. The
>> >> > main difficulty I had was around the difference between the
>> >> > write-once
>> >> > policy in Juru and Lucene - in Lucene, once a segment is written, it
>> >> > cannot be changed. BUT, I've only recently realized that this isn't
>> >> > exactly true, because deleted docs do change existing segments. The
>> >> > deletes are kept in a separate file to the segment (.del) and have
>> >> > their
>> >> > own generation. Deletes, as I understood then, and Grant helped me
>> >> > term
>> >> > them better, can be defined as "Stacked Segments" - they add data to
>> >> > a
>> >> > segment, which from time to time are integrated into the segment
>> >> > (unlike
>> >> > Photoshop Layers, but my understanding of Photoshop is limited). And
>> >> > the
>> >> > Lucene engine knows how to combine the two, giving precedence to the
>> >> > deletes.
>> >> >
>> >> > By introducing an "Updates Stacked Segment", we can encode postings
>> >> > w/
>> >> > the '+'/'-' signs, and when TermDocs/Positions is requested, we can
>> >> > create a variation which merges the two lists. When segments are
>> >> > merged,
>> >> > the updates will be merged into the regular postings (just like
>> >> > deletes)
>> >> > and thus will be gone. In addition, this plays very nicely with
>> >> > readers
>> >> > that are currently reading the index, as well as we can have
>> >> > generations
>> >> > for the updates - really like deletes !
>> >> >
>> >> > I think that Lucene's architecture allows for such a solution very
>> >> > cleanly and nicely (and I believe flex makes it even easier). We can
>> >> > (later, after you've digested the idea) discuss whether this should
>> >> > be
>> >> > built into the current IW, or an extension like UpdateableIW. The API
>> >> > I've been thinking about should really be like deletes, allowing to
>> >> > update docs based on Term or Query. I defer the API discussion for
>> >> > later
>> >> > for now.
>> >> >
>> >> > As for stored fields, this was a real challenge to support in Juru,
>> >> > but
>> >> > I think that w/ "Stacked Segments" and Lucene's architecture, this
>> >> > should
>> >> > be much easier - adding stacked stored fields ...
>> >> >
>> >> > As you've noticed, the update postings are not DGap encoded, and sign
>> >> > needs to be preserved. While I haven't implemented it in Juru, I
>> >> > think
>> >> > that perhaps this can be improved by keeping the '-' and '+' lists
>> >> > separated. We will need to register somewhere which came before which
>> >> > because order matters a lot here (and I'm not talking about
>> >> > concurrency
>> >> > - simple update instructions order). I have some idea how this can
be
>> >> > achieved, but I refrain from describing it now, to not make this
>> >> > email
>> >> > even longer :).
>> >> >
>> >> > I've mentioned that this approach can be applied to any term and not
>> >> > just categories under some circumstances. Basically, as soon as you
>> >> > update a term, its DF is no longer true, unless you are able to take
>> >> > the
>> >> > updates into account. We can defer the discussion on that, but
>> >> > clearly
>> >> > for many fields, incrementally update them should not affect
>> >> > precision,
>> >> > as they're not used for that type of scoring ... Maybe, by keeping
>> >> > separate '+' and '-' lists we can compute statistics precisely. And
I
>> >> > haven't given much thought yet to how this and Mike's flex scoring
>> >> > will
>> >> > be integrated.
>> >> >
>> >> > BTW, a word on Parallel Indexing - the two are completely orthogonal.
>> >> > Once PI is introduced, one can index all the updateable fields in a
>> >> > dedicated slice, for perhaps improving search performance for slices
>> >> > that are not updateable (not involving code which attempts to read
>> >> > and
>> >> > merge update and regular lists on the fly). Also, incremental field
>> >> > updates support all of PI's scenarios, even though some will be done
>> >> > more efficiently w/ PI. But this too is a matter for a separate
>> >> > discussion :).
>> >> >
>> >> > That's it ! I believe I've given you all the details I have about the
>> >> > approach and high level proposed solution for Lucene. Perhaps some
>> >> > details slipped my mind, but if you ask the right questions, I'm sure
>> >> > I'll be able to answer them :). I would like to emphasize that since
>> >> > this was already implemented (in Juru) - this is more than just a "I
>> >> > think this approach can work" proposal ...
>> >> >
>> >> > I would appreciate your comments on this. I would like to start
>> >> > implementing it soon, and so as a first step, please share your
>> >> > comments
>> >> > on the overall approach. I'll then write a more detailed description
>> >> > on
>> >> > how I think to impl it in Lucene (been spending some time on that),
>> >> > and
>> >> > we can have more detailed (and fun) discussions on the low level
>> >> > details.
>> >> >
>> >> > Shai
>> >> >
>> >> > On Fri, Apr 9, 2010 at 5:05 AM, Babak Farhang <farhang@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Good point. I meant the model at the document level: i.e. what
>> >> >> milestones does a document go through in its life cycle. Today:
>> >> >>
>> >> >> created --> deleted
>> >> >>
>> >> >> With incremental updates:
>> >> >>
>> >> >> created --> update1 --> update2 --> deleted
>> >> >>
>> >> >> I think what I'm trying to say is that this second threaded sequence
>> >> >> of state changes seems intuitively more fragile under concurrent
>> >> >> scenarios.  So for example, in a lock-free design, the system
would
>> >> >> also have to anticipate the following sequence of events:
>> >> >>
>> >> >> created --> update1 --> deleted --> update2
>> >> >>
>> >> >> and consider update2 a null op.  I'm imagining there are other
cases
>> >> >> that I can't think of..
>> >> >>
>> >> >> -Babak
>> >> >>
>> >> >>
>> >> >> On Tue, Apr 6, 2010 at 3:40 AM, Michael McCandless
>> >> >> <lucene@mikemccandless.com> wrote:
>> >> >> > write once, plus the option to the app to keep multiple commit
>> >> >> > points
>> >> >> > around (by customizing the deletion policy).
>> >> >> >
>> >> >> > Actually order of operations / commits very much matters in
Lucene
>> >> >> > today.
>> >> >> >
>> >> >> > Deletions are not idempotent: if you add a doc w/ term X,
delete
>> >> >> > by
>> >> >> > term X, add a new doc with term X... that's very different
than if
>> >> >> > you
>> >> >> > moved the delete op to the end.  Ie the deletion only applies
to
>> >> >> > the
>> >> >> > docs added before it.
>> >> >> >
>> >> >> > Mike
>> >> >> >
>> >> >> > On Mon, Apr 5, 2010 at 12:45 AM, Babak Farhang <farhang@gmail.com>
>> >> >> > wrote:
>> >> >> >> Sure. Because of the write once principle.  But at some
cost
>> >> >> >> (duplicated data). I was just agreeing that it would not
be a
>> >> >> >> good
>> >> >> >> idea to bake in version-ing by keeping the layers around
forever
>> >> >> >> in
>> >> >> >> a
>> >> >> >> merged index; I wasn't keying in on transactions per se.
>> >> >> >>
>> >> >> >> Speaking of transactions: I'm not sure if we should worry
about
>> >> >> >> this
>> >> >> >> much yet, but with "updates" the order of the transaction
commits
>> >> >> >> seems important. I think commit order is less important
today in
>> >> >> >> Lucene because its model supports only 2 types of events:
>> >> >> >> document
>> >> >> >> creation--which only happens once, and document deletion,
which
>> >> >> >> is
>> >> >> >> idempotent.  What do you think? Will commits have to
be ordered
>> >> >> >> if
>> >> >> >> we
>> >> >> >> introduce updates?  Or does the onus of maintaining order
fall on
>> >> >> >> the
>> >> >> >> application?
>> >> >> >>
>> >> >> >> -Babak
>> >> >> >>
>> >> >> >> On Sat, Apr 3, 2010 at 3:28 AM, Michael McCandless
>> >> >> >> <lucene@mikemccandless.com> wrote:
>> >> >> >>> On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhang
>> >> >> >>> <farhang@gmail.com>
>> >> >> >>> wrote:
>> >> >> >>>>> I think they get merged in by the merger,
ideally in the
>> >> >> >>>>> background.
>> >> >> >>>>
>> >> >> >>>> That sounds sensible. (In other words, we wont
concern
>> >> >> >>>> ourselves
>> >> >> >>>> with
>> >> >> >>>> roll backs--something possible while a "layer"
is still
>> >> >> >>>> around.)
>> >> >> >>>
>> >> >> >>> Actually roll backs would still be very possible even
if layers
>> >> >> >>> are
>> >> >> >>> merged.
>> >> >> >>>
>> >> >> >>> Ie, one could keep multiple commits around, and the
older
>> >> >> >>> commits
>> >> >> >>> would still be referring to the old postings + layers,
keeping
>> >> >> >>> them
>> >> >> >>> alive.
>> >> >> >>>
>> >> >> >>> Lucene would still be transactional with such an approach.
>> >> >> >>>
>> >> >> >>> Mike
>> >> >> >>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> ---------------------------------------------------------------------
>> >> >> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> >> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >> >> >>>
>> >> >> >>>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> ---------------------------------------------------------------------
>> >> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> >> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >> >> >>
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >> > ---------------------------------------------------------------------
>> >> >> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> >> > For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >> >> >
>> >> >> >
>> >> >>
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >> >>
>> >> >
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message