lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shai Erera <ser...@gmail.com>
Subject Re: Incremental Field Updates
Date Sun, 09 May 2010 11:38:52 GMT
>
> When I update a field, I
> want to update all of it, not just part of it. No?
>

Well ... might be. But the most common case for us are fields to which we
want to add data or remove. For example ACLs - you don't want to replace the
entire field w/ the document, but simply to add/remove access for certain
people/groups. Same goes for "social" fields, like tags, ratings, bookmarks
etc. - the granularity of the update is to associate/dissociate a particular
value w/ the field + doc, and not update the entire field.

Shai

On Sun, May 9, 2010 at 10:31 AM, Babak Farhang <farhang@gmail.com> wrote:

> > No, actually, you can update index-only fields also. It all depends on
> the
> > operation that you ask to do on the fields.
>
> I love the level of control this provides, but again, I was talking at
> the user level.
>
> > If you want to e.g. remove an entire field w/ such update operation, then
> it
> > becomes more expensive
>
> That's the typical usage scenario, I imagine. When I update a field, I
> want to update all of it, not just part of it. No?
>
> (The lower level semantics of twiddling with the postings is poorly
> understood by the typical user..)
>
> > We didn't
> > face such a scenario though, and I think it's probably a rare one?
>
> As rare as any time you want to update an indexed-only field.  Not a
> serious limitation (but perhaps worth noting?)
> Perhaps at the API level, you provide an updateIndexedOnlyField that
> takes the old value as well as the new value for the field.
>
> Anyway, I think your approach would be a great addition to the
> toolkit. Would love to see it even in rough cut form :))
>
> -Babak
>
> On Sun, May 9, 2010 at 12:49 AM, Shai Erera <serera@gmail.com> wrote:
> > No, actually, you can update index-only fields also. It all depends on
> the
> > operation that you ask to do on the fields. For example, if the query to
> > execute is something like "update all documents w/ tags:ibm -> remove
> terms
> > t1, t2, t3 and add terms t4, t5", then the result of such request would
> > dissociate t1-3 from those docs that answer tags:ibm and associate them
> w/
> > t4 and t5. Specifically, if docs 1, 4, 10 answer tags:ibm, then the
> > following posting updates will be done:
> > t1: -1, -4, -10
> > t2: -1, -4, -10
> > t3: -1, -4, -10
> > t4, 1, 4, 10
> > t5, 1, 4, 10
> > (in addition to whatever other updates that are associated with those
> > postings).
> >
> > At search time, if you search for "t1 OR t2", then the regular t1 and t2
> > postings will be merged on-the-fly w/ the updated ones to remove docs 1,
> 4,
> > 10.
> >
> > If you want to e.g. remove an entire field w/ such update operation, then
> it
> > becomes more expensive, but in general you'd need to iterate over the
> > field's terms and dissociate the documents from all the terms. We didn't
> > face such a scenario though, and I think it's probably a rare one?
> >
> > Shai
> >
> > On Sun, May 9, 2010 at 9:39 AM, Babak Farhang <farhang@gmail.com> wrote:
> >>
> >> Shai,
> >>
> >> I think this is an interesting approach. I can see how I could
> >> [incrementally] update a stored, indexed field this way, but I don't
> >> understand how I could update an indexed-only field. Here's why: for a
> >> stored (and indexed) field, I can always determine what terms to
> >> remove ('-') from the postings, but for an indexed-only field I'd have
> >> no [practical] way to know..
> >>
> >> So under this approach,  I'm thinking at a user level, a Lucene field
> >> would be updateable only if it's at least stored.
> >>
> >> Is that right?
> >>
> >> -Babak
> >>
> >> On Wed, May 5, 2010 at 11:17 AM, Shai Erera <serera@gmail.com> wrote:
> >> > Yes Mike - I don't know yet if two MPs will be used, one for the
> stacked
> >> > segments and one for the general segments (which will include the
> >> > stacked
> >> > ones as well) .. feels like one MP should be enough, but this can be
> >> > decided
> >> > on as we progress.
> >> >
> >> > This approach allows you to update every term in every already indexed
> >> > field, as well as add terms to already indexed fields ... and add
> >> > totally
> >> > new fields, with lots of text in them. So e.g. there are two neat use
> >> > cases
> >> > that come to mind:
> >> > 1) Allow users to rate search results, favor them etc.
> >> > 2) Or even comment them,
> >> > I think Google offers the 2nd. Both translate into updating the search
> >> > result's already indexed document w/ the new rating, comment etc. w/o
> >> > needing to reindex the document.
> >> >
> >> > I will try to find perf results numbers. It's been long time ago, hope
> >> > all
> >> > the documents are still where they were :). Indeed, for terms like
> ACLs,
> >> > it
> >> > means that each query had to merge dozens of postings lists. For that
> I
> >> > implemented an alternative solution, which uses a payload-like
> structure
> >> > that registers for each document the list of ACLs that are associated
> >> > with
> >> > it (not as strings, it was more efficient). Then, if the query
> included
> >> > dozens of such terms, I created a filter-like object which for every
> >> > matching document by the query checked if it matches the ACLs list of
> >> > the
> >> > document. This is usually slower, because the ACLs themselves don't
> >> > drive
> >> > the query, which means more matches will be found. That was a tradeoff
> >> > which
> >> > one could configure based on the number of such terms in the query,
> the
> >> > number of updated terms etc.
> >> >
> >> > But in essence you're right - if the solution is generic to cover any
> >> > term,
> >> > we cannot use such payload-based feature. We might need to merge the
> >> > stacked
> >> > segments more frequently. This is pending perf testing though.
> >> >
> >> > Shai
> >> >
> >> > On Wed, May 5, 2010 at 6:54 PM, Michael McCandless
> >> > <lucene@mikemccandless.com> wrote:
> >> >>
> >> >> Catching up here :)
> >> >>
> >> >> This is great stuff Shai -- I like the notion of "negative" postings,
> >> >> that "subtract" docs from previous generations as you iterate them.
> >> >>
> >> >> And I like the term "stacked segments".  This fits very well with
> >> >> Lucene's write-once approach, ie, a writer can at any time stack a
> new
> >> >> segment (writes to new files) "over" an old segment, like the layers
> >> >> in photoshop.  A reader merges all stacks on-the-fly when reading.
> >> >>
> >> >> And the merge policy now picks from 2 dimensions right?  Ie it may
> >> >> want to simply consolidate stacks on an old segment but otherwise not
> >> >> merge that segment (eg for very large segments that have accumulated
> >> >> updates), and normal merging would of course consolidate all stacks
> >> >> for all segments merged.
> >> >>
> >> >> Wouldn't this approach conceivably allow you to alter single terms
> >> >> within a single field (we'd have to figure out how we'd expose the
> API
> >> >> for this)?  EG if I appended some text to an already-indexed field?
> >> >> (In addition to adding a new field to an already indexed doc, or
> >> >> replacing an indexed field on a previously indexed doc).
> >> >>
> >> >> Did you have any hard perf numbers?  Merge sorting N streams is
> >> >> surprisingly costly... we may need/want to have a reader pre-merge
> >> >> (using up RAM) any "long tail" of stacked segments as long as they
> are
> >> >> small enough...
> >> >>
> >> >> This sounds great!!
> >> >>
> >> >> Mike
> >> >>
> >> >> On Sun, Apr 25, 2010 at 7:32 AM, Shai Erera <serera@gmail.com>
> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > WARNING: following email is a bit long, but I think is worth the
> >> >> > reading
> >> >> > :)
> >> >> >
> >> >> > I would like to describe my implementation of incremental field
> >> >> > updates
> >> >> > in Juru (the former search library I've worked on in IBM). I will
> >> >> > later
> >> >> > discuss how I think it can be implemented in Lucene.
> >> >> >
> >> >> > The motivation/requirement came from a doc management system which
> >> >> > used
> >> >> > Juru as its search component. The system included document
> libraries
> >> >> > where users could create files and upload documents. A user could
> >> >> > belong
> >> >> > to any number of libraries and complex ACLs model was used (down
to
> >> >> > the
> >> >> > level of the file). ACLs and Folders were modeled as categories
in
> >> >> > the
> >> >> > index (boolean-like terms). Files and folders could be moved around
> >> >> > and
> >> >> > access to a library/folder/file could be granted/revoked at any
> given
> >> >> > time. Therefore, such updates usually affected hundreds (and
> >> >> > thousands)
> >> >> > of documents. Overall, the index managed millions of documents,
> tens
> >> >> > of
> >> >> > thousands of libraries and hundreds of thousands of ACLs (large
> >> >> > organization :)). To get a rough understanding on the number of
> such
> >> >> > updates - every several minutes, tens of thousands of documents
> were
> >> >> > updated due to such changes only (in addition to the regular
> content
> >> >> > updates).
> >> >> >
> >> >> > We were asked to support requests in the following form: "update
> all
> >> >> > docs
> >> >> > that match <criteria> --> do <operation>" where:
> >> >> > * <criteria> was "a single doc", "docs belonging to a category"
and
> >> >> > "docs
> >> >> > belonging to a set of categories".
> >> >> > * <operation> was "add categories NEW" (NEW might not even
exist in
> >> >> > the
> >> >> > index yet, or already associated w/ the document), "remove
> categories
> >> >> > OLD"
> >> >> > (irregardless if the docs were already associated w/ OLD or not)
> and
> >> >> > "remove all OLD and add all NEW".
> >> >> >
> >> >> > The solution I've implemented to support these requests turned
out
> to
> >> >> > actually allow you to update every term (!) in the index: suppose
> >> >> > that
> >> >> > you have a table, which recorded tuples like <docid, term,
+/->.
> The
> >> >> > record <1, "ibm", '+'> means that doc 1 is associated w/
the term
> >> >> > "ibm",
> >> >> > and the record <1, "hp", '-'> means that the same document
is not
> >> >> > associated w/ the word "hp". Then, you could very easily ask for
> all
> >> >> > documents that are assoicated w/ "hp", and the result would not
> >> >> > include
> >> >> > doc 1. Note that docid=1 is not the app Doc_ID, but the internal
id
> >> >> > the
> >> >> > document received.
> >> >> >
> >> >> > I've kept two types of postings in the index: regular and updates.
> >> >> > Taking the above examples, "ibm" regular posting looked like
> <"ibm",
> >> >> > 1,
> >> >> > 3, 1, 2, 5 ...> (dgaps) and the updates posting looked like
<"ibm",
> >> >> > +2,
> >> >> > -3, +6, +10 ...> (absolute docid value, w/ a +/- sign). Similarly
> for
> >> >> > "hp".
> >> >> >
> >> >> > During search time, when a query with the word "ibm" was submitted,
> I
> >> >> > create a virtual posting which reads from both the regular and
the
> >> >> > updates, and merges them on the fly according to the +/- signs.
> Since
> >> >> > both postings are sorted in ascending order, the merge is very
> >> >> > efficient, and query time is hardly affected.
> >> >> >
> >> >> > Those postings are merged from time to time in a process that
is
> >> >> > similar
> >> >> > to how Lucene works today, which keeps the update postings
> relatively
> >> >> > small and manageable.
> >> >> >
> >> >> > Now here comes the fun part - how I think it can be implemented
in
> >> >> > Lucene !
> >> >> >
> >> >> > To be honest, this sat on my TODO list for a very long time and
> only
> >> >> > a
> >> >> > couple of months ago I figured out how to implement it in Lucene.
> The
> >> >> > main difficulty I had was around the difference between the
> >> >> > write-once
> >> >> > policy in Juru and Lucene - in Lucene, once a segment is written,
> it
> >> >> > cannot be changed. BUT, I've only recently realized that this
isn't
> >> >> > exactly true, because deleted docs do change existing segments.
The
> >> >> > deletes are kept in a separate file to the segment (.del) and
have
> >> >> > their
> >> >> > own generation. Deletes, as I understood then, and Grant helped
me
> >> >> > term
> >> >> > them better, can be defined as "Stacked Segments" - they add data
> to
> >> >> > a
> >> >> > segment, which from time to time are integrated into the segment
> >> >> > (unlike
> >> >> > Photoshop Layers, but my understanding of Photoshop is limited).
> And
> >> >> > the
> >> >> > Lucene engine knows how to combine the two, giving precedence
to
> the
> >> >> > deletes.
> >> >> >
> >> >> > By introducing an "Updates Stacked Segment", we can encode postings
> >> >> > w/
> >> >> > the '+'/'-' signs, and when TermDocs/Positions is requested, we
can
> >> >> > create a variation which merges the two lists. When segments are
> >> >> > merged,
> >> >> > the updates will be merged into the regular postings (just like
> >> >> > deletes)
> >> >> > and thus will be gone. In addition, this plays very nicely with
> >> >> > readers
> >> >> > that are currently reading the index, as well as we can have
> >> >> > generations
> >> >> > for the updates - really like deletes !
> >> >> >
> >> >> > I think that Lucene's architecture allows for such a solution
very
> >> >> > cleanly and nicely (and I believe flex makes it even easier).
We
> can
> >> >> > (later, after you've digested the idea) discuss whether this should
> >> >> > be
> >> >> > built into the current IW, or an extension like UpdateableIW.
The
> API
> >> >> > I've been thinking about should really be like deletes, allowing
to
> >> >> > update docs based on Term or Query. I defer the API discussion
for
> >> >> > later
> >> >> > for now.
> >> >> >
> >> >> > As for stored fields, this was a real challenge to support in
Juru,
> >> >> > but
> >> >> > I think that w/ "Stacked Segments" and Lucene's architecture,
this
> >> >> > should
> >> >> > be much easier - adding stacked stored fields ...
> >> >> >
> >> >> > As you've noticed, the update postings are not DGap encoded, and
> sign
> >> >> > needs to be preserved. While I haven't implemented it in Juru,
I
> >> >> > think
> >> >> > that perhaps this can be improved by keeping the '-' and '+' lists
> >> >> > separated. We will need to register somewhere which came before
> which
> >> >> > because order matters a lot here (and I'm not talking about
> >> >> > concurrency
> >> >> > - simple update instructions order). I have some idea how this
can
> be
> >> >> > achieved, but I refrain from describing it now, to not make this
> >> >> > email
> >> >> > even longer :).
> >> >> >
> >> >> > I've mentioned that this approach can be applied to any term and
> not
> >> >> > just categories under some circumstances. Basically, as soon as
you
> >> >> > update a term, its DF is no longer true, unless you are able to
> take
> >> >> > the
> >> >> > updates into account. We can defer the discussion on that, but
> >> >> > clearly
> >> >> > for many fields, incrementally update them should not affect
> >> >> > precision,
> >> >> > as they're not used for that type of scoring ... Maybe, by keeping
> >> >> > separate '+' and '-' lists we can compute statistics precisely.
And
> I
> >> >> > haven't given much thought yet to how this and Mike's flex scoring
> >> >> > will
> >> >> > be integrated.
> >> >> >
> >> >> > BTW, a word on Parallel Indexing - the two are completely
> orthogonal.
> >> >> > Once PI is introduced, one can index all the updateable fields
in a
> >> >> > dedicated slice, for perhaps improving search performance for
> slices
> >> >> > that are not updateable (not involving code which attempts to
read
> >> >> > and
> >> >> > merge update and regular lists on the fly). Also, incremental
field
> >> >> > updates support all of PI's scenarios, even though some will be
> done
> >> >> > more efficiently w/ PI. But this too is a matter for a separate
> >> >> > discussion :).
> >> >> >
> >> >> > That's it ! I believe I've given you all the details I have about
> the
> >> >> > approach and high level proposed solution for Lucene. Perhaps
some
> >> >> > details slipped my mind, but if you ask the right questions, I'm
> sure
> >> >> > I'll be able to answer them :). I would like to emphasize that
> since
> >> >> > this was already implemented (in Juru) - this is more than just
a
> "I
> >> >> > think this approach can work" proposal ...
> >> >> >
> >> >> > I would appreciate your comments on this. I would like to start
> >> >> > implementing it soon, and so as a first step, please share your
> >> >> > comments
> >> >> > on the overall approach. I'll then write a more detailed
> description
> >> >> > on
> >> >> > how I think to impl it in Lucene (been spending some time on that),
> >> >> > and
> >> >> > we can have more detailed (and fun) discussions on the low level
> >> >> > details.
> >> >> >
> >> >> > Shai
> >> >> >
> >> >> > On Fri, Apr 9, 2010 at 5:05 AM, Babak Farhang <farhang@gmail.com>
> >> >> > wrote:
> >> >> >>
> >> >> >> Good point. I meant the model at the document level: i.e.
what
> >> >> >> milestones does a document go through in its life cycle. Today:
> >> >> >>
> >> >> >> created --> deleted
> >> >> >>
> >> >> >> With incremental updates:
> >> >> >>
> >> >> >> created --> update1 --> update2 --> deleted
> >> >> >>
> >> >> >> I think what I'm trying to say is that this second threaded
> sequence
> >> >> >> of state changes seems intuitively more fragile under concurrent
> >> >> >> scenarios.  So for example, in a lock-free design, the system
> would
> >> >> >> also have to anticipate the following sequence of events:
> >> >> >>
> >> >> >> created --> update1 --> deleted --> update2
> >> >> >>
> >> >> >> and consider update2 a null op.  I'm imagining there are other
> cases
> >> >> >> that I can't think of..
> >> >> >>
> >> >> >> -Babak
> >> >> >>
> >> >> >>
> >> >> >> On Tue, Apr 6, 2010 at 3:40 AM, Michael McCandless
> >> >> >> <lucene@mikemccandless.com> wrote:
> >> >> >> > write once, plus the option to the app to keep multiple
commit
> >> >> >> > points
> >> >> >> > around (by customizing the deletion policy).
> >> >> >> >
> >> >> >> > Actually order of operations / commits very much matters
in
> Lucene
> >> >> >> > today.
> >> >> >> >
> >> >> >> > Deletions are not idempotent: if you add a doc w/ term
X, delete
> >> >> >> > by
> >> >> >> > term X, add a new doc with term X... that's very different
than
> if
> >> >> >> > you
> >> >> >> > moved the delete op to the end.  Ie the deletion only
applies to
> >> >> >> > the
> >> >> >> > docs added before it.
> >> >> >> >
> >> >> >> > Mike
> >> >> >> >
> >> >> >> > On Mon, Apr 5, 2010 at 12:45 AM, Babak Farhang <
> farhang@gmail.com>
> >> >> >> > wrote:
> >> >> >> >> Sure. Because of the write once principle.  But at
some cost
> >> >> >> >> (duplicated data). I was just agreeing that it would
not be a
> >> >> >> >> good
> >> >> >> >> idea to bake in version-ing by keeping the layers
around
> forever
> >> >> >> >> in
> >> >> >> >> a
> >> >> >> >> merged index; I wasn't keying in on transactions
per se.
> >> >> >> >>
> >> >> >> >> Speaking of transactions: I'm not sure if we should
worry about
> >> >> >> >> this
> >> >> >> >> much yet, but with "updates" the order of the transaction
> commits
> >> >> >> >> seems important. I think commit order is less important
today
> in
> >> >> >> >> Lucene because its model supports only 2 types of
events:
> >> >> >> >> document
> >> >> >> >> creation--which only happens once, and document deletion,
which
> >> >> >> >> is
> >> >> >> >> idempotent.  What do you think? Will commits have
to be ordered
> >> >> >> >> if
> >> >> >> >> we
> >> >> >> >> introduce updates?  Or does the onus of maintaining
order fall
> on
> >> >> >> >> the
> >> >> >> >> application?
> >> >> >> >>
> >> >> >> >> -Babak
> >> >> >> >>
> >> >> >> >> On Sat, Apr 3, 2010 at 3:28 AM, Michael McCandless
> >> >> >> >> <lucene@mikemccandless.com> wrote:
> >> >> >> >>> On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhang
> >> >> >> >>> <farhang@gmail.com>
> >> >> >> >>> wrote:
> >> >> >> >>>>> I think they get merged in by the merger,
ideally in the
> >> >> >> >>>>> background.
> >> >> >> >>>>
> >> >> >> >>>> That sounds sensible. (In other words, we
wont concern
> >> >> >> >>>> ourselves
> >> >> >> >>>> with
> >> >> >> >>>> roll backs--something possible while a "layer"
is still
> >> >> >> >>>> around.)
> >> >> >> >>>
> >> >> >> >>> Actually roll backs would still be very possible
even if
> layers
> >> >> >> >>> are
> >> >> >> >>> merged.
> >> >> >> >>>
> >> >> >> >>> Ie, one could keep multiple commits around, and
the older
> >> >> >> >>> commits
> >> >> >> >>> would still be referring to the old postings
+ layers, keeping
> >> >> >> >>> them
> >> >> >> >>> alive.
> >> >> >> >>>
> >> >> >> >>> Lucene would still be transactional with such
an approach.
> >> >> >> >>>
> >> >> >> >>> Mike
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>>
> ---------------------------------------------------------------------
> >> >> >> >>> To unsubscribe, e-mail:
> java-dev-unsubscribe@lucene.apache.org
> >> >> >> >>> For additional commands, e-mail:
> java-dev-help@lucene.apache.org
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> ---------------------------------------------------------------------
> >> >> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> >> >> >> For additional commands, e-mail:
> java-dev-help@lucene.apache.org
> >> >> >> >>
> >> >> >> >>
> >> >> >> >
> >> >> >> >
> >> >> >> >
> ---------------------------------------------------------------------
> >> >> >> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> >> >> > For additional commands, e-mail:
> java-dev-help@lucene.apache.org
> >> >> >> >
> >> >> >> >
> >> >> >>
> >> >> >>
> >> >> >>
> ---------------------------------------------------------------------
> >> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> >> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >> >> >>
> >> >> >
> >> >> >
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >> >>
> >> >
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Mime
View raw message