Return-Path: Delivered-To: apmail-lucene-dev-archive@www.apache.org Received: (qmail 34291 invoked from network); 9 May 2010 06:50:24 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 9 May 2010 06:50:24 -0000 Received: (qmail 64797 invoked by uid 500); 9 May 2010 06:50:23 -0000 Delivered-To: apmail-lucene-dev-archive@lucene.apache.org Received: (qmail 64728 invoked by uid 500); 9 May 2010 06:50:23 -0000 Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@lucene.apache.org Delivered-To: mailing list dev@lucene.apache.org Received: (qmail 64721 invoked by uid 99); 9 May 2010 06:50:22 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 09 May 2010 06:50:22 +0000 X-ASF-Spam-Status: No, hits=1.7 required=10.0 tests=AWL,FREEMAIL_FROM,HTML_MESSAGE,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of serera@gmail.com designates 209.85.161.48 as permitted sender) Received: from [209.85.161.48] (HELO mail-fx0-f48.google.com) (209.85.161.48) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 09 May 2010 06:50:17 +0000 Received: by fxm16 with SMTP id 16so1700934fxm.35 for ; Sat, 08 May 2010 23:49:55 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:received:in-reply-to :references:date:message-id:subject:from:to:content-type; bh=fbpV0gKSeVzM3desRADGXRnIjFz4g8HQ1o4ctSz3P7g=; b=luurBHyji0I36e6/b3cbnxs7hzqObRjf6wr0tcsRSNOb5dv35qFwKb1E3b6L4TUpzU cXGgVaRL3rGQ/E2aBrFlXEL6kqj+u8rg3xDZsbP1k7JUsbWOH6t0UD6aXxJeph65Wjyb m6kPCx4N4GJE4M3q4SR3xGC6N1sL6L0pGeR44= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; b=QBUtBfnRntAAxFFNZfUO/AcTnJNcDjqEeZtvq1aLHrSwx0ElBYOKQKIDYeANy5QxFW VkmnFhpVuD7Lf0fj5EsrCIJJ/R6DJldoozupfB6MmiCaVJnpX4WhRDcq/QlG+kPLFhSy BHVHLmqrC6zPZInm1LlhnTzpBJeJVLTn2IaiM= MIME-Version: 1.0 Received: by 10.102.166.8 with SMTP id o8mr1259343mue.13.1273387795524; Sat, 08 May 2010 23:49:55 -0700 (PDT) Received: by 10.103.39.4 with HTTP; Sat, 8 May 2010 23:49:55 -0700 (PDT) In-Reply-To: References: <82E4F109-B052-4AA7-950A-9B5BCEC928E2@apache.org> Date: Sun, 9 May 2010 09:49:55 +0300 Message-ID: Subject: Re: Incremental Field Updates From: Shai Erera To: dev@lucene.apache.org Content-Type: multipart/alternative; boundary=00163642675736acb9048623b3ed --00163642675736acb9048623b3ed Content-Type: text/plain; charset=ISO-8859-1 No, actually, you can update index-only fields also. It all depends on the operation that you ask to do on the fields. For example, if the query to execute is something like "update all documents w/ tags:ibm -> remove terms t1, t2, t3 and add terms t4, t5", then the result of such request would dissociate t1-3 from those docs that answer tags:ibm and associate them w/ t4 and t5. Specifically, if docs 1, 4, 10 answer tags:ibm, then the following posting updates will be done: t1: -1, -4, -10 t2: -1, -4, -10 t3: -1, -4, -10 t4, 1, 4, 10 t5, 1, 4, 10 (in addition to whatever other updates that are associated with those postings). At search time, if you search for "t1 OR t2", then the regular t1 and t2 postings will be merged on-the-fly w/ the updated ones to remove docs 1, 4, 10. If you want to e.g. remove an entire field w/ such update operation, then it becomes more expensive, but in general you'd need to iterate over the field's terms and dissociate the documents from all the terms. We didn't face such a scenario though, and I think it's probably a rare one? Shai On Sun, May 9, 2010 at 9:39 AM, Babak Farhang wrote: > Shai, > > I think this is an interesting approach. I can see how I could > [incrementally] update a stored, indexed field this way, but I don't > understand how I could update an indexed-only field. Here's why: for a > stored (and indexed) field, I can always determine what terms to > remove ('-') from the postings, but for an indexed-only field I'd have > no [practical] way to know.. > > So under this approach, I'm thinking at a user level, a Lucene field > would be updateable only if it's at least stored. > > Is that right? > > -Babak > > On Wed, May 5, 2010 at 11:17 AM, Shai Erera wrote: > > Yes Mike - I don't know yet if two MPs will be used, one for the stacked > > segments and one for the general segments (which will include the stacked > > ones as well) .. feels like one MP should be enough, but this can be > decided > > on as we progress. > > > > This approach allows you to update every term in every already indexed > > field, as well as add terms to already indexed fields ... and add totally > > new fields, with lots of text in them. So e.g. there are two neat use > cases > > that come to mind: > > 1) Allow users to rate search results, favor them etc. > > 2) Or even comment them, > > I think Google offers the 2nd. Both translate into updating the search > > result's already indexed document w/ the new rating, comment etc. w/o > > needing to reindex the document. > > > > I will try to find perf results numbers. It's been long time ago, hope > all > > the documents are still where they were :). Indeed, for terms like ACLs, > it > > means that each query had to merge dozens of postings lists. For that I > > implemented an alternative solution, which uses a payload-like structure > > that registers for each document the list of ACLs that are associated > with > > it (not as strings, it was more efficient). Then, if the query included > > dozens of such terms, I created a filter-like object which for every > > matching document by the query checked if it matches the ACLs list of the > > document. This is usually slower, because the ACLs themselves don't drive > > the query, which means more matches will be found. That was a tradeoff > which > > one could configure based on the number of such terms in the query, the > > number of updated terms etc. > > > > But in essence you're right - if the solution is generic to cover any > term, > > we cannot use such payload-based feature. We might need to merge the > stacked > > segments more frequently. This is pending perf testing though. > > > > Shai > > > > On Wed, May 5, 2010 at 6:54 PM, Michael McCandless > > wrote: > >> > >> Catching up here :) > >> > >> This is great stuff Shai -- I like the notion of "negative" postings, > >> that "subtract" docs from previous generations as you iterate them. > >> > >> And I like the term "stacked segments". This fits very well with > >> Lucene's write-once approach, ie, a writer can at any time stack a new > >> segment (writes to new files) "over" an old segment, like the layers > >> in photoshop. A reader merges all stacks on-the-fly when reading. > >> > >> And the merge policy now picks from 2 dimensions right? Ie it may > >> want to simply consolidate stacks on an old segment but otherwise not > >> merge that segment (eg for very large segments that have accumulated > >> updates), and normal merging would of course consolidate all stacks > >> for all segments merged. > >> > >> Wouldn't this approach conceivably allow you to alter single terms > >> within a single field (we'd have to figure out how we'd expose the API > >> for this)? EG if I appended some text to an already-indexed field? > >> (In addition to adding a new field to an already indexed doc, or > >> replacing an indexed field on a previously indexed doc). > >> > >> Did you have any hard perf numbers? Merge sorting N streams is > >> surprisingly costly... we may need/want to have a reader pre-merge > >> (using up RAM) any "long tail" of stacked segments as long as they are > >> small enough... > >> > >> This sounds great!! > >> > >> Mike > >> > >> On Sun, Apr 25, 2010 at 7:32 AM, Shai Erera wrote: > >> > Hi, > >> > > >> > WARNING: following email is a bit long, but I think is worth the > reading > >> > :) > >> > > >> > I would like to describe my implementation of incremental field > updates > >> > in Juru (the former search library I've worked on in IBM). I will > later > >> > discuss how I think it can be implemented in Lucene. > >> > > >> > The motivation/requirement came from a doc management system which > used > >> > Juru as its search component. The system included document libraries > >> > where users could create files and upload documents. A user could > belong > >> > to any number of libraries and complex ACLs model was used (down to > the > >> > level of the file). ACLs and Folders were modeled as categories in the > >> > index (boolean-like terms). Files and folders could be moved around > and > >> > access to a library/folder/file could be granted/revoked at any given > >> > time. Therefore, such updates usually affected hundreds (and > thousands) > >> > of documents. Overall, the index managed millions of documents, tens > of > >> > thousands of libraries and hundreds of thousands of ACLs (large > >> > organization :)). To get a rough understanding on the number of such > >> > updates - every several minutes, tens of thousands of documents were > >> > updated due to such changes only (in addition to the regular content > >> > updates). > >> > > >> > We were asked to support requests in the following form: "update all > >> > docs > >> > that match --> do " where: > >> > * was "a single doc", "docs belonging to a category" and > >> > "docs > >> > belonging to a set of categories". > >> > * was "add categories NEW" (NEW might not even exist in > the > >> > index yet, or already associated w/ the document), "remove categories > >> > OLD" > >> > (irregardless if the docs were already associated w/ OLD or not) and > >> > "remove all OLD and add all NEW". > >> > > >> > The solution I've implemented to support these requests turned out to > >> > actually allow you to update every term (!) in the index: suppose that > >> > you have a table, which recorded tuples like . The > >> > record <1, "ibm", '+'> means that doc 1 is associated w/ the term > "ibm", > >> > and the record <1, "hp", '-'> means that the same document is not > >> > associated w/ the word "hp". Then, you could very easily ask for all > >> > documents that are assoicated w/ "hp", and the result would not > include > >> > doc 1. Note that docid=1 is not the app Doc_ID, but the internal id > the > >> > document received. > >> > > >> > I've kept two types of postings in the index: regular and updates. > >> > Taking the above examples, "ibm" regular posting looked like <"ibm", > 1, > >> > 3, 1, 2, 5 ...> (dgaps) and the updates posting looked like <"ibm", > +2, > >> > -3, +6, +10 ...> (absolute docid value, w/ a +/- sign). Similarly for > >> > "hp". > >> > > >> > During search time, when a query with the word "ibm" was submitted, I > >> > create a virtual posting which reads from both the regular and the > >> > updates, and merges them on the fly according to the +/- signs. Since > >> > both postings are sorted in ascending order, the merge is very > >> > efficient, and query time is hardly affected. > >> > > >> > Those postings are merged from time to time in a process that is > similar > >> > to how Lucene works today, which keeps the update postings relatively > >> > small and manageable. > >> > > >> > Now here comes the fun part - how I think it can be implemented in > >> > Lucene ! > >> > > >> > To be honest, this sat on my TODO list for a very long time and only a > >> > couple of months ago I figured out how to implement it in Lucene. The > >> > main difficulty I had was around the difference between the write-once > >> > policy in Juru and Lucene - in Lucene, once a segment is written, it > >> > cannot be changed. BUT, I've only recently realized that this isn't > >> > exactly true, because deleted docs do change existing segments. The > >> > deletes are kept in a separate file to the segment (.del) and have > their > >> > own generation. Deletes, as I understood then, and Grant helped me > term > >> > them better, can be defined as "Stacked Segments" - they add data to a > >> > segment, which from time to time are integrated into the segment > (unlike > >> > Photoshop Layers, but my understanding of Photoshop is limited). And > the > >> > Lucene engine knows how to combine the two, giving precedence to the > >> > deletes. > >> > > >> > By introducing an "Updates Stacked Segment", we can encode postings w/ > >> > the '+'/'-' signs, and when TermDocs/Positions is requested, we can > >> > create a variation which merges the two lists. When segments are > merged, > >> > the updates will be merged into the regular postings (just like > deletes) > >> > and thus will be gone. In addition, this plays very nicely with > readers > >> > that are currently reading the index, as well as we can have > generations > >> > for the updates - really like deletes ! > >> > > >> > I think that Lucene's architecture allows for such a solution very > >> > cleanly and nicely (and I believe flex makes it even easier). We can > >> > (later, after you've digested the idea) discuss whether this should be > >> > built into the current IW, or an extension like UpdateableIW. The API > >> > I've been thinking about should really be like deletes, allowing to > >> > update docs based on Term or Query. I defer the API discussion for > later > >> > for now. > >> > > >> > As for stored fields, this was a real challenge to support in Juru, > but > >> > I think that w/ "Stacked Segments" and Lucene's architecture, this > >> > should > >> > be much easier - adding stacked stored fields ... > >> > > >> > As you've noticed, the update postings are not DGap encoded, and sign > >> > needs to be preserved. While I haven't implemented it in Juru, I think > >> > that perhaps this can be improved by keeping the '-' and '+' lists > >> > separated. We will need to register somewhere which came before which > >> > because order matters a lot here (and I'm not talking about > concurrency > >> > - simple update instructions order). I have some idea how this can be > >> > achieved, but I refrain from describing it now, to not make this email > >> > even longer :). > >> > > >> > I've mentioned that this approach can be applied to any term and not > >> > just categories under some circumstances. Basically, as soon as you > >> > update a term, its DF is no longer true, unless you are able to take > the > >> > updates into account. We can defer the discussion on that, but clearly > >> > for many fields, incrementally update them should not affect > precision, > >> > as they're not used for that type of scoring ... Maybe, by keeping > >> > separate '+' and '-' lists we can compute statistics precisely. And I > >> > haven't given much thought yet to how this and Mike's flex scoring > will > >> > be integrated. > >> > > >> > BTW, a word on Parallel Indexing - the two are completely orthogonal. > >> > Once PI is introduced, one can index all the updateable fields in a > >> > dedicated slice, for perhaps improving search performance for slices > >> > that are not updateable (not involving code which attempts to read and > >> > merge update and regular lists on the fly). Also, incremental field > >> > updates support all of PI's scenarios, even though some will be done > >> > more efficiently w/ PI. But this too is a matter for a separate > >> > discussion :). > >> > > >> > That's it ! I believe I've given you all the details I have about the > >> > approach and high level proposed solution for Lucene. Perhaps some > >> > details slipped my mind, but if you ask the right questions, I'm sure > >> > I'll be able to answer them :). I would like to emphasize that since > >> > this was already implemented (in Juru) - this is more than just a "I > >> > think this approach can work" proposal ... > >> > > >> > I would appreciate your comments on this. I would like to start > >> > implementing it soon, and so as a first step, please share your > comments > >> > on the overall approach. I'll then write a more detailed description > on > >> > how I think to impl it in Lucene (been spending some time on that), > and > >> > we can have more detailed (and fun) discussions on the low level > >> > details. > >> > > >> > Shai > >> > > >> > On Fri, Apr 9, 2010 at 5:05 AM, Babak Farhang > wrote: > >> >> > >> >> Good point. I meant the model at the document level: i.e. what > >> >> milestones does a document go through in its life cycle. Today: > >> >> > >> >> created --> deleted > >> >> > >> >> With incremental updates: > >> >> > >> >> created --> update1 --> update2 --> deleted > >> >> > >> >> I think what I'm trying to say is that this second threaded sequence > >> >> of state changes seems intuitively more fragile under concurrent > >> >> scenarios. So for example, in a lock-free design, the system would > >> >> also have to anticipate the following sequence of events: > >> >> > >> >> created --> update1 --> deleted --> update2 > >> >> > >> >> and consider update2 a null op. I'm imagining there are other cases > >> >> that I can't think of.. > >> >> > >> >> -Babak > >> >> > >> >> > >> >> On Tue, Apr 6, 2010 at 3:40 AM, Michael McCandless > >> >> wrote: > >> >> > write once, plus the option to the app to keep multiple commit > points > >> >> > around (by customizing the deletion policy). > >> >> > > >> >> > Actually order of operations / commits very much matters in Lucene > >> >> > today. > >> >> > > >> >> > Deletions are not idempotent: if you add a doc w/ term X, delete by > >> >> > term X, add a new doc with term X... that's very different than if > >> >> > you > >> >> > moved the delete op to the end. Ie the deletion only applies to > the > >> >> > docs added before it. > >> >> > > >> >> > Mike > >> >> > > >> >> > On Mon, Apr 5, 2010 at 12:45 AM, Babak Farhang > >> >> > wrote: > >> >> >> Sure. Because of the write once principle. But at some cost > >> >> >> (duplicated data). I was just agreeing that it would not be a good > >> >> >> idea to bake in version-ing by keeping the layers around forever > in > >> >> >> a > >> >> >> merged index; I wasn't keying in on transactions per se. > >> >> >> > >> >> >> Speaking of transactions: I'm not sure if we should worry about > this > >> >> >> much yet, but with "updates" the order of the transaction commits > >> >> >> seems important. I think commit order is less important today in > >> >> >> Lucene because its model supports only 2 types of events: document > >> >> >> creation--which only happens once, and document deletion, which is > >> >> >> idempotent. What do you think? Will commits have to be ordered if > >> >> >> we > >> >> >> introduce updates? Or does the onus of maintaining order fall on > >> >> >> the > >> >> >> application? > >> >> >> > >> >> >> -Babak > >> >> >> > >> >> >> On Sat, Apr 3, 2010 at 3:28 AM, Michael McCandless > >> >> >> wrote: > >> >> >>> On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhang > > >> >> >>> wrote: > >> >> >>>>> I think they get merged in by the merger, ideally in the > >> >> >>>>> background. > >> >> >>>> > >> >> >>>> That sounds sensible. (In other words, we wont concern ourselves > >> >> >>>> with > >> >> >>>> roll backs--something possible while a "layer" is still around.) > >> >> >>> > >> >> >>> Actually roll backs would still be very possible even if layers > are > >> >> >>> merged. > >> >> >>> > >> >> >>> Ie, one could keep multiple commits around, and the older commits > >> >> >>> would still be referring to the old postings + layers, keeping > them > >> >> >>> alive. > >> >> >>> > >> >> >>> Lucene would still be transactional with such an approach. > >> >> >>> > >> >> >>> Mike > >> >> >>> > >> >> >>> > >> >> >>> > --------------------------------------------------------------------- > >> >> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org > >> >> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org > >> >> >>> > >> >> >>> > >> >> >> > >> >> >> > >> >> >> > --------------------------------------------------------------------- > >> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org > >> >> >> For additional commands, e-mail: java-dev-help@lucene.apache.org > >> >> >> > >> >> >> > >> >> > > >> >> > > --------------------------------------------------------------------- > >> >> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org > >> >> > For additional commands, e-mail: java-dev-help@lucene.apache.org > >> >> > > >> >> > > >> >> > >> >> --------------------------------------------------------------------- > >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org > >> >> For additional commands, e-mail: java-dev-help@lucene.apache.org > >> >> > >> > > >> > > >> > >> --------------------------------------------------------------------- > >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org > >> For additional commands, e-mail: dev-help@lucene.apache.org > >> > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org > For additional commands, e-mail: dev-help@lucene.apache.org > > --00163642675736acb9048623b3ed Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable
No, actually, you can update index-only fields also. It al= l depends on the operation that you ask to do on the fields. For example, i= f the query to execute is something like "update all documents w/ tags= :ibm -> remove terms t1, t2, t3 and add terms t4, t5", then the res= ult of such request would dissociate t1-3 from those docs that answer tags:= ibm and associate them w/ t4 and t5. Specifically, if docs 1, 4, 10 answer = tags:ibm, then the following posting updates will be done:
t1: -1, -4, -10
t2: -1, -4, -10
t3: -1, -4, -10
t4, 1, 4, 10
t5= , 1, 4, 10
(in addition to whatever other updates that are associated wi= th those postings).

At search time, if you search for "t1 OR t2= ", then the regular t1 and t2 postings will be merged on-the-fly w/ th= e updated ones to remove docs 1, 4, 10.

If you want to e.g. remove an entire field w/ such update operation, th= en it becomes more expensive, but in general you'd need to iterate over= the field's terms and dissociate the documents from all the terms. We = didn't face such a scenario though, and I think it's probably a rar= e one?

Shai

On Sun, May 9, 2010 at 9:39 AM, = Babak Farhang <fa= rhang@gmail.com> wrote:
Shai,

I think this is an interesting approach. I can see how I could
[incrementally] update a stored, indexed field this way, but I don't understand how I could update an indexed-only field. Here's why: for a<= br> stored (and indexed) field, I can always determine what terms to
remove ('-') from the postings, but for an indexed-only field I'= ;d have
no [practical] way to know..

So under this approach, =A0I'm thinking at a user level, a Lucene field=
would be updateable only if it's at least stored.

Is that right?

-Babak

On Wed, May 5, 2010 at 11:17 AM, Shai Erera <serera@gmail.com> wrote:
> Yes Mike - I don't know yet if two MPs will be used, one for the s= tacked
> segments and one for the general segments (which will include the stac= ked
> ones as well) .. feels like one MP should be enough, but this can be d= ecided
> on as we progress.
>
> This approach allows you to update every term in every already indexed=
> field, as well as add terms to already indexed fields ... and add tota= lly
> new fields, with lots of text in them. So e.g. there are two neat use = cases
> that come to mind:
> 1) Allow users to rate search results, favor them etc.
> 2) Or even comment them,
> I think Google offers the 2nd. Both translate into updating the search=
> result's already indexed document w/ the new rating, comment etc. = w/o
> needing to reindex the document.
>
> I will try to find perf results numbers. It's been long time ago, = hope all
> the documents are still where they were :). Indeed, for terms like ACL= s, it
> means that each query had to merge dozens of postings lists. For that = I
> implemented an alternative solution, which uses a payload-like structu= re
> that registers for each document the list of ACLs that are associated = with
> it (not as strings, it was more efficient). Then, if the query include= d
> dozens of such terms, I created a filter-like object which for every > matching document by the query checked if it matches the ACLs list of = the
> document. This is usually slower, because the ACLs themselves don'= t drive
> the query, which means more matches will be found. That was a tradeoff= which
> one could configure based on the number of such terms in the query, th= e
> number of updated terms etc.
>
> But in essence you're right - if the solution is generic to cover = any term,
> we cannot use such payload-based feature. We might need to merge the s= tacked
> segments more frequently. This is pending perf testing though.
>
> Shai
>
> On Wed, May 5, 2010 at 6:54 PM, Michael McCandless
> <lucene@mikemccandless= .com> wrote:
>>
>> Catching up here :)
>>
>> This is great stuff Shai -- I like the notion of "negative&qu= ot; postings,
>> that "subtract" docs from previous generations as you it= erate them.
>>
>> And I like the term "stacked segments". =A0This fits ver= y well with
>> Lucene's write-once approach, ie, a writer can at any time sta= ck a new
>> segment (writes to new files) "over" an old segment, lik= e the layers
>> in photoshop. =A0A reader merges all stacks on-the-fly when readin= g.
>>
>> And the merge policy now picks from 2 dimensions right? =A0Ie it m= ay
>> want to simply consolidate stacks on an old segment but otherwise = not
>> merge that segment (eg for very large segments that have accumulat= ed
>> updates), and normal merging would of course consolidate all stack= s
>> for all segments merged.
>>
>> Wouldn't this approach conceivably allow you to alter single t= erms
>> within a single field (we'd have to figure out how we'd ex= pose the API
>> for this)? =A0EG if I appended some text to an already-indexed fie= ld?
>> (In addition to adding a new field to an already indexed doc, or >> replacing an indexed field on a previously indexed doc).
>>
>> Did you have any hard perf numbers? =A0Merge sorting N streams is<= br> >> surprisingly costly... we may need/want to have a reader pre-merge=
>> (using up RAM) any "long tail" of stacked segments as lo= ng as they are
>> small enough...
>>
>> This sounds great!!
>>
>> Mike
>>
>> On Sun, Apr 25, 2010 at 7:32 AM, Shai Erera <serera@gmail.com> wrote:
>> > Hi,
>> >
>> > WARNING: following email is a bit long, but I think is worth = the reading
>> > :)
>> >
>> > I would like to describe my implementation of incremental fie= ld updates
>> > in Juru (the former search library I've worked on in IBM)= . I will later
>> > discuss how I think it can be implemented in Lucene.
>> >
>> > The motivation/requirement came from a doc management system = which used
>> > Juru as its search component. The system included document li= braries
>> > where users could create files and upload documents. A user c= ould belong
>> > to any number of libraries and complex ACLs model was used (d= own to the
>> > level of the file). ACLs and Folders were modeled as categori= es in the
>> > index (boolean-like terms). Files and folders could be moved = around and
>> > access to a library/folder/file could be granted/revoked at a= ny given
>> > time. Therefore, such updates usually affected hundreds (and = thousands)
>> > of documents. Overall, the index managed millions of document= s, tens of
>> > thousands of libraries and hundreds of thousands of ACLs (lar= ge
>> > organization :)). To get a rough understanding on the number = of such
>> > updates - every several minutes, tens of thousands of documen= ts were
>> > updated due to such changes only (in addition to the regular = content
>> > updates).
>> >
>> > We were asked to support requests in the following form: &quo= t;update all
>> > docs
>> > that match <criteria> --> do <operation>"= where:
>> > * <criteria> was "a single doc", "docs b= elonging to a category" and
>> > "docs
>> > belonging to a set of categories".
>> > * <operation> was "add categories NEW" (NEW m= ight not even exist in the
>> > index yet, or already associated w/ the document), "remo= ve categories
>> > OLD"
>> > (irregardless if the docs were already associated w/ OLD or n= ot) and
>> > "remove all OLD and add all NEW".
>> >
>> > The solution I've implemented to support these requests t= urned out to
>> > actually allow you to update every term (!) in the index: sup= pose that
>> > you have a table, which recorded tuples like <docid, term,= +/->. The
>> > record <1, "ibm", '+'> means that doc= 1 is associated w/ the term "ibm",
>> > and the record <1, "hp", '-'> means t= hat the same document is not
>> > associated w/ the word "hp". Then, you could very e= asily ask for all
>> > documents that are assoicated w/ "hp", and the resu= lt would not include
>> > doc 1. Note that docid=3D1 is not the app Doc_ID, but the int= ernal id the
>> > document received.
>> >
>> > I've kept two types of postings in the index: regular and= updates.
>> > Taking the above examples, "ibm" regular posting lo= oked like <"ibm", 1,
>> > 3, 1, 2, 5 ...> (dgaps) and the updates posting looked lik= e <"ibm", +2,
>> > -3, +6, +10 ...> (absolute docid value, w/ a +/- sign). Si= milarly for
>> > "hp".
>> >
>> > During search time, when a query with the word "ibm"= ; was submitted, I
>> > create a virtual posting which reads from both the regular an= d the
>> > updates, and merges them on the fly according to the +/- sign= s. Since
>> > both postings are sorted in ascending order, the merge is ver= y
>> > efficient, and query time is hardly affected.
>> >
>> > Those postings are merged from time to time in a process that= is similar
>> > to how Lucene works today, which keeps the update postings re= latively
>> > small and manageable.
>> >
>> > Now here comes the fun part - how I think it can be implement= ed in
>> > Lucene !
>> >
>> > To be honest, this sat on my TODO list for a very long time a= nd only a
>> > couple of months ago I figured out how to implement it in Luc= ene. The
>> > main difficulty I had was around the difference between the w= rite-once
>> > policy in Juru and Lucene - in Lucene, once a segment is writ= ten, it
>> > cannot be changed. BUT, I've only recently realized that = this isn't
>> > exactly true, because deleted docs do change existing segment= s. The
>> > deletes are kept in a separate file to the segment (.del) and= have their
>> > own generation. Deletes, as I understood then, and Grant help= ed me term
>> > them better, can be defined as "Stacked Segments" -= they add data to a
>> > segment, which from time to time are integrated into the segm= ent (unlike
>> > Photoshop Layers, but my understanding of Photoshop is limite= d). And the
>> > Lucene engine knows how to combine the two, giving precedence= to the
>> > deletes.
>> >
>> > By introducing an "Updates Stacked Segment", we can= encode postings w/
>> > the '+'/'-' signs, and when TermDocs/Position= s is requested, we can
>> > create a variation which merges the two lists. When segments = are merged,
>> > the updates will be merged into the regular postings (just li= ke deletes)
>> > and thus will be gone. In addition, this plays very nicely wi= th readers
>> > that are currently reading the index, as well as we can have = generations
>> > for the updates - really like deletes !
>> >
>> > I think that Lucene's architecture allows for such a solu= tion very
>> > cleanly and nicely (and I believe flex makes it even easier).= We can
>> > (later, after you've digested the idea) discuss whether t= his should be
>> > built into the current IW, or an extension like UpdateableIW.= The API
>> > I've been thinking about should really be like deletes, a= llowing to
>> > update docs based on Term or Query. I defer the API discussio= n for later
>> > for now.
>> >
>> > As for stored fields, this was a real challenge to support in= Juru, but
>> > I think that w/ "Stacked Segments" and Lucene's= architecture, this
>> > should
>> > be much easier - adding stacked stored fields ...
>> >
>> > As you've noticed, the update postings are not DGap encod= ed, and sign
>> > needs to be preserved. While I haven't implemented it in = Juru, I think
>> > that perhaps this can be improved by keeping the '-' = and '+' lists
>> > separated. We will need to register somewhere which came befo= re which
>> > because order matters a lot here (and I'm not talking abo= ut concurrency
>> > - simple update instructions order). I have some idea how thi= s can be
>> > achieved, but I refrain from describing it now, to not make t= his email
>> > even longer :).
>> >
>> > I've mentioned that this approach can be applied to any t= erm and not
>> > just categories under some circumstances. Basically, as soon = as you
>> > update a term, its DF is no longer true, unless you are able = to take the
>> > updates into account. We can defer the discussion on that, bu= t clearly
>> > for many fields, incrementally update them should not affect = precision,
>> > as they're not used for that type of scoring ... Maybe, b= y keeping
>> > separate '+' and '-' lists we can compute sta= tistics precisely. And I
>> > haven't given much thought yet to how this and Mike's= flex scoring will
>> > be integrated.
>> >
>> > BTW, a word on Parallel Indexing - the two are completely ort= hogonal.
>> > Once PI is introduced, one can index all the updateable field= s in a
>> > dedicated slice, for perhaps improving search performance for= slices
>> > that are not updateable (not involving code which attempts to= read and
>> > merge update and regular lists on the fly). Also, incremental= field
>> > updates support all of PI's scenarios, even though some w= ill be done
>> > more efficiently w/ PI. But this too is a matter for a separa= te
>> > discussion :).
>> >
>> > That's it ! I believe I've given you all the details = I have about the
>> > approach and high level proposed solution for Lucene. Perhaps= some
>> > details slipped my mind, but if you ask the right questions, = I'm sure
>> > I'll be able to answer them :). I would like to emphasize= that since
>> > this was already implemented (in Juru) - this is more than ju= st a "I
>> > think this approach can work" proposal ...
>> >
>> > I would appreciate your comments on this. I would like to sta= rt
>> > implementing it soon, and so as a first step, please share yo= ur comments
>> > on the overall approach. I'll then write a more detailed = description on
>> > how I think to impl it in Lucene (been spending some time on = that), and
>> > we can have more detailed (and fun) discussions on the low le= vel
>> > details.
>> >
>> > Shai
>> >
>> > On Fri, Apr 9, 2010 at 5:05 AM, Babak Farhang <farhang@gmail.com> wrote:
>> >>
>> >> Good point. I meant the model at the document level: i.e.= what
>> >> milestones does a document go through in its life cycle. = Today:
>> >>
>> >> created --> deleted
>> >>
>> >> With incremental updates:
>> >>
>> >> created --> update1 --> update2 --> deleted
>> >>
>> >> I think what I'm trying to say is that this second th= readed sequence
>> >> of state changes seems intuitively more fragile under con= current
>> >> scenarios. =A0So for example, in a lock-free design, the = system would
>> >> also have to anticipate the following sequence of events:=
>> >>
>> >> created --> update1 --> deleted --> update2
>> >>
>> >> and consider update2 a null op. =A0I'm imagining ther= e are other cases
>> >> that I can't think of..
>> >>
>> >> -Babak
>> >>
>> >>
>> >> On Tue, Apr 6, 2010 at 3:40 AM, Michael McCandless
>> >> <lucene@m= ikemccandless.com> wrote:
>> >> > write once, plus the option to the app to keep multi= ple commit points
>> >> > around (by customizing the deletion policy).
>> >> >
>> >> > Actually order of operations / commits very much mat= ters in Lucene
>> >> > today.
>> >> >
>> >> > Deletions are not idempotent: if you add a doc w/ te= rm X, delete by
>> >> > term X, add a new doc with term X... that's very= different than if
>> >> > you
>> >> > moved the delete op to the end. =A0Ie the deletion o= nly applies to the
>> >> > docs added before it.
>> >> >
>> >> > Mike
>> >> >
>> >> > On Mon, Apr 5, 2010 at 12:45 AM, Babak Farhang <<= a href=3D"mailto:farhang@gmail.com">farhang@gmail.com>
>> >> > wrote:
>> >> >> Sure. Because of the write once principle. =A0Bu= t at some cost
>> >> >> (duplicated data). I was just agreeing that it w= ould not be a good
>> >> >> idea to bake in version-ing by keeping the layer= s around forever in
>> >> >> a
>> >> >> merged index; I wasn't keying in on transact= ions per se.
>> >> >>
>> >> >> Speaking of transactions: I'm not sure if we= should worry about this
>> >> >> much yet, but with "updates" the order= of the transaction commits
>> >> >> seems important. I think commit order is less im= portant today in
>> >> >> Lucene because its model supports only 2 types o= f events: document
>> >> >> creation--which only happens once, and document = deletion, which is
>> >> >> idempotent. =A0What do you think? Will commits h= ave to be ordered if
>> >> >> we
>> >> >> introduce updates? =A0Or does the onus of mainta= ining order fall on
>> >> >> the
>> >> >> application?
>> >> >>
>> >> >> -Babak
>> >> >>
>> >> >> On Sat, Apr 3, 2010 at 3:28 AM, Michael McCandle= ss
>> >> >> <lucene@mikemccandless.com> wrote:
>> >> >>> On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhan= g <farhang@gmail.com>
>> >> >>> wrote:
>> >> >>>>> I think they get merged in by the me= rger, ideally in the
>> >> >>>>> background.
>> >> >>>>
>> >> >>>> That sounds sensible. (In other words, w= e wont concern ourselves
>> >> >>>> with
>> >> >>>> roll backs--something possible while a &= quot;layer" is still around.)
>> >> >>>
>> >> >>> Actually roll backs would still be very poss= ible even if layers are
>> >> >>> merged.
>> >> >>>
>> >> >>> Ie, one could keep multiple commits around, = and the older commits
>> >> >>> would still be referring to the old postings= + layers, keeping them
>> >> >>> alive.
>> >> >>>
>> >> >>> Lucene would still be transactional with suc= h an approach.
>> >> >>>
>> >> >>> Mike
>> >> >>>
>> >> >>>
>> >> >>> --------------------------------------------= -------------------------
>> >> >>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.or= g
>> >> >>> For additional commands, e-mail: java-dev-help@lucene.apache.org=
>> >> >>>
>> >> >>>
>> >> >>
>> >> >>
>> >> >> ------------------------------------------------= ---------------------
>> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> >> For additional commands, e-mail:
java-dev-help@lucene.apache.org
>> >> >>
>> >> >>
>> >> >
>> >> > ----------------------------------------------------= -----------------
>> >> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org >> >> > For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >> >
>> >> >
>> >>
>> >> ---------------------------------------------------------= ------------
>> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >>
>> >
>> >
>>
>> ------------------------------------------------------------------= ---
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


--00163642675736acb9048623b3ed--