lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shai Erera <ser...@gmail.com>
Subject Re: Incremental Field Updates
Date Mon, 10 May 2010 08:05:16 GMT
That's an interesting scenario Mike.

Previously, I only handled boolean-like terms, as the scenarios we were
asked to support involved just those types of terms. Obviously, when the
approach allows for more, more scenarios pop to mind :).

I think we may still be able to resolve that case, but it becomes much more
complicated. My design approach of adding the +/- affected the entire
posting element, whereas the scenario you describe affects the positions of
the posting element. This calls for a more complicated design and solution.

My take on it is that if someone wants to update the catch-all field, then
reindexing the document may not be such a bad idea anyway. The purpose of
those incremental updates is to cope w/ high frequency of updates, which
usually happen on metadata fields, and not title.

But since one could add the 'tags' to the catch-all field as well, it brings
us to the same point - how do I remove the positions of term X that relate
to the tag X and not the potentially original term X that existed in the
document?

This is a very advanced case (and interesting). I don't want to hold up the
discussion on it, but want to make sure we do not deviate from getting the
more simpler cases in first. Depending on the API, this might be very easy
to solve, but might also complicate matters. Maybe, for a
incr-field-updates-v1, we can do without it?

Shai

On Mon, May 10, 2010 at 10:43 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> I think another example would be the catch-all field.
>
> EG say my app concatenates the title, abstract and body of a document
> into the catch-all field.
>
> But now I want to change just the title.
>
> I think in theory (assuming we can work out an intuitive user-level
> API exposure of this...), on changing the title, we could
> incrementally re-index the catch-all field, to remove the old title
> and add the new one.  Right?
>
> But: would this approach work on the positions too?
>
> EG say term X appears 4 times in the doc -- positions 1, 7, 22, 53.
> I've now replaced a part of this field, affecting only this term's
> occurrence at position 53, so that maybe that occurrence is now at 57.
>  Are we able to handle this?
>
> One possible low-level API (I think Doug originally suggested this)
> might be to allow .remove() calls on the enums (like many Java
> collection support from their iterators) -- eg as you are stepping
> through the DocsEnum, you could call .remove() to complete
> disassociate that term from that doc, or in DocsAndPositionsEnum you
> could call .removePosition to remove a specific position.
>
> We would still need a higher level user API... maybe you'd provide a
> "text to remove" and a "text to add" for each field, pre-init'd with
> position/offset for the analyzer (if the field is analyzed)?  I guess
> we'd make this a new attr on Field.  Ie, Field normally contains just
> "text/tokens to add to the index", but we could also include a
> "text/tokens to remove".  Or, each Field instance is marked as "to be
> added" vs "to be removed", and you have to add 2 Field instances to
> subtract then add.
>
> Mike
>
> On Sun, May 9, 2010 at 7:38 AM, Shai Erera <serera@gmail.com> wrote:
> >> When I update a field, I
> >> want to update all of it, not just part of it. No?
> >
> > Well ... might be. But the most common case for us are fields to which we
> > want to add data or remove. For example ACLs - you don't want to replace
> the
> > entire field w/ the document, but simply to add/remove access for certain
> > people/groups. Same goes for "social" fields, like tags, ratings,
> bookmarks
> > etc. - the granularity of the update is to associate/dissociate a
> particular
> > value w/ the field + doc, and not update the entire field.
> >
> > Shai
> >
> > On Sun, May 9, 2010 at 10:31 AM, Babak Farhang <farhang@gmail.com>
> wrote:
> >>
> >> > No, actually, you can update index-only fields also. It all depends on
> >> > the
> >> > operation that you ask to do on the fields.
> >>
> >> I love the level of control this provides, but again, I was talking at
> >> the user level.
> >>
> >> > If you want to e.g. remove an entire field w/ such update operation,
> >> > then it
> >> > becomes more expensive
> >>
> >> That's the typical usage scenario, I imagine. When I update a field, I
> >> want to update all of it, not just part of it. No?
> >>
> >> (The lower level semantics of twiddling with the postings is poorly
> >> understood by the typical user..)
> >>
> >> > We didn't
> >> > face such a scenario though, and I think it's probably a rare one?
> >>
> >> As rare as any time you want to update an indexed-only field.  Not a
> >> serious limitation (but perhaps worth noting?)
> >> Perhaps at the API level, you provide an updateIndexedOnlyField that
> >> takes the old value as well as the new value for the field.
> >>
> >> Anyway, I think your approach would be a great addition to the
> >> toolkit. Would love to see it even in rough cut form :))
> >>
> >> -Babak
> >>
> >> On Sun, May 9, 2010 at 12:49 AM, Shai Erera <serera@gmail.com> wrote:
> >> > No, actually, you can update index-only fields also. It all depends on
> >> > the
> >> > operation that you ask to do on the fields. For example, if the query
> to
> >> > execute is something like "update all documents w/ tags:ibm -> remove
> >> > terms
> >> > t1, t2, t3 and add terms t4, t5", then the result of such request
> would
> >> > dissociate t1-3 from those docs that answer tags:ibm and associate
> them
> >> > w/
> >> > t4 and t5. Specifically, if docs 1, 4, 10 answer tags:ibm, then the
> >> > following posting updates will be done:
> >> > t1: -1, -4, -10
> >> > t2: -1, -4, -10
> >> > t3: -1, -4, -10
> >> > t4, 1, 4, 10
> >> > t5, 1, 4, 10
> >> > (in addition to whatever other updates that are associated with those
> >> > postings).
> >> >
> >> > At search time, if you search for "t1 OR t2", then the regular t1 and
> t2
> >> > postings will be merged on-the-fly w/ the updated ones to remove docs
> 1,
> >> > 4,
> >> > 10.
> >> >
> >> > If you want to e.g. remove an entire field w/ such update operation,
> >> > then it
> >> > becomes more expensive, but in general you'd need to iterate over the
> >> > field's terms and dissociate the documents from all the terms. We
> didn't
> >> > face such a scenario though, and I think it's probably a rare one?
> >> >
> >> > Shai
> >> >
> >> > On Sun, May 9, 2010 at 9:39 AM, Babak Farhang <farhang@gmail.com>
> wrote:
> >> >>
> >> >> Shai,
> >> >>
> >> >> I think this is an interesting approach. I can see how I could
> >> >> [incrementally] update a stored, indexed field this way, but I don't
> >> >> understand how I could update an indexed-only field. Here's why: for
> a
> >> >> stored (and indexed) field, I can always determine what terms to
> >> >> remove ('-') from the postings, but for an indexed-only field I'd
> have
> >> >> no [practical] way to know..
> >> >>
> >> >> So under this approach,  I'm thinking at a user level, a Lucene field
> >> >> would be updateable only if it's at least stored.
> >> >>
> >> >> Is that right?
> >> >>
> >> >> -Babak
> >> >>
> >> >> On Wed, May 5, 2010 at 11:17 AM, Shai Erera <serera@gmail.com>
> wrote:
> >> >> > Yes Mike - I don't know yet if two MPs will be used, one for the
> >> >> > stacked
> >> >> > segments and one for the general segments (which will include
the
> >> >> > stacked
> >> >> > ones as well) .. feels like one MP should be enough, but this
can
> be
> >> >> > decided
> >> >> > on as we progress.
> >> >> >
> >> >> > This approach allows you to update every term in every already
> >> >> > indexed
> >> >> > field, as well as add terms to already indexed fields ... and
add
> >> >> > totally
> >> >> > new fields, with lots of text in them. So e.g. there are two neat
> use
> >> >> > cases
> >> >> > that come to mind:
> >> >> > 1) Allow users to rate search results, favor them etc.
> >> >> > 2) Or even comment them,
> >> >> > I think Google offers the 2nd. Both translate into updating the
> >> >> > search
> >> >> > result's already indexed document w/ the new rating, comment etc.
> w/o
> >> >> > needing to reindex the document.
> >> >> >
> >> >> > I will try to find perf results numbers. It's been long time ago,
> >> >> > hope
> >> >> > all
> >> >> > the documents are still where they were :). Indeed, for terms
like
> >> >> > ACLs,
> >> >> > it
> >> >> > means that each query had to merge dozens of postings lists. For
> that
> >> >> > I
> >> >> > implemented an alternative solution, which uses a payload-like
> >> >> > structure
> >> >> > that registers for each document the list of ACLs that are
> associated
> >> >> > with
> >> >> > it (not as strings, it was more efficient). Then, if the query
> >> >> > included
> >> >> > dozens of such terms, I created a filter-like object which for
> every
> >> >> > matching document by the query checked if it matches the ACLs
list
> of
> >> >> > the
> >> >> > document. This is usually slower, because the ACLs themselves
don't
> >> >> > drive
> >> >> > the query, which means more matches will be found. That was a
> >> >> > tradeoff
> >> >> > which
> >> >> > one could configure based on the number of such terms in the query,
> >> >> > the
> >> >> > number of updated terms etc.
> >> >> >
> >> >> > But in essence you're right - if the solution is generic to cover
> any
> >> >> > term,
> >> >> > we cannot use such payload-based feature. We might need to merge
> the
> >> >> > stacked
> >> >> > segments more frequently. This is pending perf testing though.
> >> >> >
> >> >> > Shai
> >> >> >
> >> >> > On Wed, May 5, 2010 at 6:54 PM, Michael McCandless
> >> >> > <lucene@mikemccandless.com> wrote:
> >> >> >>
> >> >> >> Catching up here :)
> >> >> >>
> >> >> >> This is great stuff Shai -- I like the notion of "negative"
> >> >> >> postings,
> >> >> >> that "subtract" docs from previous generations as you iterate
> them.
> >> >> >>
> >> >> >> And I like the term "stacked segments".  This fits very well
with
> >> >> >> Lucene's write-once approach, ie, a writer can at any time
stack a
> >> >> >> new
> >> >> >> segment (writes to new files) "over" an old segment, like
the
> layers
> >> >> >> in photoshop.  A reader merges all stacks on-the-fly when
reading.
> >> >> >>
> >> >> >> And the merge policy now picks from 2 dimensions right?  Ie
it may
> >> >> >> want to simply consolidate stacks on an old segment but otherwise
> >> >> >> not
> >> >> >> merge that segment (eg for very large segments that have
> accumulated
> >> >> >> updates), and normal merging would of course consolidate all
> stacks
> >> >> >> for all segments merged.
> >> >> >>
> >> >> >> Wouldn't this approach conceivably allow you to alter single
terms
> >> >> >> within a single field (we'd have to figure out how we'd expose
the
> >> >> >> API
> >> >> >> for this)?  EG if I appended some text to an already-indexed
> field?
> >> >> >> (In addition to adding a new field to an already indexed doc,
or
> >> >> >> replacing an indexed field on a previously indexed doc).
> >> >> >>
> >> >> >> Did you have any hard perf numbers?  Merge sorting N streams
is
> >> >> >> surprisingly costly... we may need/want to have a reader pre-merge
> >> >> >> (using up RAM) any "long tail" of stacked segments as long
as they
> >> >> >> are
> >> >> >> small enough...
> >> >> >>
> >> >> >> This sounds great!!
> >> >> >>
> >> >> >> Mike
> >> >> >>
> >> >> >> On Sun, Apr 25, 2010 at 7:32 AM, Shai Erera <serera@gmail.com>
> >> >> >> wrote:
> >> >> >> > Hi,
> >> >> >> >
> >> >> >> > WARNING: following email is a bit long, but I think is
worth the
> >> >> >> > reading
> >> >> >> > :)
> >> >> >> >
> >> >> >> > I would like to describe my implementation of incremental
field
> >> >> >> > updates
> >> >> >> > in Juru (the former search library I've worked on in
IBM). I
> will
> >> >> >> > later
> >> >> >> > discuss how I think it can be implemented in Lucene.
> >> >> >> >
> >> >> >> > The motivation/requirement came from a doc management
system
> which
> >> >> >> > used
> >> >> >> > Juru as its search component. The system included document
> >> >> >> > libraries
> >> >> >> > where users could create files and upload documents.
A user
> could
> >> >> >> > belong
> >> >> >> > to any number of libraries and complex ACLs model was
used (down
> >> >> >> > to
> >> >> >> > the
> >> >> >> > level of the file). ACLs and Folders were modeled as
categories
> in
> >> >> >> > the
> >> >> >> > index (boolean-like terms). Files and folders could be
moved
> >> >> >> > around
> >> >> >> > and
> >> >> >> > access to a library/folder/file could be granted/revoked
at any
> >> >> >> > given
> >> >> >> > time. Therefore, such updates usually affected hundreds
(and
> >> >> >> > thousands)
> >> >> >> > of documents. Overall, the index managed millions of
documents,
> >> >> >> > tens
> >> >> >> > of
> >> >> >> > thousands of libraries and hundreds of thousands of ACLs
(large
> >> >> >> > organization :)). To get a rough understanding on the
number of
> >> >> >> > such
> >> >> >> > updates - every several minutes, tens of thousands of
documents
> >> >> >> > were
> >> >> >> > updated due to such changes only (in addition to the
regular
> >> >> >> > content
> >> >> >> > updates).
> >> >> >> >
> >> >> >> > We were asked to support requests in the following form:
"update
> >> >> >> > all
> >> >> >> > docs
> >> >> >> > that match <criteria> --> do <operation>"
where:
> >> >> >> > * <criteria> was "a single doc", "docs belonging
to a category"
> >> >> >> > and
> >> >> >> > "docs
> >> >> >> > belonging to a set of categories".
> >> >> >> > * <operation> was "add categories NEW" (NEW might
not even exist
> >> >> >> > in
> >> >> >> > the
> >> >> >> > index yet, or already associated w/ the document), "remove
> >> >> >> > categories
> >> >> >> > OLD"
> >> >> >> > (irregardless if the docs were already associated w/
OLD or not)
> >> >> >> > and
> >> >> >> > "remove all OLD and add all NEW".
> >> >> >> >
> >> >> >> > The solution I've implemented to support these requests
turned
> out
> >> >> >> > to
> >> >> >> > actually allow you to update every term (!) in the index:
> suppose
> >> >> >> > that
> >> >> >> > you have a table, which recorded tuples like <docid,
term, +/->.
> >> >> >> > The
> >> >> >> > record <1, "ibm", '+'> means that doc 1 is associated
w/ the
> term
> >> >> >> > "ibm",
> >> >> >> > and the record <1, "hp", '-'> means that the same
document is
> not
> >> >> >> > associated w/ the word "hp". Then, you could very easily
ask for
> >> >> >> > all
> >> >> >> > documents that are assoicated w/ "hp", and the result
would not
> >> >> >> > include
> >> >> >> > doc 1. Note that docid=1 is not the app Doc_ID, but the
internal
> >> >> >> > id
> >> >> >> > the
> >> >> >> > document received.
> >> >> >> >
> >> >> >> > I've kept two types of postings in the index: regular
and
> updates.
> >> >> >> > Taking the above examples, "ibm" regular posting looked
like
> >> >> >> > <"ibm",
> >> >> >> > 1,
> >> >> >> > 3, 1, 2, 5 ...> (dgaps) and the updates posting looked
like
> >> >> >> > <"ibm",
> >> >> >> > +2,
> >> >> >> > -3, +6, +10 ...> (absolute docid value, w/ a +/- sign).
> Similarly
> >> >> >> > for
> >> >> >> > "hp".
> >> >> >> >
> >> >> >> > During search time, when a query with the word "ibm"
was
> >> >> >> > submitted, I
> >> >> >> > create a virtual posting which reads from both the regular
and
> the
> >> >> >> > updates, and merges them on the fly according to the
+/- signs.
> >> >> >> > Since
> >> >> >> > both postings are sorted in ascending order, the merge
is very
> >> >> >> > efficient, and query time is hardly affected.
> >> >> >> >
> >> >> >> > Those postings are merged from time to time in a process
that is
> >> >> >> > similar
> >> >> >> > to how Lucene works today, which keeps the update postings
> >> >> >> > relatively
> >> >> >> > small and manageable.
> >> >> >> >
> >> >> >> > Now here comes the fun part - how I think it can be implemented
> in
> >> >> >> > Lucene !
> >> >> >> >
> >> >> >> > To be honest, this sat on my TODO list for a very long
time and
> >> >> >> > only
> >> >> >> > a
> >> >> >> > couple of months ago I figured out how to implement it
in
> Lucene.
> >> >> >> > The
> >> >> >> > main difficulty I had was around the difference between
the
> >> >> >> > write-once
> >> >> >> > policy in Juru and Lucene - in Lucene, once a segment
is
> written,
> >> >> >> > it
> >> >> >> > cannot be changed. BUT, I've only recently realized that
this
> >> >> >> > isn't
> >> >> >> > exactly true, because deleted docs do change existing
segments.
> >> >> >> > The
> >> >> >> > deletes are kept in a separate file to the segment (.del)
and
> have
> >> >> >> > their
> >> >> >> > own generation. Deletes, as I understood then, and Grant
helped
> me
> >> >> >> > term
> >> >> >> > them better, can be defined as "Stacked Segments" - they
add
> data
> >> >> >> > to
> >> >> >> > a
> >> >> >> > segment, which from time to time are integrated into
the segment
> >> >> >> > (unlike
> >> >> >> > Photoshop Layers, but my understanding of Photoshop is
limited).
> >> >> >> > And
> >> >> >> > the
> >> >> >> > Lucene engine knows how to combine the two, giving precedence
to
> >> >> >> > the
> >> >> >> > deletes.
> >> >> >> >
> >> >> >> > By introducing an "Updates Stacked Segment", we can encode
> >> >> >> > postings
> >> >> >> > w/
> >> >> >> > the '+'/'-' signs, and when TermDocs/Positions is requested,
we
> >> >> >> > can
> >> >> >> > create a variation which merges the two lists. When segments
are
> >> >> >> > merged,
> >> >> >> > the updates will be merged into the regular postings
(just like
> >> >> >> > deletes)
> >> >> >> > and thus will be gone. In addition, this plays very nicely
with
> >> >> >> > readers
> >> >> >> > that are currently reading the index, as well as we can
have
> >> >> >> > generations
> >> >> >> > for the updates - really like deletes !
> >> >> >> >
> >> >> >> > I think that Lucene's architecture allows for such a
solution
> very
> >> >> >> > cleanly and nicely (and I believe flex makes it even
easier). We
> >> >> >> > can
> >> >> >> > (later, after you've digested the idea) discuss whether
this
> >> >> >> > should
> >> >> >> > be
> >> >> >> > built into the current IW, or an extension like UpdateableIW.
> The
> >> >> >> > API
> >> >> >> > I've been thinking about should really be like deletes,
allowing
> >> >> >> > to
> >> >> >> > update docs based on Term or Query. I defer the API discussion
> for
> >> >> >> > later
> >> >> >> > for now.
> >> >> >> >
> >> >> >> > As for stored fields, this was a real challenge to support
in
> >> >> >> > Juru,
> >> >> >> > but
> >> >> >> > I think that w/ "Stacked Segments" and Lucene's architecture,
> this
> >> >> >> > should
> >> >> >> > be much easier - adding stacked stored fields ...
> >> >> >> >
> >> >> >> > As you've noticed, the update postings are not DGap encoded,
and
> >> >> >> > sign
> >> >> >> > needs to be preserved. While I haven't implemented it
in Juru, I
> >> >> >> > think
> >> >> >> > that perhaps this can be improved by keeping the '-'
and '+'
> lists
> >> >> >> > separated. We will need to register somewhere which came
before
> >> >> >> > which
> >> >> >> > because order matters a lot here (and I'm not talking
about
> >> >> >> > concurrency
> >> >> >> > - simple update instructions order). I have some idea
how this
> can
> >> >> >> > be
> >> >> >> > achieved, but I refrain from describing it now, to not
make this
> >> >> >> > email
> >> >> >> > even longer :).
> >> >> >> >
> >> >> >> > I've mentioned that this approach can be applied to any
term and
> >> >> >> > not
> >> >> >> > just categories under some circumstances. Basically,
as soon as
> >> >> >> > you
> >> >> >> > update a term, its DF is no longer true, unless you are
able to
> >> >> >> > take
> >> >> >> > the
> >> >> >> > updates into account. We can defer the discussion on
that, but
> >> >> >> > clearly
> >> >> >> > for many fields, incrementally update them should not
affect
> >> >> >> > precision,
> >> >> >> > as they're not used for that type of scoring ... Maybe,
by
> keeping
> >> >> >> > separate '+' and '-' lists we can compute statistics
precisely.
> >> >> >> > And I
> >> >> >> > haven't given much thought yet to how this and Mike's
flex
> scoring
> >> >> >> > will
> >> >> >> > be integrated.
> >> >> >> >
> >> >> >> > BTW, a word on Parallel Indexing - the two are completely
> >> >> >> > orthogonal.
> >> >> >> > Once PI is introduced, one can index all the updateable
fields
> in
> >> >> >> > a
> >> >> >> > dedicated slice, for perhaps improving search performance
for
> >> >> >> > slices
> >> >> >> > that are not updateable (not involving code which attempts
to
> read
> >> >> >> > and
> >> >> >> > merge update and regular lists on the fly). Also, incremental
> >> >> >> > field
> >> >> >> > updates support all of PI's scenarios, even though some
will be
> >> >> >> > done
> >> >> >> > more efficiently w/ PI. But this too is a matter for
a separate
> >> >> >> > discussion :).
> >> >> >> >
> >> >> >> > That's it ! I believe I've given you all the details
I have
> about
> >> >> >> > the
> >> >> >> > approach and high level proposed solution for Lucene.
Perhaps
> some
> >> >> >> > details slipped my mind, but if you ask the right questions,
I'm
> >> >> >> > sure
> >> >> >> > I'll be able to answer them :). I would like to emphasize
that
> >> >> >> > since
> >> >> >> > this was already implemented (in Juru) - this is more
than just
> a
> >> >> >> > "I
> >> >> >> > think this approach can work" proposal ...
> >> >> >> >
> >> >> >> > I would appreciate your comments on this. I would like
to start
> >> >> >> > implementing it soon, and so as a first step, please
share your
> >> >> >> > comments
> >> >> >> > on the overall approach. I'll then write a more detailed
> >> >> >> > description
> >> >> >> > on
> >> >> >> > how I think to impl it in Lucene (been spending some
time on
> >> >> >> > that),
> >> >> >> > and
> >> >> >> > we can have more detailed (and fun) discussions on the
low level
> >> >> >> > details.
> >> >> >> >
> >> >> >> > Shai
> >> >> >> >
> >> >> >> > On Fri, Apr 9, 2010 at 5:05 AM, Babak Farhang <
> farhang@gmail.com>
> >> >> >> > wrote:
> >> >> >> >>
> >> >> >> >> Good point. I meant the model at the document level:
i.e. what
> >> >> >> >> milestones does a document go through in its life
cycle. Today:
> >> >> >> >>
> >> >> >> >> created --> deleted
> >> >> >> >>
> >> >> >> >> With incremental updates:
> >> >> >> >>
> >> >> >> >> created --> update1 --> update2 --> deleted
> >> >> >> >>
> >> >> >> >> I think what I'm trying to say is that this second
threaded
> >> >> >> >> sequence
> >> >> >> >> of state changes seems intuitively more fragile under
> concurrent
> >> >> >> >> scenarios.  So for example, in a lock-free design,
the system
> >> >> >> >> would
> >> >> >> >> also have to anticipate the following sequence of
events:
> >> >> >> >>
> >> >> >> >> created --> update1 --> deleted --> update2
> >> >> >> >>
> >> >> >> >> and consider update2 a null op.  I'm imagining there
are other
> >> >> >> >> cases
> >> >> >> >> that I can't think of..
> >> >> >> >>
> >> >> >> >> -Babak
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On Tue, Apr 6, 2010 at 3:40 AM, Michael McCandless
> >> >> >> >> <lucene@mikemccandless.com> wrote:
> >> >> >> >> > write once, plus the option to the app to keep
multiple
> commit
> >> >> >> >> > points
> >> >> >> >> > around (by customizing the deletion policy).
> >> >> >> >> >
> >> >> >> >> > Actually order of operations / commits very
much matters in
> >> >> >> >> > Lucene
> >> >> >> >> > today.
> >> >> >> >> >
> >> >> >> >> > Deletions are not idempotent: if you add a doc
w/ term X,
> >> >> >> >> > delete
> >> >> >> >> > by
> >> >> >> >> > term X, add a new doc with term X... that's
very different
> than
> >> >> >> >> > if
> >> >> >> >> > you
> >> >> >> >> > moved the delete op to the end.  Ie the deletion
only applies
> >> >> >> >> > to
> >> >> >> >> > the
> >> >> >> >> > docs added before it.
> >> >> >> >> >
> >> >> >> >> > Mike
> >> >> >> >> >
> >> >> >> >> > On Mon, Apr 5, 2010 at 12:45 AM, Babak Farhang
> >> >> >> >> > <farhang@gmail.com>
> >> >> >> >> > wrote:
> >> >> >> >> >> Sure. Because of the write once principle.
 But at some cost
> >> >> >> >> >> (duplicated data). I was just agreeing that
it would not be
> a
> >> >> >> >> >> good
> >> >> >> >> >> idea to bake in version-ing by keeping the
layers around
> >> >> >> >> >> forever
> >> >> >> >> >> in
> >> >> >> >> >> a
> >> >> >> >> >> merged index; I wasn't keying in on transactions
per se.
> >> >> >> >> >>
> >> >> >> >> >> Speaking of transactions: I'm not sure if
we should worry
> >> >> >> >> >> about
> >> >> >> >> >> this
> >> >> >> >> >> much yet, but with "updates" the order of
the transaction
> >> >> >> >> >> commits
> >> >> >> >> >> seems important. I think commit order is
less important
> today
> >> >> >> >> >> in
> >> >> >> >> >> Lucene because its model supports only 2
types of events:
> >> >> >> >> >> document
> >> >> >> >> >> creation--which only happens once, and document
deletion,
> >> >> >> >> >> which
> >> >> >> >> >> is
> >> >> >> >> >> idempotent.  What do you think? Will commits
have to be
> >> >> >> >> >> ordered
> >> >> >> >> >> if
> >> >> >> >> >> we
> >> >> >> >> >> introduce updates?  Or does the onus of
maintaining order
> fall
> >> >> >> >> >> on
> >> >> >> >> >> the
> >> >> >> >> >> application?
> >> >> >> >> >>
> >> >> >> >> >> -Babak
> >> >> >> >> >>
> >> >> >> >> >> On Sat, Apr 3, 2010 at 3:28 AM, Michael
McCandless
> >> >> >> >> >> <lucene@mikemccandless.com> wrote:
> >> >> >> >> >>> On Sat, Apr 3, 2010 at 1:25 AM, Babak
Farhang
> >> >> >> >> >>> <farhang@gmail.com>
> >> >> >> >> >>> wrote:
> >> >> >> >> >>>>> I think they get merged in by
the merger, ideally in the
> >> >> >> >> >>>>> background.
> >> >> >> >> >>>>
> >> >> >> >> >>>> That sounds sensible. (In other
words, we wont concern
> >> >> >> >> >>>> ourselves
> >> >> >> >> >>>> with
> >> >> >> >> >>>> roll backs--something possible while
a "layer" is still
> >> >> >> >> >>>> around.)
> >> >> >> >> >>>
> >> >> >> >> >>> Actually roll backs would still be very
possible even if
> >> >> >> >> >>> layers
> >> >> >> >> >>> are
> >> >> >> >> >>> merged.
> >> >> >> >> >>>
> >> >> >> >> >>> Ie, one could keep multiple commits
around, and the older
> >> >> >> >> >>> commits
> >> >> >> >> >>> would still be referring to the old
postings + layers,
> >> >> >> >> >>> keeping
> >> >> >> >> >>> them
> >> >> >> >> >>> alive.
> >> >> >> >> >>>
> >> >> >> >> >>> Lucene would still be transactional
with such an approach.
> >> >> >> >> >>>
> >> >> >> >> >>> Mike
> >> >> >> >> >>>
> >> >> >> >> >>>
> >> >> >> >> >>>
> >> >> >> >> >>>
> >> >> >> >> >>>
> ---------------------------------------------------------------------
> >> >> >> >> >>> To unsubscribe, e-mail:
> >> >> >> >> >>> java-dev-unsubscribe@lucene.apache.org
> >> >> >> >> >>> For additional commands, e-mail:
> >> >> >> >> >>> java-dev-help@lucene.apache.org
> >> >> >> >> >>>
> >> >> >> >> >>>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> ---------------------------------------------------------------------
> >> >> >> >> >> To unsubscribe, e-mail:
> java-dev-unsubscribe@lucene.apache.org
> >> >> >> >> >> For additional commands, e-mail:
> >> >> >> >> >> java-dev-help@lucene.apache.org
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >
> ---------------------------------------------------------------------
> >> >> >> >> > To unsubscribe, e-mail:
> java-dev-unsubscribe@lucene.apache.org
> >> >> >> >> > For additional commands, e-mail:
> >> >> >> >> > java-dev-help@lucene.apache.org
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> ---------------------------------------------------------------------
> >> >> >> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> >> >> >> For additional commands, e-mail:
> java-dev-help@lucene.apache.org
> >> >> >> >>
> >> >> >> >
> >> >> >> >
> >> >> >>
> >> >> >>
> >> >> >>
> ---------------------------------------------------------------------
> >> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >> >> >>
> >> >> >
> >> >> >
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >> >>
> >> >
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Mime
View raw message