lucene-dev mailing list archives

From <karl.wri...@nokia.com>
Subject RE: Solr updateRequestHandler and performance vs. atomicity
Date Tue, 25 May 2010 09:26:59 GMT
Hi Simon,

I think you are on the right track.

I believe it is not even possible to write a middleware-style layer that stores documents
and performs periodic commits on its own, because the update request handler never ACKs individual
documents on a commit; it merely commits everything it has seen since the last time Solr bounced.
So you have this potential scenario:

- middleware layer receives document 1, saves it
- middleware layer receives document 2, saves it
Now it's time for the commit, so:
- middleware layer sends document 1 to updateRequestHandler
- solr is restarted, dropping all uncommitted documents on the floor
- middleware layer sends document 2 to updateRequestHandler
- middleware layer sends COMMIT to updateRequestHandler, but solr adds only document 2 to
the index
- middleware believes incorrectly that it has successfully committed both documents
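
The scenario above can be replayed with a minimal Python sketch. FakeSolr is a
hypothetical stand-in invented for illustration, not the real update handler or any
Solr client API; it only assumes the behaviour described in this message: uncommitted
documents are lost on restart, and a commit reports success without saying which
documents it covered.

```python
class FakeSolr:
    """Toy model (an assumption, not real Solr): commit ACKs no individual documents."""

    def __init__(self):
        self.pending = []  # received but uncommitted; lost on restart
        self.index = []    # committed documents

    def add(self, doc):
        self.pending.append(doc)

    def commit(self):
        # Commits whatever has been seen since the last restart.
        self.index.extend(self.pending)
        self.pending.clear()
        return "OK"  # no per-document acknowledgement

    def restart(self):
        # Uncommitted documents are dropped on the floor.
        self.pending.clear()


def replay_scenario():
    solr = FakeSolr()
    saved = ["document 1", "document 2"]  # what the middleware stored

    solr.add(saved[0])
    solr.restart()      # happens between the two sends
    solr.add(saved[1])
    ack = solr.commit()

    # The middleware can only infer success from the commit response.
    believed = list(saved) if ack == "OK" else []
    return believed, list(solr.index)


believed, actual = replay_scenario()
print("middleware believes committed:", believed)
print("actually in the index:        ", actual)
```

In this toy model the two lists differ by exactly the document sent before the
restart, and nothing in the commit response lets the middleware detect that.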

If I were any kind of mathematician, I suspect I could even prove that the current API has
this inherent race condition built into its semantics.

I never claimed this was going to be easy :-).  But it does seem to be valuable, perhaps critically
so.

Karl

________________________________________
From: ext Simon Willnauer [simon.willnauer@googlemail.com]
Sent: Monday, May 24, 2010 4:29 PM
To: dev@lucene.apache.org
Subject: Re: Solr updateRequestHandler and performance vs. atomicity

Hi Karl,

What you are describing seems to be a good use case for something like
a message queue, where you push a document or record to a queue that
guarantees persistence. I look at this from a slightly different
perspective: in a distributed environment you would have to guarantee
delivery not to a single Solr instance but to several, or at least n,
instances, but that is a different story.

From a Solr point of view this sounds like a need for a write-ahead
log that guarantees durability and atomicity. I like this idea as it
might also solve lots of problems in distributed environments (solr
cloud) etc.
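
As a rough illustration of the write-ahead log idea, here is a minimal Python sketch.
It is not Solr or Lucene code; the WriteAheadLog class, the one-JSON-record-per-line
file format, and the demo function are all assumptions made up for this example. The
point it shows is that a document is acknowledged only after it has been fsynced to
the log, so it can be replayed after a restart even if it was never committed to the
index.

```python
import json
import os
import tempfile


class WriteAheadLog:
    """Hypothetical append-only log: one JSON document per line."""

    def __init__(self, path):
        self.path = path
        self._fh = open(path, "a", encoding="utf-8")

    def append(self, doc):
        # Write the record, then force it to disk before acknowledging.
        self._fh.write(json.dumps(doc) + "\n")
        self._fh.flush()
        os.fsync(self._fh.fileno())

    def replay(self):
        # Return every logged document, e.g. for re-indexing after a restart.
        # This sketch does not handle a torn (partially written) last line.
        with open(self.path, "r", encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]

    def truncate(self):
        # Call only after the index commit has succeeded.
        self._fh.close()
        self._fh = open(self.path, "w", encoding="utf-8")
        os.fsync(self._fh.fileno())

    def close(self):
        self._fh.close()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "updates.wal")

        log = WriteAheadLog(path)
        log.append({"id": "1", "text": "first"})
        log.append({"id": "2", "text": "second"})
        log.close()  # simulate a restart before any commit

        reopened = WriteAheadLog(path)
        recovered = reopened.replay()
        reopened.truncate()  # pretend the replayed docs were committed
        remaining = reopened.replay()
        reopened.close()
        return recovered, remaining


recovered, remaining = demo()
print("recovered after restart:", recovered)
print("left in log after commit:", remaining)
```

The fsync on every append is where the cost discussed elsewhere in this thread shows
up; batching several documents per fsync would trade acknowledgement latency for
throughput.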

Very interesting topic - should investigate more in this direction....


simon


On Mon, May 24, 2010 at 10:03 PM,  <karl.wright@nokia.com> wrote:
> Hi Mark,
>
> Unfortunately, indexing performance *is* of concern, otherwise I'd already
> be committing on every post.
>
> If your guess is correct, you are basically saying that adding a document
> to an index in Solr/Lucene is just as fast as writing that file directly
> to the disk.  Because, obviously, if we want guaranteed delivery, that's
> what we'd have to do.  But I think this is worth the experiment -
> Solr/Lucene may be fast, but I have doubts that it can perform as well as
> raw disk I/O and still manage to do anything in the way of document
> analysis or (heaven forbid) text extraction.
>
>
>
> -----Original Message-----
> From: ext Mark Miller [mailto:markrmiller@gmail.com]
> Sent: Monday, May 24, 2010 3:33 PM
> To: dev@lucene.apache.org
> Subject: Re: Solr updateRequestHandler and performance vs. atomicity
>
> On 5/24/10 3:10 PM, karl.wright@nokia.com wrote:
>> Hi all,
>> It seems to me that the "commit" logic in the Solr updateRequestHandler
>> (or wherever the logic is actually located) conflates two different
>> semantics. One semantic is what you need to do to make the index process
>> perform well. The other semantic is guaranteed atomicity of document
>> reception by Solr.
>> In particular, it would be nice to be able to post documents in such a
>> way that you can guarantee that the document is permanently in Solr's
>> queue, safe in the event of a Solr restart, etc., even if the document
>> has not yet been "committed".
>> This issue came up in the LCF talk that I gave, and I initially thought
>> that separating the two kinds of events would necessarily be an LCF
>> change, but the more I thought about it the more I realized that other
>> Solr indexing clients may also benefit from such a separation.
>> Does anyone agree? Where should this logic properly live?
>> Thanks,
>> Karl
>
> It's an interesting idea - but I think you would likely pay a similar
> cost to guarantee reception as you would to commit (also, I'm not sure
> Lucene guarantees it - it works for consistency, but I'm not so sure it
> achieves durability).
>
> I can think of two things offhand -
>
> Perhaps store the text and use fsync to quasi-guarantee acceptance -
> then index from the store on the commit.
>
> Another, simpler idea, if only the separation is important and not the
> performance - index to another side index, taking advantage of Lucene's
> current commit functionality, and then use addIndexes to merge to the main
> index on commit.
>
> Just spitballing though.
>
> I think this would obviously need to be an optional mode.
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
