lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yuval Feinstein <yuv...@answers.com>
Subject RE: Different replicas return different scores
Date Wed, 10 Feb 2010 06:59:01 GMT
Thanks for these directions, Ian.
We are running Lucene 2.9.1 on CentOs 5 64-bit machines.
We do use compound file format, and will look into replacing it with the simple files,
although I believe this will create too many files. 
We will also consider the rsync option.
Thanks again,
-- Yuval

-----Original Message-----
From: Ian Lea [mailto:ian.lea@gmail.com] 
Sent: Tuesday, February 09, 2010 6:13 PM
To: java-user@lucene.apache.org
Subject: Re: Different replicas return different scores

Since the update commands may run in different order on different
shards you might get different sets of segments because merges happen
to be triggered at different points in the different batches of
updates.  But you shouldn't have different numbers of deleted docs if
you have really been applying the same updates to all the shards.
Could some updates have been missed?  Or docs added then deleted or
something?  Maybe there are other variations between the shards and
that is causing the variation in query scores.

As an alternative approach you could have one master index per shard
that takes all the updates and then send that index out to the shard
servers.  If you don't use compound file format, and don't optimize,
the file changes are typically quite small with default or sensible
merge settings and can be distributed quickly using rsync.  You can
have more control by using MergePolicy and friends.

What version of lucene are you running?


--
Ian.


On Tue, Feb 9, 2010 at 2:26 PM, Yuval Feinstein <yuvalf@answers.com> wrote:
> We are running a large sharded Lucene-based application.
> Our configuration supports near real-time updates, by incrementally
> Updating documents (using delete then add) on the shards.
> Every shard is replicated to several machines in order to improve performance.
> We replicate the shard by sending the same deletion and addition commands to all the
replicas,
> Where they may be performed in a different order. (We delete a set of documents, say
1000 at a time,
> Then add them one-by-one semi-asynchronously).
> Lately we have noticed a subtle difference in query scores across different replicas
of the same shard.
> Further investigation showed that the only noticeable difference between the replicas
was the index directory structure:
> 1.      Different replicas have different sets of segments - most segment files are
the same, but some are different.
> 2.      The numbers of deleted documents are different between two replicas of the
same shard.
> Is this a known behavior of Java Lucene?
> How can we change this behavior? We want different replicas returning the exact same
score per query hits.
> (We would rather not optimize the index as we believe this will harm performance.)
>
> TIA,
> Yuval and Ophir
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message