lucene-solr-user mailing list archives

From Tanguy Moal
Subject Re: Bulk indexing, UpdateProcessor overwriteDupes and poor IO performances
Date Wed, 01 Jun 2011 13:23:34 GMT

Thank you very much for your answer.

Using the signature field as the uniqueKey is indeed what I was 
doing, so the "overwriteDupes=true" parameter in my solrconfig was 
somewhat redundant, although I wasn't aware of it! =D

In practice it works perfectly and that's the nice part.

By the way, I wonder what happens when execution reaches the following 
code snippet, from DirectUpdateHandler2.addDoc(AddUpdateCommand), while 
the id field is the same as the signature field:
>       if (del) { // ensure id remains unique
>         BooleanQuery bq = new BooleanQuery();
>         bq.add(new BooleanClause(new TermQuery(updateTerm), Occur.MUST_NOT));
>         bq.add(new BooleanClause(new TermQuery(idTerm), Occur.MUST));
>         writer.deleteDocuments(bq);
>       }
Maybe all my problems started here...
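
To make the question concrete, here is a minimal plain-Java sketch of the matching rule that the quoted query expresses: a document is deleted only if it matches idTerm and does not match updateTerm. This is not Solr or Lucene code; DedupDeleteSketch, FieldTerm and wouldDelete are hypothetical names, and documents are modelled as simple field-to-value maps. It assumes that when the signature field is also the uniqueKey, idTerm and updateTerm are the same term, in which case the two clauses contradict each other and the query can match nothing.

```java
import java.util.HashMap;
import java.util.Map;

public class DedupDeleteSketch {

    /** Hypothetical stand-in for a Lucene Term: one field name and one value. */
    public static final class FieldTerm {
        final String field;
        final String value;

        public FieldTerm(String field, String value) {
            this.field = field;
            this.value = value;
        }

        boolean matches(Map<String, String> doc) {
            return value.equals(doc.get(field));
        }
    }

    /** MUST idTerm and MUST_NOT updateTerm, as in the quoted BooleanQuery. */
    public static boolean wouldDelete(Map<String, String> doc,
                                      FieldTerm idTerm, FieldTerm updateTerm) {
        return idTerm.matches(doc) && !updateTerm.matches(doc);
    }

    public static void main(String[] args) {
        Map<String, String> stored = new HashMap<>();
        stored.put("id", "A");
        stored.put("sig", "S0");

        // Separate fields: same id but a different signature, so the
        // stored document is selected for deletion. Prints true.
        System.out.println(wouldDelete(stored,
                new FieldTerm("id", "A"), new FieldTerm("sig", "S1")));

        // Same term on both sides (signature field == uniqueKey, assumed):
        // no document can satisfy both clauses. Prints false.
        FieldTerm same = new FieldTerm("id", "A");
        System.out.println(wouldDelete(stored, same, same));
    }
}
```

The sketch only covers which documents the query selects. It says nothing about what it costs the index writer to apply such a delete-by-query across segments, which is the part that would need to be measured.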

When I have some time, I'll try to reproduce this using a different 
uniqueKey field with overwriteDupes turned back on, to see whether the 
problem comes from the signature field being the same as the uniqueKey 
field *and* overwriteDupes being on. If so, maybe a simple 
configuration check should be performed to avoid the issue. Otherwise, it 
means that having overwriteDupes turned on simply doesn't scale, and that 
should be added to the wiki's Deduplication page, IMHO.
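
The configuration check mentioned above could be as small as the following sketch. It is plain Java with hypothetical names (DedupConfigCheck, isRedundantOverwrite), not an existing Solr API; the three parameters simply mirror the schema's uniqueKey and the signatureField and overwriteDupes settings of the dedup processor.

```java
public class DedupConfigCheck {

    /**
     * Returns true when overwriteDupes is enabled although the signature
     * field is already the uniqueKey, i.e. the combination discussed above.
     */
    public static boolean isRedundantOverwrite(String uniqueKeyField,
                                               String signatureField,
                                               boolean overwriteDupes) {
        return overwriteDupes
                && uniqueKeyField != null
                && uniqueKeyField.equals(signatureField);
    }

    public static void main(String[] args) {
        // Example values only; a real check would read them from the
        // schema and from the update processor configuration.
        if (isRedundantOverwrite("id", "id", true)) {
            System.out.println("Warning: overwriteDupes is on, but the "
                    + "signature field is already the uniqueKey.");
        }
    }
}
```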

Thank you again.


On 31/05/2011 14:58, lee carroll wrote:
> Tanguy
> You might have tried this already, but can you set overwriteDupes to
> false and set the signature field to be the id? That way Solr
> will manage updates.
> From the wiki:
> <!-- An example dedup update processor that creates the "id" field on the fly
>      based on the hash code of some other fields.  This example has
>      overwriteDupes set to false since we are using the id field as the
>      signatureField and Solr will maintain uniqueness based on that anyway. -->
> Lee
> On 30 May 2011 08:32, Tanguy Moal wrote:
>> Hello,
>> Sorry for re-posting this, but it seems my message got lost in the mailing
>> list's message stream without catching anyone's attention... =D
>> In short, has anyone already experienced dramatic indexing slowdowns during
>> large bulk imports with overwriteDupes turned on and a fairly high duplicate
>> rate (around 4-8x)?
>> It seems to produce a lot of deletions, which in turn appear to make segment
>> merging pretty slow, by noticeably increasing the number of small read
>> operations occurring simultaneously with the regular large write operations
>> of the merge. Added to the poor IO performance of a commodity SATA drive,
>> indexing takes ages.
>> I temporarily bypassed that limitation by disabling the overwriting of
>> duplicates, but that changes the way I query the index, requiring me to turn
>> on field collapsing at search time.
>> Is this a known limitation?
>> Does anyone have a few hints on how to optimize the handling of index-time
>> deduplication?
>> More details on my setup and the state of my understanding are in my previous
>> message.
>> Thank you very much in advance.
>> Regards,
>> Tanguy
>> On 05/25/11 15:35, Tanguy Moal wrote:
>>> Dear list,
>>> I'm posting here after some unsuccessful investigations.
>>> In my setup I push documents to Solr using the StreamingUpdateSolrServer.
>>> I'm sending a comfortable initial amount of documents (~250M) and wished to
>>> overwrite duplicated documents at index time, during the update, taking
>>> advantage of the UpdateProcessorChain.
>>> At the beginning of the indexing stage, everything is quite fast; documents
>>> arrive at a rate of about 1000 docs/s.
>>> The only extra processing during the import is the computation of a couple
>>> of hashes that are used to uniquely identify documents given their content,
>>> using both stock (MD5Signature) and custom (derived from Lookup3Signature)
>>> update processors.
>>> I send a commit command to the server every 500k documents sent.
>>> During a first period, the server is CPU bound. After a short while (~10
>>> minutes), the rate at which documents are received starts to fall
>>> dramatically, the server being IO bound.
>>> At first I thought this was a normal speed decrease during the commit, while
>>> my push client waits for the flush to occur. That would have been a normal
>>> slowdown.
>>> What caught my attention was that, unexpectedly, the server was performing
>>> a lot of small reads, far more than the number of writes, which seem to be
>>> larger.
>>> The combination of the many small reads with the constant amount of bigger
>>> writes seems to be creating a lot of IO contention on my commodity SATA
>>> drive, and the ETA of my index build started to increase scarily =D
>>> I then restarted the JVM with JMX enabled so I could investigate a little
>>> bit more. I then realized that the UpdateHandler was performing many reads
>>> while processing the update request.
>>> Are there any known limitations around the UpdateProcessorChain when
>>> overwriteDupes is set to true?
>>> I turned that off, which of course breaks the intent of my index, but for
>>> comparison purposes it's good.
>>> That did the trick: indexing is fast again, even with the periodic commits.
>>> I therefore have two questions, an interesting first one and a boring
>>> second one:
>>> 1 / What's the workflow of the UpdateProcessorChain when one or more
>>> processors have overwriting of duplicates turned on? What happens under the
>>> hood?
>>> I tried to answer that myself by looking at DirectUpdateHandler2, and my
>>> understanding stopped at the following:
>>> - The document is added to the Lucene IW
>>> - The duplicates are deleted from the Lucene IW
>>> The dark magic I couldn't understand seems to occur around the idTerm and
>>> updateTerm things, in the addDoc method. The deletions seem to be buffered
>>> somewhere, I just didn't get it :-)
>>> I might be wrong since I didn't read the code any further, but the key point
>>> might be how Solr handles deletions, which is still unclear to me.
>>> In any case, a lot of reads seem to occur for that precise task, and it
>>> tends to produce a lot of IO, killing indexing performance when
>>> overwriteDupes is on. I don't even understand why so many read operations
>>> occur at this stage, since my process has a comfortable amount of RAM
>>> (Xms=Xmx=8GB), with only 4.5GB used so far.
>>> Any help, recommendation or idea is welcome :-)
>>> 2 / In case there isn't a simple fix for this, I'll have to live with
>>> duplicates in my index. I don't mind, since Solr offers a great grouping
>>> feature, which I already use in some other applications. The only thing I
>>> don't know yet is this: if I rely on grouping at search time, in combination
>>> with the Stats component (which is the intent of that index), and limit the
>>> results to 1 document per group, will the computed statistics take those
>>> duplicates into account or not? In short, how well does the Stats component
>>> behave when combined with hits collapsing?
>>> I had first implemented my solution using overwriteDupes because it would
>>> have reduced both the target size of my index and the complexity of the
>>> queries used to obtain statistics on the search results, at the same time.
>>> Thank you very much in advance.
>>> --
>>> Tanguy
>> --
>> Tanguy

