lucene-dev mailing list archives

From "Mark Miller (JIRA)" <>
Subject [jira] [Commented] (SOLR-3473) Distributed deduplication broken
Date Mon, 21 May 2012 17:00:42 GMT


Mark Miller commented on SOLR-3473:

bq. To work around the problem of having the digest field as ID, could it not simply issue
a deleteByQuery for the digest prior to adding it? Would that cause significant overhead for
very large systems with many updates?

Yeah, that might be an option - I don't know that it will be great perf-wise, or airtight
race-wise, but it may be a viable option.
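
A rough sketch of that workaround, to make the trade-off concrete. This is a toy in-memory index, not Solr or SolrJ code: ToyIndex, digest_of and add_deduplicated are hypothetical names, and MD5 is only an assumed digest function. In a real cluster the delete would presumably also have to reach every shard.

```python
import hashlib


class ToyIndex:
    """In-memory stand-in for an index keyed by a unique id (hypothetical)."""

    def __init__(self):
        self.docs = {}  # id -> document dict

    def delete_by_query(self, field, value):
        # Remove every document whose `field` equals `value`.
        doomed = [doc_id for doc_id, doc in self.docs.items() if doc.get(field) == value]
        for doc_id in doomed:
            del self.docs[doc_id]
        return len(doomed)

    def add(self, doc):
        # Adding a document with an existing id overwrites it.
        self.docs[doc["id"]] = doc


def digest_of(content):
    # Assumed digest function; any stable content hash would do here.
    return hashlib.md5(content.encode("utf-8")).hexdigest()


def add_deduplicated(index, url, content):
    doc = {"id": url, "content": content, "digest": digest_of(content)}
    # Step 1: drop any document that already carries this digest.
    index.delete_by_query("digest", doc["digest"])
    # Step 2: add the new document. The two steps are not atomic, so two
    # concurrent writers with the same digest could both survive.
    index.add(doc)
    return doc


index = ToyIndex()
add_deduplicated(index, "http://example.com/a", "same body")
add_deduplicated(index, "http://example.com/b", "same body")
add_deduplicated(index, "http://example.com/c", "other body")
print(sorted(index.docs))  # ['http://example.com/b', 'http://example.com/c']
```

The id stays the URL, and the later of the two identical pages replaces the earlier one. The cost is an extra delete per add, and the gap between the two steps is where the race would sit.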

bq. We would, from Nutch's point of view, certainly want to avoid changing the ID from URL
to digest.

Ah, interesting. If you are enforcing uniqueness by digest though, is this really a problem?
It would only have to be in the Solr world that the id was the digest - and you could even
call it something else and have an id:url field as well. Just thinking out loud.

Or, perhaps we could make it so you could pick the hash field? Then hash on digest. If you
are using overwrite=true, this should work, right?
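
To illustrate why hashing on the digest might help, here is a toy model, not Solr code. ToyCluster, stable_hash and make_doc are hypothetical names, the shard count of 4 is arbitrary, and each shard is assumed to overwrite duplicates only among its own documents.

```python
import hashlib


def stable_hash(value):
    # Deterministic hash so every node picks the same shard for the same value.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class ToyCluster:
    """Hypothetical sharded index; each shard deduplicates only locally."""

    def __init__(self, hash_field, num_shards=4):
        self.hash_field = hash_field
        self.shards = [{} for _ in range(num_shards)]

    def shard_for(self, doc):
        return stable_hash(doc[self.hash_field]) % len(self.shards)

    def add(self, doc):
        shard = self.shards[self.shard_for(doc)]
        # Local overwrite: drop documents on this shard with the same digest.
        for doc_id in [i for i, d in shard.items() if d["digest"] == doc["digest"]]:
            del shard[doc_id]
        shard[doc["id"]] = doc

    def count(self):
        return sum(len(shard) for shard in self.shards)


def make_doc(url, content):
    digest = hashlib.md5(content.encode("utf-8")).hexdigest()
    return {"id": url, "content": content, "digest": digest}


# Fifty distinct URLs that all carry the same content.
docs = [make_doc("http://example.com/page%d" % n, "same body") for n in range(50)]

by_id = ToyCluster(hash_field="id")
by_digest = ToyCluster(hash_field="digest")
for doc in docs:
    by_id.add(doc)
    by_digest.add(doc)

print("survivors when hashing on id:", by_id.count())
print("survivors when hashing on digest:", by_digest.count())
```

When the hash field is the digest, all duplicates route to one shard, so the local overwrite leaves a single survivor. When it is the id, duplicates can be spread over several shards and up to one copy per shard may remain.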

Or perhaps someone else has some ideas...

> Distributed deduplication broken
> --------------------------------
>                 Key: SOLR-3473
>                 URL:
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud, update
>    Affects Versions: 4.0
>            Reporter: Markus Jelsma
>             Fix For: 4.0
> Solr's deduplication via the SignatureUpdateProcessor is broken for distributed updates
> on SolrCloud.
> Mark Miller:
> {quote}
> Looking again at the SignatureUpdateProcessor code, I think that indeed this won't currently
> work with distrib updates. Could you file a JIRA issue for that? The problem is that we convert
> update commands into solr documents - and that can cause a loss of info if an update proc
> modifies the update command.
> I think the reason that you see a multiple values error when you try the other order
> is because of the lack of a document clone (the other issue I mentioned a few emails back).
> Addressing that won't solve your issue though - we have to come up with a way to propagate
> the currently lost info on the update command.
> {quote}
> Please see the ML thread for the full discussion:
