lucene-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-3473) Distributed deduplication broken
Date Mon, 21 May 2012 18:57:41 GMT

    [ https://issues.apache.org/jira/browse/SOLR-3473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280358#comment-13280358 ]

Hoss Man commented on SOLR-3473:
--------------------------------

i'm not entirely sure i'm understanding the problems. here's what i think i understand...

1) if you put dedup prior to distrib, then regardless of how it is configured it currently
runs twice, which is bad - this seems like it is solved by SOLR-2822

2) if you want to use dedup to generate a sig for the uniqueKey field, then it really *has*
to come before distrib, otherwise forwarding to the leader just won't work. (again: SOLR-2822
should make this do-able)

3) if you want to use dedup to generate a sig field that is *not* the uniqueKey field, *AND*
you want to use "overwriteDupes=true" then (currently) this needs to happen _after_ distrib,
because otherwise the info about the deletion -- tracked in AddUpdateCommand.updateTerm --
is lost when distrib does the forward.  This seems like something that the distrib processor
should deal with by ensuring it serializes/deserializes all of the key information in the
AddUpdateCommand when sending/receiving a TOLEADER/FROMLEADER request (using SOLR-2822
vernacular)

3a) it's not enough to ensure that the "updateTerm" is forwarded to all the replicas in the
shard, because other docs in other shards may have the same term value for the hash. (hence
Markus's suggestions about doing a deleteByQuery -- this should be done in the distrib update
processor when AddUpdateCommand.updateTerm is non-null)

4) something about document cloning ... i still don't really understand this -- not just in
terms of dedup, but in general i don't really understand why SOLR-3215 is an issue assuming
we fix SOLR-2822.
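
To make points 3 and 3a concrete, here is a minimal toy sketch in Java. It is not Solr code: AddCommand, Shard, forwardDocOnly, forwardWithTerm and apply are hypothetical stand-ins, and the "sig" field name is an assumption. It only models the idea that a forward which keeps just the document drops the dedup term, and that honoring the term means deleting matching docs on every shard, not only the one that owns the new doc.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: none of these types are Solr classes. */
final class DedupForwardSketch {

    /** Hypothetical stand-in for an add command: a doc plus an optional dedup term. */
    static final class AddCommand {
        final Map<String, String> doc;
        final String termField; // null when no dedup term was set
        final String termValue;

        AddCommand(Map<String, String> doc, String termField, String termValue) {
            this.doc = doc;
            this.termField = termField;
            this.termValue = termValue;
        }
    }

    /** Hypothetical stand-in for one shard: documents keyed by their "id" field. */
    static final class Shard {
        final Map<String, Map<String, String>> docsById = new HashMap<>();

        void add(Map<String, String> doc) {
            docsById.put(doc.get("id"), doc);
        }

        void deleteByTerm(String field, String value) {
            docsById.values().removeIf(d -> value.equals(d.get(field)));
        }
    }

    /** Lossy forward: only the document survives, the dedup term is dropped. */
    static AddCommand forwardDocOnly(AddCommand cmd) {
        return new AddCommand(new HashMap<>(cmd.doc), null, null);
    }

    /** Forward that carries the dedup term along with the document. */
    static AddCommand forwardWithTerm(AddCommand cmd) {
        return new AddCommand(new HashMap<>(cmd.doc), cmd.termField, cmd.termValue);
    }

    /** If a term is present, delete matches on every shard, then add to the owning shard. */
    static void apply(AddCommand cmd, Shard owner, List<Shard> allShards) {
        if (cmd.termField != null) {
            for (Shard s : allShards) {
                s.deleteByTerm(cmd.termField, cmd.termValue);
            }
        }
        owner.add(cmd.doc);
    }

    static Map<String, String> doc(String id, String sig) {
        Map<String, String> d = new HashMap<>();
        d.put("id", id);
        d.put("sig", sig);
        return d;
    }

    /** Shard A already holds a doc with sig=abc; a new doc with the same sig goes to shard B. */
    static int run(boolean keepTerm) {
        Shard a = new Shard();
        Shard b = new Shard();
        List<Shard> all = new ArrayList<>();
        all.add(a);
        all.add(b);
        a.add(doc("1", "abc"));

        AddCommand cmd = new AddCommand(doc("2", "abc"), "sig", "abc");
        AddCommand forwarded = keepTerm ? forwardWithTerm(cmd) : forwardDocOnly(cmd);
        apply(forwarded, b, all);
        return a.docsById.size() + b.docsById.size();
    }

    public static void main(String[] args) {
        System.out.println("docs after lossy forward: " + run(false));          // 2
        System.out.println("docs after term-preserving forward: " + run(true)); // 1
    }
}
```

In this toy, the lossy forward leaves both docs with the same sig in the index, while the term-preserving forward combined with the all-shards delete leaves only the new one. How a real fix would serialize the term or issue the delete is not shown here.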
                
> Distributed deduplication broken
> --------------------------------
>
>                 Key: SOLR-3473
>                 URL: https://issues.apache.org/jira/browse/SOLR-3473
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud, update
>    Affects Versions: 4.0
>            Reporter: Markus Jelsma
>             Fix For: 4.0
>
>
> Solr's deduplication via the SignatureUpdateProcessor is broken for distributed updates
> on SolrCloud.
> Mark Miller:
> {quote}
> Looking again at the SignatureUpdateProcessor code, I think that indeed this won't currently
> work with distrib updates. Could you file a JIRA issue for that? The problem is that we convert
> update commands into solr documents - and that can cause a loss of info if an update proc
> modifies the update command.
> I think the reason that you see a multiple values error when you try the other order
> is because of the lack of a document clone (the other issue I mentioned a few emails back).
> Addressing that won't solve your issue though - we have to come up with a way to propagate
> the currently lost info on the update command.
> {quote}
> Please see the ML thread for the full discussion: http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html
