lucene-dev mailing list archives

From "Mike Drob (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-10006) Cannot do a full sync (fetchindex) if the replica can't open a searcher
Date Tue, 14 Feb 2017 14:39:41 GMT

    [ https://issues.apache.org/jira/browse/SOLR-10006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865865#comment-15865865 ]

Mike Drob commented on SOLR-10006:
----------------------------------

bq. The take-away here is that the solr core must be restarted so there is never an open searcher
on that core, perhaps your stress test isn't doing that?
Guilty.

bq. reloading the core from the admin UI silently fails with a .doc file removed. By that
I mean the UI doesn't show any problems even though the log file has exceptions.
This might be best as a separate issue. I don't feel nearly comfortable enough with the UI
to even begin to attempt to fix this.

bq. The core admin API correctly reports an error for action=RELOAD though (curl or the like)
Good.

bq. the admin UI still thinks the replica is active.
bq. a search on the replica with distrib=false also succeeds, even when I set a very large
start parameter, but I suspect this is a function of there still being an open file handle on
the file I deleted so it's "kinda there" until restart.
I'm not sure this is wrong, based on your next points. If everything is in memory, and the
core can serve requests, then from the system perspective it _is_ active. It's either the
phantom file handle or everything is sitting in a cache.
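
For anyone repeating these checks from a script rather than the admin UI, here is a minimal
sketch that only builds the two request URLs discussed above: a core admin RELOAD and a
non-distributed query with a large start value. The host, port and core name are made-up
placeholders, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Hypothetical values for illustration only; they do not come from the issue.
SOLR_BASE = "http://localhost:8983/solr"
CORE = "test_shard1_replica1"


def reload_url(base, core):
    """Core admin request that reloads a single core (action=RELOAD)."""
    return base + "/admin/cores?" + urlencode({"action": "RELOAD", "core": core})


def local_query_url(base, core, start=0, rows=10):
    """Query answered by this core only (distrib=false), paged with start/rows."""
    params = {"q": "*:*", "distrib": "false", "start": start, "rows": rows}
    return base + "/" + core + "/select?" + urlencode(params)


if __name__ == "__main__":
    print(reload_url(SOLR_BASE, CORE))
    print(local_query_url(SOLR_BASE, CORE, start=1000000))
```

Fetching the first URL with curl should surface the reload error that the UI hides; the
second is the kind of query that kept succeeding while the deleted file was still held open.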

bq. At this point (the searcher is working even though the doc file is missing), a fetchindex
doesn't think there's any work to do so "succeeds", i.e. it doesn't fetch from the masterUrl
Maybe we need a {{force=true}} option here? I'm not sure there is another way to do a robust
check that wouldn't be incredibly slow. Maybe fetchindex is a rare enough command that it's
OK to be slow?
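
For reference, a rough sketch of what a manual fetchindex request looks like today, built
as a URL against the replication handler. The host and core names are placeholders I made
up. The {{force=true}} parameter is only a suggestion in this comment, not an existing
option, so it is deliberately left out.

```python
from urllib.parse import urlencode


def fetchindex_url(replica_core_url, master_core_url):
    """Build a replication-handler fetchindex request for one replica core.

    Both arguments are core base URLs such as http://host:8983/solr/core_name
    (hypothetical values; substitute your own).
    """
    params = {
        "command": "fetchindex",
        # masterUrl points at the replication handler of the core to copy from.
        "masterUrl": master_core_url.rstrip("/") + "/replication",
    }
    return replica_core_url.rstrip("/") + "/replication?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder hosts and core names, for illustration only.
    print(fetchindex_url(
        "http://replica-host:8983/solr/test_shard1_replica2",
        "http://leader-host:8983/solr/test_shard1_replica1",
    ))
```

The function only assembles the URL, so it is safe to run anywhere; pass the result to curl
or similar to actually trigger the fetch.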

> Cannot do a full sync (fetchindex) if the replica can't open a searcher
> -----------------------------------------------------------------------
>
>                 Key: SOLR-10006
>                 URL: https://issues.apache.org/jira/browse/SOLR-10006
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 5.3.1, 6.4
>            Reporter: Erick Erickson
>         Attachments: SOLR-10006.patch, SOLR-10006.patch, solr.log, solr.log
>
>
> Doing a full sync or fetchindex requires an open searcher and if you can't open the searcher
those operations fail.
> For discussion. I've seen a situation in the field where a replica's index became corrupt.
When the node was restarted, the replica tried to do a full sync but failed because the core
couldn't open a searcher. The replica went into an endless sync/fail/sync cycle.
> I couldn't reproduce that exact scenario, but it's easy enough to get into a similar
situation. Create a 2x2 collection and index some docs. Then stop one of the instances and
go in and remove a couple of segments files and restart.
> The replica stays in the "down" state, fine so far.
> Manually issue a fetchindex. That fails because the replica can't open a searcher. Sure,
issuing a fetchindex is abusive.... but I think it's the same underlying issue: why should
we care about the state of a replica's current index when we're going to completely replace
it anyway?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

