lucene-dev mailing list archives

From "Shalin Shekhar Mangar (JIRA)" <>
Subject [jira] [Commented] (SOLR-5860) Logging around core wait for state during startup / recovery is confusing
Date Thu, 20 Mar 2014 18:14:45 GMT


Shalin Shekhar Mangar commented on SOLR-5860:

I'm seeing some test failures with the patch; I have run it twice already. I have to call it a day,
but if nobody else gets to it first, I'll debug tomorrow and commit.

   [junit4] Tests with failures:
   [junit4]   - org.apache.solr.handler.component.TermVectorComponentDistributedTest.testDistribSearch
   [junit4]   - org.apache.solr.handler.component.DistributedExpandComponentTest.testDistribSearch
   [junit4]   - org.apache.solr.handler.component.DistributedSuggestComponentTest.testDistribSearch
   [junit4]   - org.apache.solr.TestDistributedGrouping.testDistribSearch
   [junit4]   - org.apache.solr.handler.component.DistributedTermsComponentTest.testDistribSearch
   [junit4]   - org.apache.solr.handler.component.DistributedSpellCheckComponentTest.testDistribSearch
   [junit4]   - org.apache.solr.TestDistributedMissingSort.testDistribSearch
   [junit4]   - org.apache.solr.TestDistributedSearch.testDistribSearch
   [junit4]   - org.apache.solr.handler.component.DistributedQueryComponentCustomSortTest.testDistribSearch

> Logging around core wait for state during startup / recovery is confusing
> -------------------------------------------------------------------------
>                 Key: SOLR-5860
>                 URL:
>             Project: Solr
>          Issue Type: Improvement
>          Components: SolrCloud
>            Reporter: Timothy Potter
>            Assignee: Shalin Shekhar Mangar
>            Priority: Minor
>         Attachments: SOLR-5860.patch
> I'm seeing some log messages like this:
> "I was asked to wait on state recovering for HOST:8984_solr but I still do not see the
> requested state. I see state: recovering live:true"
> This is very confusing because, from the log, it seems like it's waiting to see the state
> it's already in. After digging through the code, it appears that it is really waiting for a leader
> to become active so that it has a leader to recover from.
> I'd like to improve the logging around this critical wait loop to give better context
> on what is happening.
> Also, I would like to change the following so that we force state updates every 15 seconds
> for the entire wait period.
> -          if (retry == 15 || retry == 60) {
> +          if (retry % 15 == 0) {
> As-is, it's waiting 120 seconds but only forcing the state to update twice, once after
> 15 seconds and again after 60. It might be good to force updates for the full wait period.
> Lastly, I think it would be good to use the leaderConflictResolveWait setting (from ZkController)
> here as well, since 120 seconds may not be enough for a leader to become active in a busy cluster,
> esp. after the node the Overseer is running on. Maybe leaderConflictResolveWait + 5 seconds?
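
To make the effect of the proposed condition concrete, here is a small standalone sketch (not the actual Solr code) that lists the retry counts at which each condition would force a state update. It assumes one retry per second and a counter that runs from 1 to 120; the class name `ForcedUpdateSchedule` and the helper `forcedRetries` are hypothetical and exist only for this illustration. Note that if the real counter starts at 0, `retry % 15 == 0` would also fire on the very first pass.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class ForcedUpdateSchedule {

    // Hypothetical helper: returns the retry counts in [1, maxRetries]
    // for which the given condition would force a state update.
    public static List<Integer> forcedRetries(int maxRetries, IntPredicate shouldForce) {
        List<Integer> hits = new ArrayList<>();
        for (int retry = 1; retry <= maxRetries; retry++) {
            if (shouldForce.test(retry)) {
                hits.add(retry);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumption: the wait loop retries once per second for 120 seconds.
        int maxRetries = 120;

        // Condition as-is: only two forced updates in the whole wait period.
        System.out.println("as-is:    " + forcedRetries(maxRetries, r -> r == 15 || r == 60));

        // Proposed condition: a forced update every 15th retry.
        System.out.println("proposed: " + forcedRetries(maxRetries, r -> r % 15 == 0));
    }
}
```

Under these assumptions the as-is condition fires at retries 15 and 60 only, while the proposed one fires eight times, at 15, 30, 45 and so on up to 120, which covers the full wait period.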
