lucene-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-4629) Stronger standard replication testing.
Date Fri, 12 Apr 2013 01:59:16 GMT

     [ https://issues.apache.org/jira/browse/SOLR-4629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-4629:
---------------------------

    Attachment: SOLR-4629_emptycommittest_and_numfoundrefactor_and_waitparam.patch


After adding a lot of debug logs, walking through the results of lots of failed tests
compared to successful tests, and a lot of vigorous, physical consultation between my forehead
and my desk, i think i've finally tracked down the cause of all the "expected 2 got 3" failures
from checkForSingleIndex.

The problem in a nutshell is one of concurrency. When the test thread makes a request to the
master or to the slave, those requests are handled by a jetty thread which (via SolrDispatchFilter)
creates a SolrQueryRequest, which has a searcher ref, which has a Directory ref.  When the
request is done, the SolrQueryRequest is closed, which releases the searcher ref, which releases
the directory ref -- but by the time this happens, the response has already been returned
to the "client" (the test thread), and the test thread may enter checkForSingleIndex to acquire
the lock on the CachingDirectoryFactory (to check the list of cached paths) before the resources
from a previous request have been completely released -- so the test fails because a Directory
from an old request still hasn't been released.

Example...

{noformat}
Time   Test-Thread              Jetty-Thread-N
0      http request->jetty
1                               accept http request
2                               create solr query request
3                               incref searcher, incref dir
4                               process solr query request
5                               test thread<-write http response
6      process response
7      ...    
8      assert(2=num dirs)
9                               decref searcher, decref dir, release dir
{noformat}
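
To make that interleaving concrete, here is a minimal, deterministic sketch of it. Everything in it is hypothetical (RefCountedDir and ResponseBeforeReleaseDemo are stand-ins invented for illustration, not Solr or Jetty classes), and the latches only exist to force the unlucky ordering that the real test hits by chance.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a ref-counted Directory; not Solr's actual class.
class RefCountedDir {
    private final AtomicInteger refs = new AtomicInteger();
    void incref() { refs.incrementAndGet(); }
    void decref() { refs.decrementAndGet(); }
    int refCount() { return refs.get(); }
}

class ResponseBeforeReleaseDemo {
    // Returns {ref count seen by the test thread, ref count after the request is closed}.
    static int[] run() {
        RefCountedDir dir = new RefCountedDir();
        CountDownLatch responseWritten = new CountDownLatch(1);
        CountDownLatch testThreadLooked = new CountDownLatch(1);

        Thread jetty = new Thread(() -> {
            dir.incref();                  // time 3: request takes its refs
            responseWritten.countDown();   // time 5: response goes back to the client
            try {
                testThreadLooked.await();  // force the assert to run before the release
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            dir.decref();                  // time 9: request closed, refs released
        });
        jetty.start();

        try {
            responseWritten.await();       // time 6: test thread has its response
            int seen = dir.refCount();     // time 8: the check runs too early
            testThreadLooked.countDown();
            jetty.join();
            return new int[] { seen, dir.refCount() };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch the test thread still sees one outstanding ref after it already has its response, and the count only drops to zero once the simulated jetty thread finishes closing the request.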

I think the key change is to modify checkForSingleIndex so that instead of asserting exactly
2 paths in the cache, we assert that there are only 2 paths that are not "done" -- allowing
for the possibility of other paths still being tracked because of requests still being closed.


The attached patch makes this change -- there are still some nocommits (in particular i completely
commented out the replication core reloading to rule that out as a possible cause, but there's
also some excessively absurd logging) but even if you ignore all that, after replacing "CachingDirectoryFactory.getPaths()"
with "CachingDirectoryFactory.getLivePaths()" I have yet to see "expected:<2> but was:<3>"
in any test run.  If you tweak that method to eliminate the "!val.doneWithDir" dir check,
you should start seeing the failures come back.
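
For readers without the patch handy, the idea behind the "live paths" check can be sketched roughly as below. This is an illustration under assumptions, not the patch itself: PathCacheSketch and its CacheValue are simplified, hypothetical stand-ins, and the only names borrowed from the discussion above are getPaths(), getLivePaths() and the doneWithDir flag. The sketch assumes doneWithDir is set once the owner is finished with a directory, even if a straggling reference keeps it in the cache a little longer.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified cache; the real CachingDirectoryFactory differs.
class PathCacheSketch {
    static class CacheValue {
        final String path;
        // Assumption: true once the owner is finished with this directory,
        // even though an in-flight request may still hold a reference.
        boolean doneWithDir;
        CacheValue(String path) { this.path = path; }
    }

    private final Map<String, CacheValue> byPath = new HashMap<>();

    synchronized void add(String path) {
        byPath.put(path, new CacheValue(path));
    }

    synchronized void markDone(String path) {
        CacheValue val = byPath.get(path);
        if (val != null) {
            val.doneWithDir = true;
        }
    }

    // Every tracked path, including ones that are only waiting to be released.
    synchronized Set<String> getPaths() {
        return new HashSet<>(byPath.keySet());
    }

    // Only the paths that nobody has declared finished.
    synchronized Set<String> getLivePaths() {
        Set<String> live = new HashSet<>();
        for (CacheValue val : byPath.values()) {
            if (!val.doneWithDir) {
                live.add(val.path);
            }
        }
        return live;
    }
}
```

With three tracked paths of which one has been marked done, getPaths() in this sketch reports three entries while getLivePaths() reports two, which is the distinction the modified assertion relies on.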

I'll clean the patch up more tomorrow and run some more exhaustive tests to be sure i haven't
broken anything, but i wanted to post what i had in case i got hit by a bus (and to ensure
[~markrmiller@gmail.com] doesn't see any flaw with my "getLivePaths()" change before i get
too happy about it)


                
> Stronger standard replication testing.
> --------------------------------------
>
>                 Key: SOLR-4629
>                 URL: https://issues.apache.org/jira/browse/SOLR-4629
>             Project: Solr
>          Issue Type: Test
>          Components: replication (java)
>            Reporter: Mark Miller
>            Assignee: Mark Miller
>             Fix For: 4.3, 5.0, 4.2.1
>
>         Attachments: SOLR-4629_emptycommittest_and_numfoundrefactor_and_waitparam.patch,
>                      SOLR-4629_emptycommittest_and_numfoundrefactor_and_waitparam.patch,
>                      SOLR-4629_emptycommittest_and_numfoundrefactor_and_waitparam.patch,
>                      SOLR-4629_emptycommittest_and_numfoundrefactor_and_waitparam.patch
>
>
> I added to these tests recently, but there is a report on the list indicating we may
> still be missing something. Most reports have been positive so far after the 4.2 fixes, but
> I'd feel better after adding some more testing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

