couchdb-dev mailing list archives

From Adam Kocoloski <kocol...@apache.org>
Subject Re: replicator test hanging
Date Thu, 10 Jun 2010 17:27:17 GMT
Thanks Paul!  Good sleuthing.  We'll get it fixed,

Adam

On Jun 10, 2010, at 11:43 AM, Paul Bonser wrote:

> Ok, so I've tracked it down to the specific location where it happens
> 
> - couch_rep_reader:spawn_document_request/2 is called
> - in the SpawnFun defined in there, it calls couch_rep_reader:open_doc
> - open_doc gets an error, not_found response (not sure why, shouldn't the
> doc be there already?)
> - open_doc returns [] back to the SpawnFun
> - SpawnFun calls gen_server:call(Server, {add_docs, nil, Results}... with
> Results being []
> - handle_call(add_docs) calls handle_add_docs, which increments the document
> count..by 0..
> - and then returns {noreply,...}
> - then everything just sits there, because each part is waiting for another
> part to do something
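
A toy model of the last three steps may help. It is a sketch, not the CouchDB source: the module name noreply_hang and the integer state are assumptions made for illustration. handle_call returns {noreply, ...} and nothing ever calls gen_server:reply/2, so the caller stays blocked. The demo uses a 500 ms timeout only to make the stall observable; with infinity it would wait forever.

```erlang
-module(noreply_hang).
-behaviour(gen_server).

-export([demo/0]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Start the toy server, send one add_docs call with an empty result
%% list (as when open_doc returns []), and report what the caller sees.
demo() ->
    {ok, Server} = gen_server:start(?MODULE, 0, []),
    Outcome = try
        gen_server:call(Server, {add_docs, nil, []}, 500)
    catch
        %% gen_server:call/3 exits with {timeout, _} if no reply arrives.
        exit:{timeout, _} -> timed_out
    end,
    gen_server:stop(Server),
    Outcome.

init(Count) ->
    {ok, Count}.

%% The count grows by length(Docs), which is 0 for []. From is dropped
%% and no reply is ever sent, so the calling process keeps waiting.
handle_call({add_docs, _Seq, Docs}, _From, Count) ->
    {noreply, Count + length(Docs)}.

handle_cast(_Msg, Count) ->
    {noreply, Count}.
```

Compiling the module and calling noreply_hang:demo() should return timed_out, which is the bounded version of the hang described above.
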
> 
> It seems the solution here is to either add a retry into
> spawn_document_request's SpawnFun, or at the very least, fail when open_doc
> returns [], rather than continuing on, since that results in a set of
> deadlocked processes.
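
Both options suggested above can be sketched together. This is an illustration under assumptions, not a patch: OpenFun is a hypothetical stand-in for the open_doc call (a zero-arity fun returning a list of docs, [] when the doc is missing), and the module name, retry count, delay, and exit reason are all invented here.

```erlang
-module(open_doc_retry).
-export([fetch/2, demo/0]).

%% Call OpenFun, retrying while it returns [] and retries remain.
%% If the result is still [] after the last retry, exit with an
%% explicit reason instead of passing [] on to the server.
fetch(OpenFun, RetriesLeft) ->
    case OpenFun() of
        [] when RetriesLeft > 0 ->
            timer:sleep(100),                 % arbitrary back-off
            fetch(OpenFun, RetriesLeft - 1);
        [] ->
            exit({open_doc_failed, not_found});
        Docs ->
            Docs
    end.

%% Exercise the failure path with a fun that never finds the doc.
demo() ->
    AlwaysMissing = fun() -> [] end,
    try fetch(AlwaysMissing, 2) of
        Docs -> {ok, Docs}
    catch
        exit:Reason -> {failed, Reason}
    end.
```

Here open_doc_retry:demo() returns {failed, {open_doc_failed, not_found}}. The point of the design is that a missing doc shows up as a visible process exit rather than as a quiet add_docs call with an empty list.
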
> 
> On Thu, Jun 10, 2010 at 9:28 AM, Paul Bonser <misterpib@gmail.com> wrote:
> 
>> Nope, just a regular 7200RPM SATA drive.
>> 
>> So you guys may already know this, but I've tracked it down to a couch_rep
>> gen_server never terminating, and thus not calling do_terminate, and thus
>> the call to gen_server:call(Server, get_result, infinity) in
>> couch_rep:get_result just hangs forever.
>> 
>> 
>> On Thu, Jun 10, 2010 at 4:39 AM, Jan Lehnardt <jan@apache.org> wrote:
>> 
>>> Hi Paul,
>>> 
>>> thanks for the report. Out of curiosity, are you running an SSD drive in
>>> the box that reproduces the hangs?
>>> 
>>> And anyone: Can you reproduce this on non-SSD machines?
>>> 
>>> Cheers
>>> Jan
>>> --
>>> 
>>> On 10 Jun 2010, at 02:26, Paul Bonser wrote:
>>> 
>>>> Oh, I should also mention that I got the exact same error in multiple
>>>> freezes. Twice it was in the same exact order, and once it was in this
>>>> order:
>>>> 
>>>> [info] [<0.95.0>] starting replication
>>>> "15c25eda4ea6308af6bea9864d5319ef" at <0.1845.0>
>>>> [debug] [<0.1207.0>] OAuth Params: [{"att_encoding_info","true"}]
>>>> [info] [<0.1207.0>] 127.0.0.1 - - 'GET'
>>>> /test_suite_rep_docs_db_a/foo2?att_encoding_info=true 200
>>>> [debug] [<0.1207.0>] 'POST' /test_suite_rep_docs_db_b/_bulk_docs {1,1}
>>>> Headers: [{'Accept',"application/json"},
>>>>         {'Accept-Encoding',"gzip"},
>>>>         {'Content-Length',"167"},
>>>>         {'Host',"localhost:5985"},
>>>>         {'User-Agent',"CouchDB/0.12.0a953193"},
>>>>         {"X-Couch-Full-Commit","false"}]
>>>> [debug] [<0.1207.0>] OAuth Params: []
>>>> [info] [<0.1207.0>] 127.0.0.1 - - 'POST'
>>>> /test_suite_rep_docs_db_b/_bulk_docs 201
>>>> [debug] [<0.1076.0>] 'GET'
>>>> /test_suite_rep_docs_db_a/foo666?att_encoding_info=true {1,1}
>>>> Headers: [{'Accept',"application/json"},
>>>>         {'Accept-Encoding',"gzip"},
>>>>         {'Host',"localhost:5985"},
>>>>         {'User-Agent',"CouchDB/0.12.0a953193"}]
>>>> [debug] [<0.1076.0>] OAuth Params: [{"att_encoding_info","true"}]
>>>> [debug] [<0.1076.0>] Minor error in HTTP request: {not_found,missing}
>>>> [debug] [<0.1076.0>] Stacktrace: [{couch_httpd_db,couch_doc_open,4},
>>>>            {couch_httpd_db,db_doc_req,3},
>>>>            {couch_httpd_db,do_db_req,2},
>>>>            {couch_httpd,handle_request_int,5},
>>>>            {mochiweb_http,headers,5},
>>>>            {proc_lib,init_p_do_apply,3}]
>>>> [info] [<0.1076.0>] 127.0.0.1 - - 'GET'
>>>> /test_suite_rep_docs_db_a/foo666?att_encoding_info=true 404
>>>> [debug] [<0.1076.0>] httpd 404 error response:
>>>> {"error":"not_found","reason":"missing"}
>>>> 
>>>> 
>>>> Could it be some sort of race condition?
>>>> 
>>>> 
>>>> 
>>>> On Wed, Jun 9, 2010 at 8:22 PM, Paul Bonser <misterpib@gmail.com> wrote:
>>>> 
>>>>> 
>>>>> 
>>>>> On Wed, Jun 9, 2010 at 7:33 PM, J Chris Anderson <jchris@apache.org> wrote:
>>>>> 
>>>>>> Devs,
>>>>>> 
>>>>>> Is anyone else seeing the replicator test hang and never finish?
>>>>>> 
>>>>>> It never hangs the first few runs, but after running ten or so times, I'll
>>>>>> end up with the test suite waiting for a replication that never finishes.
>>>>>> It's the same story on 0.11.0, 0.11.x, and trunk.
>>>>>> 
>>>>>> Is anyone else able to reproduce this? Am I crazy?
>>>>>> 
>>>>> 
>>>>> It just froze for me on the first try, using 0.12.0a935298, then ran
>>>>> successfully 3 times, then froze again. The last thing logged the first time
>>>>> was a _bulk_docs request; the last thing logged this time was a PUT to
>>>>> /test_suite_db_b/_local%2F6598a76aa55cd39645e4730b4cb28c00
>>>>> 
>>>>> I'm running a Firefox 3.6 nightly build on Linux. For me, it froze the
>>>>> first time when I did a "run all" and the second time when just directly
>>>>> running the replication test.
>>>>> 
>>>>> After svn up-ing to the latest in trunk, it froze on the first try,
>>>>> directly running the replication test.
>>>>> 
>>>>> Here's the debug output for the last _replicate request where it freezes.
>>>>> It's requesting a document that isn't there.
>>>>> 
>>>>> 
>>>>> [info] [<0.95.0>] starting new replication
>>>>> "15c25eda4ea6308af6bea9864d5319ef" at <0.848.0>
>>>>> [debug] [<0.191.0>] 'GET'
>>>>> /test_suite_rep_docs_db_a/foo2?att_encoding_info=true {1,1}
>>>>> Headers: [{'Accept',"application/json"},
>>>>>         {'Accept-Encoding',"gzip"},
>>>>>         {'Host',"localhost:5985"},
>>>>>         {'User-Agent',"CouchDB/0.12.0a953193"}]
>>>>> [debug] [<0.191.0>] OAuth Params: [{"att_encoding_info","true"}]
>>>>> [info] [<0.191.0>] 127.0.0.1 - - 'GET'
>>>>> /test_suite_rep_docs_db_a/foo2?att_encoding_info=true 200
>>>>> [debug] [<0.189.0>] 'GET'
>>>>> /test_suite_rep_docs_db_a/foo666?att_encoding_info=true {1,1}
>>>>> Headers: [{'Accept',"application/json"},
>>>>>         {'Accept-Encoding',"gzip"},
>>>>>         {'Host',"localhost:5985"},
>>>>>         {'User-Agent',"CouchDB/0.12.0a953193"}]
>>>>> [debug] [<0.189.0>] OAuth Params: [{"att_encoding_info","true"}]
>>>>> [debug] [<0.189.0>] Minor error in HTTP request: {not_found,missing}
>>>>> [debug] [<0.189.0>] Stacktrace: [{couch_httpd_db,couch_doc_open,4},
>>>>>            {couch_httpd_db,db_doc_req,3},
>>>>>            {couch_httpd_db,do_db_req,2},
>>>>>            {couch_httpd,handle_request_int,5},
>>>>>            {mochiweb_http,headers,5},
>>>>>            {proc_lib,init_p_do_apply,3}]
>>>>> [info] [<0.189.0>] 127.0.0.1 - - 'GET'
>>>>> /test_suite_rep_docs_db_a/foo666?att_encoding_info=true 404
>>>>> [debug] [<0.189.0>] httpd 404 error response:
>>>>> {"error":"not_found","reason":"missing"}
>>>>> 
>>>>> [debug] [<0.191.0>] 'POST' /test_suite_rep_docs_db_b/_bulk_docs {1,1}
>>>>> Headers: [{'Accept',"application/json"},
>>>>>         {'Accept-Encoding',"gzip"},
>>>>>         {'Content-Length',"167"},
>>>>>         {'Host',"localhost:5985"},
>>>>>         {'User-Agent',"CouchDB/0.12.0a953193"},
>>>>>         {"X-Couch-Full-Commit","false"}]
>>>>> [debug] [<0.191.0>] OAuth Params: []
>>>>> [info] [<0.191.0>] 127.0.0.1 - - 'POST'
>>>>> /test_suite_rep_docs_db_b/_bulk_docs 201
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Paul Bonser
>>>>> http://probablyprogramming.com
>>>>> 

