couchdb-dev mailing list archives

From Paul Bonser <mister...@gmail.com>
Subject Re: replicator test hanging
Date Thu, 10 Jun 2010 15:43:16 GMT
Ok, so I've tracked it down to the specific location where it happens:

- couch_rep_reader:spawn_document_request/2 is called
- in the SpawnFun defined in there, it calls couch_rep_reader:open_doc
- open_doc gets an error, not_found response (not sure why, shouldn't the
doc be there already?)
- open_doc returns [] back to the SpawnFun
- SpawnFun calls gen_server:call(Server, {add_docs, nil, Results}... with
Results being []
- handle_call(add_docs) calls handle_add_docs, which increments the document
count..by 0..
- and then returns {noreply,...}
- then everything just sits there, because each part is waiting for another
part to do something

It seems the solution here is to either add a retry into
spawn_document_request's SpawnFun, or at the very least, fail when open_doc
returns [], rather than continuing on, since that results in a set of
deadlocked processes.
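
A minimal sketch of the retry-then-fail idea above, written as a standalone
module rather than a patch to couch_rep_reader. The module name
rep_retry_sketch, the function fetch_with_retry/3, and the 10 ms delay are all
hypothetical. OpenFun stands in for couch_rep_reader:open_doc and is assumed to
return [] on not_found, as in the steps listed above.

```erlang
-module(rep_retry_sketch).
-export([fetch_with_retry/3]).

%% Hypothetical delay between attempts, in milliseconds.
-define(RETRY_DELAY_MS, 10).

%% fetch_with_retry(OpenFun, DocId, RetriesLeft) ->
%%     {ok, Docs} | {error, not_found}
%%
%% OpenFun is a stand-in for couch_rep_reader:open_doc. It is assumed to
%% return [] when the source answers not_found, and a non-empty list of
%% documents otherwise.
fetch_with_retry(_OpenFun, _DocId, 0) ->
    %% Out of attempts: report the failure instead of handing an empty
    %% result list to the caller.
    {error, not_found};
fetch_with_retry(OpenFun, DocId, RetriesLeft)
  when is_integer(RetriesLeft), RetriesLeft > 0 ->
    case OpenFun(DocId) of
        [] ->
            %% Nothing came back; wait briefly and try again.
            timer:sleep(?RETRY_DELAY_MS),
            fetch_with_retry(OpenFun, DocId, RetriesLeft - 1);
        Docs when is_list(Docs) ->
            {ok, Docs}
    end.
```

In SpawnFun, an {error, not_found} result would presumably become an exit
rather than an add_docs call with [], so the linked processes stop instead of
waiting on each other.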

On Thu, Jun 10, 2010 at 9:28 AM, Paul Bonser <misterpib@gmail.com> wrote:

> Nope, just a regular 7200RPM SATA drive.
>
> So you guys may already know this, but I've tracked it down to a couch_rep
> gen_server never terminating, and thus not calling do_terminate, and thus
> the call to gen_server:call(Server, get_result, infinity) in
> couch_rep:get_result just hangs forever.
>
>
> On Thu, Jun 10, 2010 at 4:39 AM, Jan Lehnardt <jan@apache.org> wrote:
>
>> Hi Paul,
>>
>> thanks for the report. Out of curiosity, are you running an SSD drive in
>> the box that reproduces the hangs?
>>
>> And anyone: Can you reproduce this on non-SSD machines?
>>
>> Cheers
>> Jan
>> --
>>
>> On 10 Jun 2010, at 02:26, Paul Bonser wrote:
>>
>> > Oh, I should also mention that I got the exact same error in multiple
>> > freezes. Twice it was in the same exact order, and once it was in this
>> > order:
>> >
>> > [info] [<0.95.0>] starting replication
>> > "15c25eda4ea6308af6bea9864d5319ef" at <0.1845.0>
>> > [debug] [<0.1207.0>] OAuth Params: [{"att_encoding_info","true"}]
>> > [info] [<0.1207.0>] 127.0.0.1 - - 'GET'
>> > /test_suite_rep_docs_db_a/foo2?att_encoding_info=true 200
>> > [debug] [<0.1207.0>] 'POST' /test_suite_rep_docs_db_b/_bulk_docs {1,1}
>> > Headers: [{'Accept',"application/json"},
>> >          {'Accept-Encoding',"gzip"},
>> >          {'Content-Length',"167"},
>> >          {'Host',"localhost:5985"},
>> >          {'User-Agent',"CouchDB/0.12.0a953193"},
>> >          {"X-Couch-Full-Commit","false"}]
>> > [debug] [<0.1207.0>] OAuth Params: []
>> > [info] [<0.1207.0>] 127.0.0.1 - - 'POST'
>> > /test_suite_rep_docs_db_b/_bulk_docs 201
>> > [debug] [<0.1076.0>] 'GET'
>> > /test_suite_rep_docs_db_a/foo666?att_encoding_info=true {1,1}
>> > Headers: [{'Accept',"application/json"},
>> >          {'Accept-Encoding',"gzip"},
>> >          {'Host',"localhost:5985"},
>> >          {'User-Agent',"CouchDB/0.12.0a953193"}]
>> > [debug] [<0.1076.0>] OAuth Params: [{"att_encoding_info","true"}]
>> > [debug] [<0.1076.0>] Minor error in HTTP request: {not_found,missing}
>> > [debug] [<0.1076.0>] Stacktrace: [{couch_httpd_db,couch_doc_open,4},
>> >             {couch_httpd_db,db_doc_req,3},
>> >             {couch_httpd_db,do_db_req,2},
>> >             {couch_httpd,handle_request_int,5},
>> >             {mochiweb_http,headers,5},
>> >             {proc_lib,init_p_do_apply,3}]
>> > [info] [<0.1076.0>] 127.0.0.1 - - 'GET'
>> > /test_suite_rep_docs_db_a/foo666?att_encoding_info=true 404
>> > [debug] [<0.1076.0>] httpd 404 error response:
>> > {"error":"not_found","reason":"missing"}
>> >
>> >
>> > Could it be some sort of race condition?
>> >
>> >
>> >
>> > On Wed, Jun 9, 2010 at 8:22 PM, Paul Bonser <misterpib@gmail.com> wrote:
>> >
>> >>
>> >>
>> >> On Wed, Jun 9, 2010 at 7:33 PM, J Chris Anderson <jchris@apache.org> wrote:
>> >>
>> >>> Devs,
>> >>>
>> >>> Is anyone else seeing the replicator test hang and never finish?
>> >>>
>> >>> It never hangs the first few runs, but after running ten or so times,
>> >>> I'll end up with the test suite waiting for a replication that never
>> >>> finishes.
>> >>> It's the same story on 0.11.0, 0.11.x, and trunk.
>> >>>
>> >>> Is anyone else able to reproduce this? Am I crazy?
>> >>>
>> >>
>> >> It just froze for me on the first try, using 0.12.0a935298, then ran
>> >> successfully 3 times, then froze again. The last thing logged the first
>> >> time was a _bulk_docs request; the last thing logged this time was a
>> >> PUT to /test_suite_db_b/_local%2F6598a76aa55cd39645e4730b4cb28c00
>> >>
>> >> I'm running a Firefox 3.6 nightly build on Linux. For me, it froze the
>> >> first time when I did a "run all" and the second time when just
>> >> directly running the replication test.
>> >>
>> >> After svn up-ing to the latest in trunk, it froze on the first try,
>> >> directly running the replication test.
>> >>
>> >> Here's the debug output for the last _replicate request where it
>> >> freezes. It's requesting a document that isn't there.
>> >>
>> >>
>> >> [info] [<0.95.0>] starting new replication
>> >> "15c25eda4ea6308af6bea9864d5319ef" at <0.848.0>
>> >> [debug] [<0.191.0>] 'GET'
>> >> /test_suite_rep_docs_db_a/foo2?att_encoding_info=true {1,1}
>> >> Headers: [{'Accept',"application/json"},
>> >>          {'Accept-Encoding',"gzip"},
>> >>          {'Host',"localhost:5985"},
>> >>          {'User-Agent',"CouchDB/0.12.0a953193"}]
>> >> [debug] [<0.191.0>] OAuth Params: [{"att_encoding_info","true"}]
>> >> [info] [<0.191.0>] 127.0.0.1 - - 'GET'
>> >> /test_suite_rep_docs_db_a/foo2?att_encoding_info=true 200
>> >> [debug] [<0.189.0>] 'GET'
>> >> /test_suite_rep_docs_db_a/foo666?att_encoding_info=true {1,1}
>> >> Headers: [{'Accept',"application/json"},
>> >>          {'Accept-Encoding',"gzip"},
>> >>          {'Host',"localhost:5985"},
>> >>          {'User-Agent',"CouchDB/0.12.0a953193"}]
>> >> [debug] [<0.189.0>] OAuth Params: [{"att_encoding_info","true"}]
>> >> [debug] [<0.189.0>] Minor error in HTTP request: {not_found,missing}
>> >> [debug] [<0.189.0>] Stacktrace: [{couch_httpd_db,couch_doc_open,4},
>> >>             {couch_httpd_db,db_doc_req,3},
>> >>             {couch_httpd_db,do_db_req,2},
>> >>             {couch_httpd,handle_request_int,5},
>> >>             {mochiweb_http,headers,5},
>> >>             {proc_lib,init_p_do_apply,3}]
>> >> [info] [<0.189.0>] 127.0.0.1 - - 'GET'
>> >> /test_suite_rep_docs_db_a/foo666?att_encoding_info=true 404
>> >> [debug] [<0.189.0>] httpd 404 error response:
>> >> {"error":"not_found","reason":"missing"}
>> >>
>> >> [debug] [<0.191.0>] 'POST' /test_suite_rep_docs_db_b/_bulk_docs {1,1}
>> >> Headers: [{'Accept',"application/json"},
>> >>          {'Accept-Encoding',"gzip"},
>> >>          {'Content-Length',"167"},
>> >>          {'Host',"localhost:5985"},
>> >>          {'User-Agent',"CouchDB/0.12.0a953193"},
>> >>          {"X-Couch-Full-Commit","false"}]
>> >> [debug] [<0.191.0>] OAuth Params: []
>> >> [info] [<0.191.0>] 127.0.0.1 - - 'POST'
>> >> /test_suite_rep_docs_db_b/_bulk_docs 201
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> Paul Bonser
>> >> http://probablyprogramming.com
>> >>
>> >
>> >
>> >
>> > --
>> > Paul Bonser
>> > http://probablyprogramming.com
>>
>>
>
>
> --
> Paul Bonser
> http://probablyprogramming.com
>



-- 
Paul Bonser
http://probablyprogramming.com
