couchdb-user mailing list archives

From Carlos Alonso <carlos.alo...@cabify.com>
Subject Re: Errors when moving a shard
Date Thu, 27 Jul 2017 08:17:56 GMT
Hi Robert, thanks for your reply. Very good SO response indeed.

Just a couple of questions still.

When you say "Perform a replication, again taking care to do this on port
5986", what exactly do you mean? Could you please paste an example
replication document? Isn't a replication started automatically as soon as
the shards map is modified?
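
For what it's worth, this is the shape I would have guessed: a normal
replication document POSTed to the node-local port, with the shard paths as
source and target (the shard range, timestamp and node name below are just
taken from my own setup, so please correct me if the real thing looks
different):

```
curl -X POST http://127.0.0.1:5986/_replicate \
  -H 'Content-Type: application/json' \
  -d '{
    "source": "shards/0fffffff-15555553/my_db.1500994155",
    "target": "http://couch-3:5986/shards/0fffffff-15555553/my_db.1500994155",
    "create_target": true
  }'
```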

Also, related to the issue I experienced: when I manually delete the shard
from the first location, is there anything internal to CouchDB that still
references it? When I tried to move it back, the logs showed an error
complaining that the file already existed (but in reality it didn't).
Please see those errors pasted below (and the check I sketch after them):

1. Tries to create the shard and somehow detects that it existed before
(probably because of something I forgot to delete in step 6)

`mem3_shards tried to create shards/0fffffff-15555553/my-db.1500994155, got
file_exists` (edited)

2. gen_server crashes

`CRASH REPORT Process  (<0.27288.4>) with 0 neighbors exited with reason:
no match of right hand value {error,enoent} at couch_file:sync/1(line:211)
<= couch_db_updater:sync_header/2(line:987) <=
couch_db_updater:update_docs_int/5(line:906) <=
couch_db_updater:handle_info/2(line:289) <=
gen_server:handle_msg/5(line:599) <= proc_lib:wake_up/3(line:247) at
gen_server:terminate/6(line:737) <= proc_lib:wake_up/3(line:247);
initial_call: {couch_db_updater,init,['Argument__1']}, ancestors:
[<0.27267.4>], messages: [], links: [<0.210.0>], dictionary:
[{io_priority,{db_update,<<"shards/0fffffff-15555553/my_db...">>}},...],
trap_exit: false, status: running, heap_size: 6772, stack_size: 27,
reductions: 300961927`

3. Seems to somehow recover and try to open the file again

```
Could not open file ./data/shards/0fffffff-15555553/my_db.1500994155.couch:
no such file or directory

open_result error {not_found,no_db_file} for
shards/0fffffff-15555553/my_db.1500994155
```

4. Tries to create the file

`creating missing database: shards/0fffffff-15555553/my_db.1500994155`
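
Going back to my question about lingering references, this is how I plan to
double check things on couch-0 (assuming the shard map lives in the
node-local _dbs database on port 5986, as I understand it does in 2.0; the
db name, shard range and data path are just the ones from my logs):

```
# Inspect the shard map document for my_db on the node-local interface;
# by_node / by_range should say whether couch-0 is still expected to hold
# this shard.
curl -s http://127.0.0.1:5986/_dbs/my_db

# Confirm whether the shard file is really gone on disk.
ls -l ./data/shards/0fffffff-15555553/
```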


Thank you very much.

On Thu, Jul 27, 2017 at 9:59 AM Robert Samuel Newson <rnewson@apache.org>
wrote:

> Not sure if you saw my write-up from the BigCouch era, still valid for
> CouchDB 2.0;
>
>
> https://stackoverflow.com/questions/6676972/moving-a-shard-from-one-bigcouch-server-to-another-for-balancing
>
> Shard moving / database rebalancing is definitely a bit tricky and we
> could use better tools for it.
>
>
> > On 26 Jul 2017, at 18:10, Carlos Alonso <carlos.alonso@cabify.com>
> wrote:
> >
> > Hi!
> >
> > I have hit a few log errors when moving a shard under particular
> > circumstances, and I'd like to share them here and get your input on
> > whether this should be reported or not.
> >
> > So let me describe the steps I took:
> >
> > 1. A 3-node cluster (couch-0, couch-1 and couch-2) with 1 database (my_db),
> > 48 shards and 1 replica
> > 2. A 4th node (couch-3) is added to the cluster.
> > 3. Change the shards map so that the new node gets one of the shards from
> > couch-0 (at this moment both couch-0 and couch-3 contain the shard)
> > 4. Synchronisation happens and the new node gets its shard
> > 5. Change the shards map again so that couch-0 is no longer that shard's
> > owner
> > 6. I go into the couch-0 node and manually delete the shard's .couch file,
> > to reclaim disk space
> > 7. All fine here
> >
> > 8. Now I want to put the shard back into the original node, where it was
> > before
> > 9. I put couch-0 into maintenance mode (I always do this before adding a
> > shard to a node, to avoid it responding to reads before it is synced)
> > 10. Modify the shards map, adding the shard back to couch-0
> > 11. All nodes' logs fill up with errors (details below)
> > 12. I remove couch-0's maintenance mode and things seem to flow again
> >
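
(Interjecting here for context, since the shards map changes in steps 3, 5
and 10 and the maintenance mode toggling in steps 9 and 12 are the parts I
may be getting wrong. This is roughly what I run, assuming the map is edited
through the node-local _dbs database and maintenance mode through the config
API; the node and db names are from my setup.)

```
# Fetch the shard map document for my_db from the node-local port, edit
# by_node / by_range (and add a changelog entry), then write it back.
curl -s http://127.0.0.1:5986/_dbs/my_db > my_db_shardmap.json
# ... edit my_db_shardmap.json by hand ...
curl -X PUT http://127.0.0.1:5986/_dbs/my_db -d @my_db_shardmap.json

# Put couch-0 into maintenance mode before giving it the shard back, and
# lift it again once the shard is in sync.
curl -X PUT http://couch-0:5986/_config/couchdb/maintenance_mode -d '"true"'
curl -X PUT http://couch-0:5986/_config/couchdb/maintenance_mode -d '"false"'
```
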
> > So that is the process; now let me describe what I spotted in the logs:
> >
> > Couch-0 seems to go through a few stages:
> >
> > 1. Tries to create the shard and somehow detects that it existed before
> > (probably because of something I forgot to delete in step 6)
> >
> > `mem3_shards tried to create shards/0fffffff-15555553/my-db.1500994155,
> > got file_exists` (edited)
> >
> > 2. gen_server crashes
> >
> > `CRASH REPORT Process  (<0.27288.4>) with 0 neighbors exited with reason:
> > no match of right hand value {error,enoent} at
> couch_file:sync/1(line:211)
> > <= couch_db_updater:sync_header/2(line:987) <=
> > couch_db_updater:update_docs_int/5(line:906) <=
> > couch_db_updater:handle_info/2(line:289) <=
> > gen_server:handle_msg/5(line:599) <= proc_lib:wake_up/3(line:247) at
> > gen_server:terminate/6(line:737) <= proc_lib:wake_up/3(line:247);
> > initial_call: {couch_db_updater,init,['Argument__1']}, ancestors:
> > [<0.27267.4>], messages: [], links: [<0.210.0>], dictionary:
> > [{io_priority,{db_update,<<"shards/0fffffff-15555553/my_db...">>}},...],
> > trap_exit: false, status: running, heap_size: 6772, stack_size: 27,
> > reductions: 300961927`
> >
> > 3. Seems to somehow recover and try to open the file again
> >
> > ```
> > Could not open file ./data/shards/0fffffff-15555553/my_db.1500994155.couch:
> > no such file or directory
> >
> > open_result error {not_found,no_db_file} for
> > shards/0fffffff-15555553/my_db.1500994155
> > ```
> >
> > 4. Tries to create the file
> >
> > `creating missing database: shards/0fffffff-15555553/my_db.1500994155`
> >
> > 5. Continuously fails because it cannot load validation funcs, possibly
> > because of the maintenance mode?
> >
> > ```
> > Error in process <0.2126.141> on node 'couchdb@couch-0' with exit value:
> > {{badmatch,{error,{maintenance_mode,nil,'couchdb@couch-0
> >
> '}}},[{ddoc_cache_opener,recover_validation_funs,1,[{file,"src/ddoc_cache_opener.erl"},{line,127}]},{ddoc_cache_opener,fetch_doc_data,1,[...
> >
> > Error in process <0.1970.141> on node 'couchdb@couch-0' with exit value:
> >
> {{case_clause,{error,{{badmatch,{error,{maintenance_mode,nil,'couchdb@couch-0
> >
> '}}},[{ddoc_cache_opener,recover_validation_funs,1,[{file,"src/ddoc_cache_opener.erl"},{line,127}]},{ddoc_cache_opener...
> >
> > could not load validation funs
> >
> {{case_clause,{error,{{badmatch,{error,{maintenance_mode,nil,'couchdb@couch-0
> >
> '}}},[{ddoc_cache_opener,recover_validation_funs,1,[{file,"src/ddoc_cache_opener.erl"},{line,127}]},{ddoc_cache_opener,fetch_doc_data,1,[{file,"src/ddoc_cache_opener.erl"},{line,240}]}]}}},[{ddoc_cache_opener,handle_open_response,1,[{file,"src/ddoc_cache_opener.erl"},{line,282}]},{couch_db,'-load_validation_funs/1-fun-0-',1,[{file,"src/couch_db.erl"},{line,659}]}]}
> > ```
> >
> > Couch-3 shows a warning and an error
> >
> > ```
> > [warning] ... -------- mem3_sync
> shards/0fffffff-15555553/my_db.1500994155
> > couchdb@couch-0
> >
> {internal_server_error,[{mem3_rpc,rexi_call,2,[{file,[115,114,99,47,109,101,109,51,95,114,112,99,46,101,114,108]},{line,267}]},{mem3_rep,save_on_target,3,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,286}]},{mem3_rep,replicate_batch,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,256}]},{mem3_rep,repl,2,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,178}]},{mem3_rep,go,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,81}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,[115,114,99,47,109,101,109,51,95,115,121,110,99,46,101,114,108]},{line,208}]}]}
> > ```
> > ```
> > [error] ... -------- Error in process <0.13658.13> on node
> 'couchdb@couch-3'
> > with exit value:
> >
> {internal_server_error,[{mem3_rpc,rexi_call,2,[{file,"src/mem3_rpc.erl"},{line,267}]},{mem3_rep,save_on_target,3,[{file,"src/mem3_rep.erl"},{line,286}]},{mem3_rep,replicate_batch,1,[{file,"src/mem3_rep.erl"},{line,256}]},{mem3_rep...
> > ```
> >
> > Which, to me, means that couch-0 is responding with internal server
> > errors to its requests.
> >
> > Couch-1, which, by the way, owns the task of replicating my_db from a
> > remote server, seems to go through two stages:
> >
> > First it seems unable to continue with the replication process because it
> > receives a 500 error (maybe from couch-0?):
> > ```
> > [error] ... req_err(4096501418) unknown_error : badarg
> >    [<<"dict:fetch/2 L130">>,<<"couch_util:-reorder_results/2-lc$^1/1-1-/2
> > L424">>,<<"couch_util:-reorder_results/2-lc$^1/1-1-/2
> > L424">>,<<"fabric_doc_update:go/3 L41">>,<<"fabric:update_docs/3
> > L259">>,<<"chttpd_db:db_req/2 L445">>,<<"chttpd:process_request/1
> > L293">>,<<"chttpd:handle_request_int/1 L229">>]
> >
> >
> > [notice] ... 127.0.0.1:5984 127.0.0.1 undefined POST /my_db/_bulk_docs
> 500
> > ok 114
> >
> > [notice] ... Retrying POST request to
> http://127.0.0.1:5984/my_db/_bulk_docs
> > in 0.25 seconds due to error {code,500}
> > ```
> >
> > After disabling maintenance mode on couch-0 the replication process seems
> > to work again, but a few seconds later lots of new errors appear:
> >
> > ```
> > [error]  -------- rexi_server exit:timeout
> >
> [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,256}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,204}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,286}]},{couch_mrview,finish_fold,2,[{file,"src/couch_mrview.erl"},{line,632}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
> > ```
> >
> > And a few seconds later (I have not been able to correlate it to anything
> > so far) they stop.
> >
> > Finally, couch-2 shows just one error, the same as the last one from
> > couch-1:
> >
> > ```
> > [error]  -------- rexi_server exit:timeout
> >
> [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,256}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,204}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,286}]},{couch_mrview,finish_fold,2,[{file,"src/couch_mrview.erl"},{line,632}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
> > ```
> >
> > *In conclusion:*
> >
> > To me it looks like two things are involved here:
> >
> > 1. The fact that I deleted the file from disk while something else still
> > knows that it should be there
> > 2. The fact that the node is under maintenance, which seems to prevent new
> > shards from being created
> >
> > Sorry for such a wall of text. I hope it is detailed enough to get
> > someone's input that can help me confirm or refute my theories, and to
> > decide whether or not it makes sense to open a GH issue to make the
> > process more robust at this stage.
> >
> > Regards
> >
>
--

*Carlos Alonso*
Data Engineer
Madrid, Spain

carlos.alonso@cabify.com

-- 
Este mensaje y cualquier archivo adjunto va dirigido exclusivamente a su 
destinatario, pudiendo contener información confidencial sometida a secreto 
profesional. No está permitida su reproducción o distribución sin la 
autorización expresa de Cabify. Si usted no es el destinatario final por 
favor elimínelo e infórmenos por esta vía. 

This message and any attached file are intended exclusively for the 
addressee, and it may be confidential. You are not allowed to copy or 
disclose it without Cabify's prior written authorization. If you are not 
the intended recipient please delete it from your system and notify us by 
e-mail.
