couchdb-user mailing list archives

From: Carlos Alonso
Subject: Errors when moving a shard
Date: Wed, 26 Jul 2017 17:10:56 GMT

I have hit a few log errors when moving a shard under particular
circumstances, and I'd like to share them here and get your input on whether
this should be reported or not.

So let me describe the steps I took:

1. A 3-node cluster (couch-0, couch-1 and couch-2), 1 database (my_db) with
48 shards and 1 replica
2. A 4th node (couch-3) is added to the cluster.
3. Change the shards map so that the new node gets one of the shards from
couch-0 (at this moment both couch-0 and couch-3 contain the shard)
4. Synchronisation happens and the new node gets its shard
5. Change the shards map again so that couch-0 is no longer that shard's owner
6. I go into the couch-0 node and manually delete the shard's .couch file
to reclaim disk space
7. All fine here

8. Now I want to move the shard back to the node it originally lived on
9. I put couch-0 into maintenance mode (I always do this before adding a shard
to a node, to keep it from responding to reads before it is synced)
10. Modify the shards map, adding the shard back to couch-0 (a sketch of how I
do these map edits and the maintenance toggle follows this list)
11. All nodes' logs fill up with errors (details below)
12. I take couch-0 out of maintenance mode and things seem to flow again
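
Since most of those steps are plain HTTP calls, here is a minimal sketch of how
I do the shard-map edits (steps 3, 5 and 10) and the maintenance toggle (steps
9 and 12). It assumes a CouchDB 2.x cluster where the node-local port 5986 is
reachable and an admin user exists; the credentials are placeholders, the shard
range is the one from the logs below, and the changelog entry format is my
assumption from the shard-map docs I have seen. Removing a node from the map
(step 5) is just the inverse of the add.

```
import requests

ADMIN = ("admin", "secret")           # placeholder admin credentials
LOCAL = "http://couch-1:5986"         # node-local ("backdoor") port of any node
RANGE = "0fffffff-15555553"           # the shard range being moved
TARGET = "couchdb@couch-0"            # node that should (re)gain the shard

# Fetch the shard-map document for my_db from the node-local _dbs database.
doc = requests.get(f"{LOCAL}/_dbs/my_db", auth=ADMIN).json()

# Add the range to the target node's by_node entry...
doc["by_node"].setdefault(TARGET, [])
if RANGE not in doc["by_node"][TARGET]:
    doc["by_node"][TARGET].append(RANGE)

# ...and the node to that range's by_range entry.
doc["by_range"].setdefault(RANGE, [])
if TARGET not in doc["by_range"][RANGE]:
    doc["by_range"][RANGE].append(TARGET)

# Record the change (entry format assumed: ["add"|"delete", range, node]).
doc.setdefault("changelog", []).append(["add", RANGE, TARGET])

# Write the updated map back; the doc still carries its _rev from the GET.
requests.put(f"{LOCAL}/_dbs/my_db", auth=ADMIN, json=doc).raise_for_status()

# Maintenance mode (steps 9 and 12) is a config flag on the node itself.
cfg = "http://couch-0:5986/_config/couchdb/maintenance_mode"
requests.put(cfg, auth=ADMIN, json="true")    # enter maintenance
# ... wait until the shard is synced (see the check further down) ...
requests.put(cfg, auth=ADMIN, json="false")   # leave maintenance
```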

So that is the process; now let me describe what I spotted in each node's logs.

Couch-0 seems to go through a few stages:

1. It tries to create the shard and somehow detects that it existed before
(probably something I forgot to delete in step 6; a small check I could have
run is sketched right after the log line)

`mem3_shards tried to create shards/0fffffff-15555553/my_db.1500994155, got
file_exists`
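
As a sanity check on step 6, this is the kind of quick scan I could have run
on couch-0 before re-adding the shard, to see whether anything for that shard
survived besides the .couch file I deleted. The paths are assumptions based on
the default CouchDB 2.x layout on my installs (shard files under data/shards/
and view index artifacts under data/.shards/).

```
import glob
import os

DATA_DIR = "./data"                                # assumed database_dir
SHARD = "0fffffff-15555553/my_db.1500994155"       # range/name from the logs

patterns = [
    f"{DATA_DIR}/shards/{SHARD}.couch",    # the shard database file itself
    f"{DATA_DIR}/.shards/{SHARD}*",        # view/index artifacts, if any exist
]
for pattern in patterns:
    for path in glob.glob(pattern):
        kind = "file" if os.path.isfile(path) else "dir"
        print(f"still on disk ({kind}): {path}")
```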

2. gen_server crashes

`CRASH REPORT Process  (<0.27288.4>) with 0 neighbors exited with reason:
no match of right hand value {error,enoent} at couch_file:sync/1(line:211)
<= couch_db_updater:sync_header/2(line:987) <=
couch_db_updater:update_docs_int/5(line:906) <=
couch_db_updater:handle_info/2(line:289) <=
gen_server:handle_msg/5(line:599) <= proc_lib:wake_up/3(line:247) at
gen_server:terminate/6(line:737) <= proc_lib:wake_up/3(line:247);
initial_call: {couch_db_updater,init,['Argument__1']}, ancestors:
[<0.27267.4>], messages: [], links: [<0.210.0>], dictionary:
trap_exit: false, status: running, heap_size: 6772, stack_size: 27,
reductions: 300961927`

3. It seems to somehow recover and tries to open the file again

Could not open file ./data/shards/0fffffff-15555553/my_db.1500994155.couch:
no such file or directory

open_result error {not_found,no_db_file} for

4. Tries to create the file

`creating missing database: shards/0fffffff-15555553/my_db.1500994155`

5. It continuously fails because it cannot load validation funs, possibly
because of the maintenance mode?

Error in process <0.2126.141> on node 'couchdb@couch-0' with exit value:

Error in process <0.1970.141> on node 'couchdb@couch-0' with exit value:

could not load validation funs

Couch-3 shows a warning and an error

[warning] ... -------- mem3_sync shards/0fffffff-15555553/my_db.1500994155
[error] ... -------- Error in process <0.13658.13> on node 'couchdb@couch-3'
with exit value:

Which, to me, means that couch-0 is responding with internal server errors to
its requests.

Couch-1, which happens to be the owner of the task of replicating my_db from a
remote server, seems to go through two stages:

First it appears unable to continue with the replication process because it
receives a 500 error (maybe from couch-0? A quick probe to confirm that is
sketched after these excerpts):
[error] ... req_err(4096501418) unknown_error : badarg
    [<<"dict:fetch/2 L130">>,<<"couch_util:-reorder_results/2-lc$^1/1-1-/2
L424">>,<<"fabric_doc_update:go/3 L41">>,<<"fabric:update_docs/3
L259">>,<<"chttpd_db:db_req/2 L445">>,<<"chttpd:process_request/1
L293">>,<<"chttpd:handle_request_int/1 L229">>]

[notice] ... undefined POST /my_db/_bulk_docs 500
ok 114

[notice] ... Retrying POST request to
in 0.25 seconds due to error {code,500}
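
To confirm where the 500s were coming from, this is the kind of hypothetical
probe I would run next time: hit each node's clustered port for my_db directly
and see which one misbehaves. Node names and credentials are placeholders.

```
import requests

ADMIN = ("admin", "secret")   # placeholder credentials
for node in ["couch-0", "couch-1", "couch-2", "couch-3"]:
    try:
        # Ask each node directly for the clustered db info of my_db.
        r = requests.get(f"http://{node}:5984/my_db", auth=ADMIN, timeout=5)
        print(node, r.status_code)
    except requests.RequestException as exc:
        print(node, "request failed:", exc)
```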

After disabling maintenance on couch-0 the replication process seems to work
again, but a few seconds later lots of new errors appear:

[error]  -------- rexi_server exit:timeout

And a few seconds later (I have not been able to correlate it to anything
so far) they stop.

Finally, couch-2 just shows one error, the same as the last one from couch-1:

[error]  -------- rexi_server exit:timeout

*In conclusion:*

To me it looks like two things are involved here:

1. The fact that I deleted the file from disk while something else still knows
that it should be there
2. The fact that the node is under maintenance, which seems to prevent new
shards from being created (a sketch of the sync check I would like to run
before lifting maintenance follows below)
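
On that second point, this is a rough sketch of the check I would like to
automate before taking a node out of maintenance: compare the shard's
doc_count on the node being (re)added against a copy that is already in place,
through the node-local port. The node names, port and credentials are
assumptions, and I am not sure this is the recommended way to verify that
internal replication has caught up.

```
import requests
from urllib.parse import quote

ADMIN = ("admin", "secret")                          # placeholder credentials
SHARD = "shards/0fffffff-15555553/my_db.1500994155"  # shard from the logs

def shard_info(node):
    # Shard files are addressable as databases on the node-local port (5986),
    # with the path URL-encoded.
    url = f"http://{node}:5986/{quote(SHARD, safe='')}"
    return requests.get(url, auth=ADMIN).json()

src = shard_info("couch-3")   # node that already holds a synced copy
dst = shard_info("couch-0")   # node re-entering the shard map
print("source doc_count:", src.get("doc_count"),
      "target doc_count:", dst.get("doc_count"))
# Only drop maintenance_mode once the counts match (update_seqs are per-file,
# so I would not expect those to be identical).
```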

Sorry for such a wall of text. I hope it is detailed enough to get input from
someone who can help me confirm or refute my theories, and to decide whether
it makes sense to open a GH issue to make this part of the process more robust.


*Carlos Alonso*
Data Engineer
Madrid, Spain
