couchdb-dev mailing list archives

From "Alexander Shorin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (COUCHDB-2102) Downstream replicator database bloat
Date Fri, 07 Mar 2014 22:35:44 GMT

    [ https://issues.apache.org/jira/browse/COUCHDB-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13924418#comment-13924418 ]

Alexander Shorin commented on COUCHDB-2102:
-------------------------------------------

Fedor Indutny (@indutny on IRC) recently reported the same behavior using the 1.5.0 release
with filtered replication of the npm registry. While the actual data is under 300MB, the
database's size on disk grew to 81GB:

{code}
{
  "committed_update_seq": 224,
  "disk_format_version": 6,
  "instance_start_time": "1393936995838019",
  "db_name": "yandex-packages",
  "doc_count": 208,
  "doc_del_count": 0,
  "update_seq": 224,
  "purge_seq": 0,
  "compact_running": false,
  "disk_size": 81703006328,
  "data_size": 279752269
}
{code}
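
To put these figures in perspective, here is a minimal sketch that computes how far disk_size has outgrown data_size. The field names and values are taken from the db info output quoted in this comment; the helper names bloat_ratio and reclaimable_bytes are made up for illustration and are not part of CouchDB.

```python
import json

# Database info as quoted in this comment (values copied verbatim).
DB_INFO = json.loads("""
{
  "committed_update_seq": 224,
  "disk_format_version": 6,
  "instance_start_time": "1393936995838019",
  "db_name": "yandex-packages",
  "doc_count": 208,
  "doc_del_count": 0,
  "update_seq": 224,
  "purge_seq": 0,
  "compact_running": false,
  "disk_size": 81703006328,
  "data_size": 279752269
}
""")


def bloat_ratio(info):
    """Hypothetical helper: how many times larger the file is than its live data."""
    return info["disk_size"] / info["data_size"]


def reclaimable_bytes(info):
    """Hypothetical helper: bytes in the file that are not live data."""
    return info["disk_size"] - info["data_size"]


if __name__ == "__main__":
    print(f"{DB_INFO['db_name']}: ratio {bloat_ratio(DB_INFO):.1f}x, "
          f"{reclaimable_bytes(DB_INFO)} bytes not accounted for by live data")
```

For the numbers reported here, the file is roughly 292 times larger than the live data, so about 81.4GB of it is something other than current document content.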

I have failed to reproduce this so far, but it does not seem to be a local anomaly.

Could you, [~isaacs], [~terinjokes] and everyone else, provide some more information about
your environment? It would also be awesome if you could provide a database file that suffered
from this bug, so that we can investigate what the bloat data actually is.

> Downstream replicator database bloat
> ------------------------------------
>
>                 Key: COUCHDB-2102
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-2102
>             Project: CouchDB
>          Issue Type: Bug
>      Security Level: public(Regular issues) 
>          Components: Replication
>            Reporter: Isaac Z. Schlueter
>
> When I do continuous replication from one db to another, I get a lot of bloat over time.
> For example, replicating a _users db with a relatively low level of writes, and around
> 30,000 documents, the size on disk of the downstream replica was over 300MB after 2 weeks.
> I compacted the DB, and the size dropped to about 20MB (slightly smaller than the source
> database).
> Of course, I realize that I can configure compaction to happen regularly.  But this still
> seems like a rather excessive tax.  It is especially shocking to users who are replicating
> a 100GB database full of attachments, and find it grow to 400GB if they're not careful!  You
> can easily end up in a situation where you don't have enough disk space to successfully compact.
> Is there a fundamental reason why this happens?  Or has it simply never been a priority?
> It'd be awesome if replication were more efficient with disk space.
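
On the concern about running out of room to compact: compaction writes a fresh copy of the database file before swapping it in, so the free space needed is roughly the size of the live data. Below is a rough sketch of a pre-flight check. The 20% headroom, the helper names and the http://localhost:5984 base URL are assumptions for illustration only; POST /{db}/_compact is the documented compaction endpoint and requires admin rights. The request is only built here, not sent.

```python
import shutil
import urllib.request


def has_room_to_compact(data_size, free_bytes, headroom=1.2):
    """Hypothetical check: is there space for a compacted copy plus some margin?

    The headroom factor is an arbitrary assumption, not a CouchDB rule.
    """
    return free_bytes >= data_size * headroom


def build_compact_request(base_url, db_name):
    """Build (but do not send) a POST /{db}/_compact request."""
    url = f"{base_url.rstrip('/')}/{db_name}/_compact"
    return urllib.request.Request(
        url,
        data=b"",
        headers={"Content-Type": "application/json"},  # CouchDB requires this header
        method="POST",
    )


if __name__ == "__main__":
    data_size = 279752269  # data_size from the db info reported in this thread
    # Assumption: the database file lives on the same volume as the current directory.
    free = shutil.disk_usage(".").free
    if has_room_to_compact(data_size, free):
        req = build_compact_request("http://localhost:5984", "yandex-packages")
        print("would send:", req.get_method(), req.full_url)
    else:
        print("not enough free space to compact safely")
```

Checking data_size rather than disk_size matters here: a bloated file can report tens of gigabytes on disk while needing only a few hundred megabytes of free space to compact.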



--
This message was sent by Atlassian JIRA
(v6.2#6252)
