couchdb-dev mailing list archives

From "Robert Newson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (COUCHDB-2102) Downstream replicator database bloat
Date Sat, 08 Mar 2014 09:56:43 GMT

    [ https://issues.apache.org/jira/browse/COUCHDB-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13924781#comment-13924781 ]

Robert Newson commented on COUCHDB-2102:
----------------------------------------

The replicator is more efficient at replicating documents without attachments than documents
with them. For docs without attachments, it uses _bulk_docs and sends hundreds of docs at once.
Because it updates in bulk, it generates less garbage, almost as if it were building the
post-compaction structure directly. Docs with attachments are written as separate
multipart/related requests, and more garbage is generated.
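To illustrate the two write paths, here is a minimal sketch only (Python with the requests
library; the URL and document names are made up, and the real replicator also passes
new_edits=false with full revision histories, which is omitted here to keep the example runnable):

    import json
    import requests

    TARGET = "http://localhost:5984/target_db"   # hypothetical replication target

    # Path 1: docs without attachments go to the target hundreds at a time via
    # a single _bulk_docs POST.
    docs = [{"_id": "doc-%d" % i, "value": i} for i in range(500)]
    requests.post(TARGET + "/_bulk_docs", json={"docs": docs})

    # Path 2: a doc with an attachment is written on its own as a
    # multipart/related PUT: first part is the doc JSON (attachment stubs with
    # "follows": true), the following part is the raw attachment body.
    boundary = "abc123"
    doc = {"_id": "doc-with-att",
           "_attachments": {"a.txt": {"follows": True,
                                      "content_type": "text/plain",
                                      "length": 5}}}
    body = ("--%s\r\nContent-Type: application/json\r\n\r\n%s\r\n"
            "--%s\r\n\r\nhello\r\n--%s--"
            % (boundary, json.dumps(doc), boundary, boundary))
    requests.put(TARGET + "/doc-with-att",
                 data=body,
                 headers={"Content-Type": 'multipart/related; boundary="%s"' % boundary})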

The true fix is to make the replicator smarter about attachments so that we can bulk-transfer
groups of them.

All this said, compaction remains necessary (and you can run it at any time during the
replication). Finally, there's room for improvement in general (one of my first tickets,
COUCHDB-220, was on this same theme).
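For anyone following along, triggering compaction mid-replication is just a POST to the
database's _compact endpoint; a rough sketch (Python with requests, local URL assumed):

    import requests

    DB = "http://localhost:5984/target_db"   # hypothetical downstream replica

    # Kick off compaction; the request returns immediately and CouchDB compacts
    # in the background while replication keeps running.
    requests.post(DB + "/_compact", headers={"Content-Type": "application/json"})

    # The db info document shows progress: compact_running flips back to false
    # and disk_size drops once compaction finishes.
    info = requests.get(DB).json()
    print(info["compact_running"], info["disk_size"])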

> Downstream replicator database bloat
> ------------------------------------
>
>                 Key: COUCHDB-2102
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-2102
>             Project: CouchDB
>          Issue Type: Bug
>      Security Level: public(Regular issues) 
>          Components: Replication
>            Reporter: Isaac Z. Schlueter
>
> When I do continuous replication from one db to another, I get a lot of bloat over time.
> For example, replicating a _users db with a relatively low level of writes and around
> 30,000 documents, the size on disk of the downstream replica was over 300MB after 2 weeks.
> I compacted the DB, and the size dropped to about 20MB (slightly smaller than the source
> database).
> Of course, I realize that I can configure compaction to happen regularly.  But this still
> seems like a rather excessive tax.  It is especially shocking to users who are replicating
> a 100GB database full of attachments and find it has grown to 400GB if they're not careful!
> You can easily end up in a situation where you don't have enough disk space to successfully
> compact.
> Is there a fundamental reason why this happens?  Or has it simply never been a priority?
> It'd be awesome if replication were more efficient with disk space.
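
On the "configure compaction to happen regularly" point above: CouchDB 1.2+ ships a compaction
daemon that can do this automatically based on fragmentation thresholds. A rough example of the
relevant local.ini sections (the thresholds and time window here are illustrative, not
recommendations):

    [compaction_daemon]
    check_interval = 300
    min_file_size = 131072

    [compactions]
    _default = [{db_fragmentation, "70%"}, {view_fragmentation, "60%"}, {from, "23:00"}, {to, "05:00"}]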



--
This message was sent by Atlassian JIRA
(v6.2#6252)
