couchdb-dev mailing list archives

From "Robert Newson (JIRA)" <j...@apache.org>
Subject [jira] Commented: (COUCHDB-220) Extreme sparseness in couch files
Date Sun, 05 Apr 2009 19:23:12 GMT

    [ https://issues.apache.org/jira/browse/COUCHDB-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695890#action_12695890 ]

Robert Newson commented on COUCHDB-220:
---------------------------------------

It appears that the .couch file is extended by 64 KB every time a document is added, even when
the document itself is only a few hundred bytes.
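
As a rough sanity check: if each of the ten thousand documents in the issue description costs one 64 KB extension, the file should reach roughly 625 MiB, which is the same order as the 698M that ls reports there. The Python sketch below works through that arithmetic. DOC_SIZE is an assumed payload size, not a measured one.

```python
# Back-of-the-envelope check: file growth if every document append
# extends the file by one fixed-size chunk.
MIN_ALLOC = 0x10000   # 16#10000 in Erlang notation = 65536 bytes (64 KiB)
DOC_COUNT = 10_000    # document count from the issue description
DOC_SIZE = 300        # ASSUMPTION: "a few hundred bytes" per document


def projected_size(doc_count: int, bytes_per_doc: int) -> int:
    """Total bytes appended when each document costs bytes_per_doc."""
    return doc_count * bytes_per_doc


padded = projected_size(DOC_COUNT, MIN_ALLOC)  # one 64 KiB chunk per document
packed = projected_size(DOC_COUNT, DOC_SIZE)   # documents stored back to back

print(f"padded: {padded / 2**20:.0f} MiB")  # padded: 625 MiB
print(f"packed: {packed / 2**20:.1f} MiB")
```

The padded figure does not account for attachments or b-tree updates, so it is only expected to land in the right range, not to match the reported size exactly.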

Chatting with davisp; transcript below:
(18:27:14) davisp: got that test handy so you can run it after a slight tweak to couchdb?
(18:27:46) rnewson: the sparseness one? yep.
(18:27:53) davisp: rnewson: line 41 in couchdb_stream.erl
(18:28:11) davisp: Try changing that from 16#000010000 to 1
(18:28:39) rnewson: min_alloc, yes?
(18:28:39) davisp: Not sure if that'll break things or not
(18:28:44) rnewson: we'll soon know.
(18:28:49) davisp: But I ran across it when reading
(18:28:55) davisp: rnewson: yep on min alloc
(18:29:12) rnewson: yes, that did it.

...

(18:34:14) rnewson: davisp: I'm glad you did, the difference is dramatic, I'd say this is
the cause of the behavior I see.
(18:34:36) davisp: It could be that couch_stream has a bug that's preventing it from using
leftover space
(18:34:43) rnewson: davisp: As I said, I actually hit the ext3 max-file-size with this problem.
(18:35:10) davisp: I.e., the 65K is intended to be used by multiple documents, but the bookkeeping
is saying to constantly create new buffers

...

(19:01:00) davisp: rnewson: It just looks like the buffer state for.... oh dear god
(19:01:11) vmx: davisp: yes i get the idea, and the final output (e.g. in a browsers) seems
to be right, but the internal representation seems a bit confusing
(19:01:20) rnewson: davisp: epiphany?
(19:01:49) davisp: rnewson: I wonder if it's only holding buffer state for the duration of
a single request. Try adding two attachments with the same data

...
(19:10:23) davisp: It looks like a consequence of the necessary code for streaming files that
didn't specify a content-length
(19:10:45) davisp: rnewson: Looks like ensure_buffer needs a flag
(19:13:01) davisp: rnewson: My guess is that you'd want to add a flag in the accumulator on
the PreAllocSize fold function that says if you have touched the clause that has an unknown
length
(19:13:21) davisp: then pass that flag to ensure_buffer and if the flag is true in ensure_buffer
you allocate exactly the specified size.
(19:13:30) davisp: instead of the MinSize bit
(19:13:49) rnewson: makes sense.
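
The suggestion at the end of the transcript can be modelled in a few lines. This is a toy Python model, not the couch_stream code: the function names, the `exact` flag, and the assumption that buffer state is reset on every request are illustrative only, and how the flag would be derived from the PreAllocSize fold is left out.

```python
MIN_SIZE = 0x10000  # 64 KiB minimum allocation discussed in the transcript


def ensure_buffer(remaining: int, needed: int, exact: bool) -> int:
    """Return how many bytes the file must grow to fit a write of `needed` bytes.

    remaining: unused bytes left in the current buffer
    exact:     hypothetical flag; when True, allocate exactly `needed`
               instead of rounding up to MIN_SIZE
    """
    if remaining >= needed:
        return 0  # the current buffer still has room, no growth
    return needed if exact else max(needed, MIN_SIZE)


def file_size_after(requests: list[int], exact: bool) -> int:
    """Total file growth for a series of writes, one write per request."""
    size = 0
    for needed in requests:
        # ASSUMPTION (davisp's guess): buffer state lives only for the
        # duration of a single request, so leftover space is never reused.
        remaining = 0
        size += ensure_buffer(remaining, needed, exact)
    return size


small_docs = [300, 300, 300]  # three small writes, sizes are made up
print(file_size_after(small_docs, exact=False))  # 196608
print(file_size_after(small_docs, exact=True))   # 900
```

With the flag off, every request pays for a full MIN_SIZE buffer; with it on, growth tracks the data actually written.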

> Extreme sparseness in couch files
> ---------------------------------
>
>                 Key: COUCHDB-220
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-220
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Database Core
>    Affects Versions: 0.9
>         Environment: ubuntu 8.10 64-bit, ext3
>            Reporter: Robert Newson
>
> When adding ten thousand documents, each with a small attachment, the discrepancy between
> reported file size and actual file size becomes huge;
> ls -lh shard0.couch
> 698M 2009-01-23 13:42 shard0.couch
> du -sh shard0.couch
> 57M	shard0.couch
> On filesystems that do not support write holes, this will cause an order of magnitude
> more I/O.
> I think it was introduced by the streaming attachment patch as each attachment is followed
> by huge swathes of zeroes when viewed with 'hd -v'.
> Compacting this database reduced it to 7.8 MB, indicating other sparseness besides attachments.
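
The ls/du comparison in the report can also be made programmatically by comparing st_size with st_blocks from stat. A minimal Python sketch, assuming a POSIX system where st_blocks counts 512-byte units; the file name and helper names are made up for the example, and whether the gaps are actually left unallocated depends on the filesystem.

```python
import os
import tempfile


def sizes(path: str) -> tuple[int, int]:
    """Return (apparent, allocated) sizes in bytes for path."""
    st = os.stat(path)
    apparent = st.st_size            # the size `ls -l` reports
    allocated = st.st_blocks * 512   # roughly what `du` reports (POSIX only)
    return apparent, allocated


def make_sparse_file(path: str, chunks: int,
                     payload: bytes = b"x" * 300,
                     chunk_size: int = 0x10000) -> None:
    """Write a small payload at the start of each chunk, leaving gaps between."""
    with open(path, "wb") as f:
        for i in range(chunks):
            f.seek(i * chunk_size)   # skip ahead without writing the gap
            f.write(payload)
        f.truncate(chunks * chunk_size)  # extend to a whole number of chunks


with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.couch")  # hypothetical file name
    make_sparse_file(demo, chunks=100)
    apparent, allocated = sizes(demo)
    print(f"apparent:  {apparent} bytes")
    print(f"allocated: {allocated} bytes")
```

On a filesystem that supports holes, the allocated figure should come out well below the apparent one, mirroring the 698M versus 57M gap above.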

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

