incubator-couchdb-user mailing list archives

From: Paul Davis <paul.joseph.da...@gmail.com>
Subject: Re: Frugal Erlang vs Resources Hungry CouchDB
Date: Thu, 30 Jun 2011 22:45:29 GMT
On Thu, Jun 30, 2011 at 6:39 PM, Zdravko Gligic <zgligic@gmail.com> wrote:
> Robert Newson wrote:
>> CouchDB *must* write an updated btree and an updated header to point
>> to the root of that btree every time you update a document, or it
>> will be lost if couch crashes right then.
>
> So, we have these 3 pieces of info that need to be written with every
> update of a document:
> 1) the btree
> 2) the updated header that points to the root of the btree
> 3) the actual json document itself
>
> If all 3 of these pieces are written to the same physical disk file
> then I will respectfully bail out, as the rest of my question would
> not make much sense, or at least not without major restructuring.
> However, if (1) the btree is in a file of its own and if (2) the
> updated header and (3) the actual json document are written to the
> same file then ...
>

All three are written to the same physical disk file.
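
To make that concrete, here is a rough sketch of how one append-only
file can carry all three pieces. This is Python purely for
illustration; the record tags, the on-disk layout, and the append_doc
helper are my own invention, not CouchDB's actual format. Each update
appends the document body and the rewritten btree nodes, then a fresh
header that records the new root offset; after a crash a reader walks
backwards from the end of the file to the last header that checksums
correctly.

    import json, os, struct, zlib

    # One append-only file carries three kinds of records. These tags and
    # the layout are invented for illustration; CouchDB's real format differs.
    DOC, NODE, HEADER = b"D", b"N", b"H"

    def _append(f, tag, payload):
        """Append one length-prefixed, checksummed record; return its offset."""
        offset = f.seek(0, os.SEEK_END)
        f.write(tag + struct.pack(">IQ", len(payload), zlib.crc32(payload)) + payload)
        return offset

    def append_doc(f, doc, new_btree_nodes):
        """One document update: doc body, rewritten btree nodes, then a header."""
        doc_off = _append(f, DOC, json.dumps(doc).encode())
        root_off = doc_off
        for node in new_btree_nodes:      # nodes rewritten along the update path
            root_off = _append(f, NODE, json.dumps(node).encode())
        # The header points at the new btree root; it is the last thing written.
        _append(f, HEADER, json.dumps({"btree_root": root_off,
                                       "last_doc": doc_off}).encode())
        f.flush()
        os.fsync(f.fileno())

    with open("example.couch-sketch", "ab") as f:
        append_doc(f, {"_id": "doc1", "value": 42},
                   new_btree_nodes=[{"keys": ["doc1"], "child": None}])

Note that every record here shares a single file descriptor per
database, which is the property the rest of this reply hinges on.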

> a) How many of the updated headers are actually useful?  Is it just
> the last successfully written one, or just the last few?
>
> b) If only the last one or the last few headers are actually useful,
> could those updated headers not be kept in a separate (perhaps
> pre-formatted) file, where the header records themselves were re-used
> (perhaps in a ring or some other fashion)?
>
> c) If (a) and (b) make any sense, would one not end up with a
> perfectly compacted DB for at least all of the logging type of use
> cases, where only new records are being created and existing ones are
> never updated or deleted?
>
> d) While (c) might sound like a contrived "use case", I am asking
> mostly to determine what (in addition to dead old revisions and
> deleted docs) is adding to the "bulkiness" of disk usage.  In other
> words, are those "updated headers" one of the major contributing
> factors (if not all of them), and could that be remedied?
>
> Thanks again and regards to everyone,
> teslan
>

There are a lot of places where things could be changed here. I've
spent a good deal of time contemplating ways to make this work better,
including some of these ideas. The issue is that once you add a second
file descriptor per database, you (roughly, with some hand waving)
halve the number of databases that a server can have open and active
at one time. For people hosting a huge number of databases per node
(think tens to hundreds of thousands), this becomes a big issue.
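
As a back-of-the-envelope illustration (every number below is a
hypothetical assumption, not a measurement): if the node's
file-descriptor limit is the binding constraint, going from one file
per database to two roughly halves how many databases can be open at
once.

    # Rough file-descriptor budget; every number here is hypothetical.
    fd_limit = 100_000        # e.g. the node's ulimit -n setting
    reserved = 2_000          # sockets, log files, view indexes, etc.

    for files_per_db in (1, 2):
        open_dbs = (fd_limit - reserved) // files_per_db
        print(f"{files_per_db} file(s) per db -> ~{open_dbs} databases open at once")

    # 1 file(s) per db -> ~98000 databases open at once
    # 2 file(s) per db -> ~49000 databases open at once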

I would also say to keep thinking on this issue. It's always possible
that we find a solution that improves one end of the scale more than
it hurts the opposite end. In that case it's more than possible that
we end up going ahead and making that trade-off.
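
If someone wants to play with the (b) idea concretely, a ring-of-headers
file could look roughly like the sketch below. This is Python purely for
illustration; the slot count, the record layout, and names like
write_header are my assumptions, not anything CouchDB implements. The
idea is a small pre-formatted file with a fixed number of header slots
overwritten in rotation, where a reader takes the newest slot whose
checksum is intact.

    import json, os, struct, zlib

    SLOTS = 4                        # reusable header slots (assumption)
    SLOT_SIZE = 4096                 # fixed size so a slot can be rewritten in place
    PREFIX = struct.Struct(">QIQ")   # sequence number, payload length, crc32

    def write_header(f, seq, header):
        """Overwrite slot (seq % SLOTS) with a checksummed header record."""
        payload = json.dumps(header).encode()
        record = PREFIX.pack(seq, len(payload), zlib.crc32(payload)) + payload
        assert len(record) <= SLOT_SIZE, "header too large for a slot"
        f.seek((seq % SLOTS) * SLOT_SIZE)
        f.write(record.ljust(SLOT_SIZE, b"\0"))
        f.flush()
        os.fsync(f.fileno())

    def read_latest_header(f):
        """Return (seq, header) for the newest slot whose checksum is intact."""
        best = None
        for slot in range(SLOTS):
            f.seek(slot * SLOT_SIZE)
            raw = f.read(SLOT_SIZE)
            if len(raw) < PREFIX.size:
                continue
            seq, length, crc = PREFIX.unpack(raw[:PREFIX.size])
            payload = raw[PREFIX.size:PREFIX.size + length]
            if length == 0 or len(payload) != length or zlib.crc32(payload) != crc:
                continue
            if best is None or seq > best[0]:
                best = (seq, json.loads(payload))
        return best

The catch, as above, is that this separate file costs every open
database a second file descriptor.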

HTH,
Paul Davis
