couchdb-user mailing list archives

From Konrad Förstner <kon...@foerstner.org>
Subject Re: Why do my CouchDB databases grow so fast?
Date Tue, 01 Jun 2010 08:43:46 GMT
On Mon May 31, 2010 at 06:02:56PM -0700, Randall Leeds wrote:
>I don't think so, but check his script on github. Here's a link to the lines.
>http://github.com/konrad/couchdb-benchmarking/blob/master/couchdb-benchmarking.rb#L84

Many thanks for your explanations, Randall and Nick. I am still
astonished by the amount of disk space this underlying data handling
needs, but I guess I have to live with it.

Cheers

Konrad

>On Mon, May 31, 2010 at 17:39, Nicholas Orr <nicholas.orr@zxgen.net> wrote:
>> Ahh yes, I see that now...
>>
>> Is that what the faint grey line at the end of those graphs represents?
>> I actually didn't notice it the first time; I just thought the black
>> line going up was it...
>>
>> On Tue, Jun 1, 2010 at 10:13 AM, Randall Leeds <randall.leeds@gmail.com> wrote:
>>> If you look at the script the compaction is only performed at the end
>>> and not on each iteration.
>>>
>>> On Mon, May 31, 2010 at 16:58, Nicholas Orr <nicholas.orr@zxgen.net> wrote:
>>>> This all makes sense, except the OP says a compaction step is being
>>>> performed.
>>>> A compaction is essentially a copy/paste/delete/rename operation, so the
>>>> on-disk size should be fairly constant, as the data copied is just the
>>>> info required, isn't it?
>>>>
>>>> Nick
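
(For reference: compaction is requested through CouchDB's HTTP API and runs
in the background, which is also why measuring the size too early gives the
pre-compaction number, the "timing issue" Konrad mentions further down. A
minimal sketch; the database name "benchmarking" is hypothetical, while
POST /{db}/_compact and GET /{db} are CouchDB's documented endpoints:

    # Trigger compaction and wait for it to finish before measuring.
    import time
    import requests

    BASE = "http://localhost:5984"
    DB = "benchmarking"

    # The compaction request must carry a JSON content type; it returns
    # 202 Accepted immediately and the work happens in the background.
    resp = requests.post(f"{BASE}/{DB}/_compact",
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()

    # Poll the db info document until compact_running goes back to false.
    while requests.get(f"{BASE}/{DB}").json().get("compact_running"):
        time.sleep(0.5)

    print(requests.get(f"{BASE}/{DB}").json()["disk_size"])  # post-compaction size

)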
>>>>
>>>> On Tue, Jun 1, 2010 at 9:39 AM, Randall Leeds <randall.leeds@gmail.com> wrote:
>>>>
>>>>> Hi Konrad,
>>>>>
>>>>> I'll take a stab at this and if I'm wrong hopefully someone will correct
>>>>> me.
>>>>>
>>>>> The on-disk B-tree is written in an append-only fashion rather than
>>>>> modified in place. Append-only updates mean that every inner node of
>>>>> the B-tree along the path from the root to the new update has to be
>>>>> re-written each time. Initially, when there are very few inner nodes,
>>>>> the amount of disk space used for each new update is relatively
>>>>> constant. Since the tree has a large fan-out, the depth does not change
>>>>> much at first. In the second graph you are seeing a tree that has a
>>>>> depth of 1 (just the root) being written over and over again to disk,
>>>>> and the corresponding expected linear growth results. However, when
>>>>> you have a higher revision limit the old revisions are kept in the
>>>>> tree and the tree grows taller and fatter with each update. As you
>>>>> make more updates, more inner nodes need to be rewritten for each
>>>>> update, which causes the growth to accelerate. Eventually, you hit the
>>>>> revision limit, old revisions are discarded, the tree stops getting
>>>>> any taller or fatter, and the number of inner nodes that need to be
>>>>> changed for each update remains relatively constant (but greater than
>>>>> in the case of rev_limit=1). I suspect that the first graph becomes
>>>>> linear above 1000 updates and does not continue to accelerate.
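
(A rough way to see this effect is a toy model counting the bytes appended
per update as the tree deepens. The fan-out, node size, and document size
below are made-up numbers, not CouchDB's actual on-disk format; the model
only illustrates the root-to-leaf rewrite described above:

    # Toy model of append-only B-tree growth; all constants are invented.
    import math

    FANOUT = 4        # assumed keys per inner node
    NODE_SIZE = 100   # assumed bytes per rewritten node
    DOC_SIZE = 200    # assumed bytes per document revision

    def appended_bytes(n_updates, revs_limit):
        """Cumulative bytes appended after n_updates updates of one document."""
        total = 0
        for i in range(1, n_updates + 1):
            keys = min(i, revs_limit)  # revisions beyond the limit drop out
            # Every update rewrites the whole root-to-leaf path once.
            depth = max(1, math.ceil(math.log(keys, FANOUT)))
            total += depth * NODE_SIZE + DOC_SIZE
        return total

    for n in (100, 400, 800, 1200):
        print(n, appended_bytes(n, 1000), appended_bytes(n, 1))

With revs_limit=1 the depth stays at 1 and growth is linear; with
revs_limit=1000 the path gets longer until the limit is hit, then growth
turns linear again at a steeper per-update cost.)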
>>>>>
>>>>> Cheers,
>>>>> Randall
>>>>>
>>>>> 2010/5/31 Konrad Förstner <konrad@foerstner.org>:
>>>>> > Hi,
>>>>> >
>>>>> > I have an issue with CouchDB and posted the question on stackoverflow
>>>>> > [1] but did not get any helpful answer. It would be great if somebody
>>>>> > could answer it here or on stackoverflow (there I also had a problem
>>>>> > with the compaction, which was just a timing issue, as explained in
>>>>> > the comment).
>>>>> >
>>>>> > I was wondering why my CouchDB database was growing so fast, so I
>>>>> > wrote a little test script [2]. This script changes an attribute of a
>>>>> > CouchDB document 1200 times and takes the size of the database after
>>>>> > each change. After performing these 1200 writing steps the database
>>>>> > does a compaction step and the db size is measured again. In the end
>>>>> > the script plots the database size against the revision numbers. The
>>>>> > benchmarking (sketched below) is run twice:
>>>>> >
>>>>> > * The first time the default number of document revisions (=1000) is
>>>>> >   used (_revs_limit).
>>>>> >
>>>>> > * The second time the number of document revisions is set to 1.
>>>>> >
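
(A minimal sketch of that loop over CouchDB's HTTP API; the actual script
[2] is written in Ruby, and "benchmarking" and "doc1" are hypothetical
names, while the endpoints used, including PUT /{db}/_revs_limit, are
CouchDB's documented API:

    # Sketch of the benchmark loop described above.
    import requests

    BASE = "http://localhost:5984"
    DB = "benchmarking"

    requests.put(f"{BASE}/{DB}")                           # create the database
    requests.put(f"{BASE}/{DB}/_revs_limit", data="1000")  # "1" for the second run

    sizes = []
    rev = None
    for i in range(1200):
        doc = {"counter": i}                               # change one attribute
        if rev:
            doc["_rev"] = rev                              # updates must name the current revision
        rev = requests.put(f"{BASE}/{DB}/doc1", json=doc).json()["rev"]
        sizes.append(requests.get(f"{BASE}/{DB}").json()["disk_size"])

    print(sizes[::100])                                    # growth of the db file over the run

)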
>>>>> > The first run produces the following plot:
>>>>> > http://www.flickr.com/photos/konradfoerstner/4656011444/
>>>>> >
>>>>> > The second run produces this plot:
>>>>> > http://www.flickr.com/photos/konradfoerstner/4656012732/
>>>>> >
>>>>> > For me this is quite unexpected behavior. In the first run I would
>>>>> > have expected linear growth, as every change produces a new
>>>>> > revision. Once the 1000 revisions are reached, the size should be
>>>>> > constant, as the older revisions are discarded.
>>>>> >
>>>>> > In the second run the first revision should result in a certain
>>>>> > database size that is then kept during the following writing steps,
>>>>> > as every new revision leads to the deletion of the previous one.
>>>>> >
>>>>> > I could understand if there were a little bit of overhead needed to
>>>>> > manage the changes, but this growth behavior seems weird to me. Can
>>>>> > anybody explain this phenomenon or correct the assumptions that led
>>>>> > to my wrong expectations?
>>>>> >
>>>>> > Many thanks in advance
>>>>> >
>>>>> > Konrad
>>>>> >
>>>>> > [1]
>>>>> http://stackoverflow.com/questions/2921151/why-do-my-couchdb-databases-grow-so-fast
>>>>> > [2] http://github.com/konrad/couchdb-benchmarking
>>>>> >
>>>>>
>>>>
>>>
>>
