From: Peter Schuller
Sender: scode@scode.org
To: user@cassandra.apache.org
Reply-To: user@cassandra.apache.org
Subject: Re: Cassandra disk space utilization WAY higher than I would expect
Date: Thu, 5 Aug 2010 20:51:56 +0200

Oh and,

> Nodetool cleanup works so beautifully, that I am wondering if there is any harm
> in using "nodetool cleanup" in a cron job on a live system that is actively
> processing reads and writes to the database?

Since a cleanup/compact is supposed to trigger a full compaction, it generates a lot of I/O that is normally deferred. Doing it nightly won't scale in the general case (suppose you have terabytes of data...). But assuming the compaction does not have a sufficiently detrimental effect on your application, and your data size is small enough that a compaction finishes within a reasonable amount of time, I don't see a problem with it.

That said, since cleanup by design permanently removes data not belonging to the node, it could be dangerous in the event that there is somehow confusion over which data the nodes are responsible for. A regular compaction should not, as far as I know, ever remove data not belonging to the node. So I can certainly see a danger there; personally I'd probably want to avoid automating 'cleanup' for that reason.

But if everything works reasonably, it still seems to me that you should not be seeing extreme waste of disk space. If you truly need to compact nightly to save space, you might be running too close to your maximum disk capacity anyway. That said, if you're still seeing extreme amounts of "extra" disk space, I don't think there is yet an explanation for that in this thread.

Also, the variation in disk space in your most recent post looks entirely as expected to me, and nothing really extreme. The temporary disk space occupied during the compact/cleanup can easily be as high as your original disk space usage to begin with. And the fact that you're reaching the 5-7 GB per node level after a cleanup has completed fully and all obsolete sstables have been removed does not necessarily help you, since each future cleanup/compaction will typically double your disk space anyway (even if temporarily).

--
/ Peter Schuller
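(If you do automate it anyway: since a cleanup/compaction can temporarily need roughly as much free space as the node currently uses, a cron job could guard the nodetool call with a headroom check first. The sketch below is only an illustration of that arithmetic; the data directory path is hypothetical and the actual `nodetool cleanup` invocation is left commented out -- adapt both to your install.)

```shell
#!/bin/sh
# Sketch only: DATA_DIR is a placeholder; point it at your Cassandra
# data directory (e.g. /var/lib/cassandra/data) before using this.
DATA_DIR=/tmp

# Kilobytes currently occupied under the data directory.
used_kb=$(du -sk "$DATA_DIR" 2>/dev/null | awk '{print $1}')

# Free kilobytes on the filesystem holding the data directory
# (df -P guarantees one portable output line per filesystem).
free_kb=$(df -Pk "$DATA_DIR" | awk 'NR==2 {print $4}')

# During compaction old and new sstables coexist until the obsolete
# ones are removed, so require at least as much free space as is used.
if [ "${free_kb:-0}" -gt "${used_kb:-0}" ]; then
    echo "enough headroom: would run 'nodetool -h localhost cleanup'"
    # nodetool -h localhost cleanup
else
    echo "insufficient headroom: skipping cleanup"
fi
```

A guard like this does not address the other risk mentioned above (cleanup removing data when ring ownership is confused); it only avoids starting a compaction the disk cannot absorb.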