Subject: Large data files and no "edit in place"?
From: Julian Simon
To: user@cassandra.apache.org
Date: Tue, 30 Mar 2010 16:54:18 +1100

Forgive me as I'm probably a little out of my depth in trying to assess
this particular design choice within Cassandra, but...

My understanding is that Cassandra never updates data "in place" on disk
- instead it completely re-creates the data files during a "flush". Stop
me if I'm wrong already ;-)

So imagine we have a large data set in our ColumnFamily and we're
constantly adding data to it. Every [x] minutes or [y] bytes, the
compaction process is triggered and the entire data set is written to
disk.

So as our data set grows over time, compaction will result in an
increasingly large IO operation to write all that data to disk each
time. We could easily be talking about single data files in the
many-gigabyte range, no? Or is there a file size limit that I'm not
aware of?

If not, is this an efficient approach for large data sets? It seems like
we would become awfully IO-bound, writing the entire thing from scratch
each time.

Do let me know if I've gotten it all wrong ;-)

Cheers,
Jules
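
P.S. To make my (quite possibly wrong) mental model concrete, here's a
toy Java sketch of what I *think* the write path looks like. All the
class and method names below are my own invention for illustration -
this is not Cassandra's actual code. In this model the flush() step
only writes out a memtable's worth of data; it's the compact() step
that rewrites everything, and that's the part I'm worried about:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    // Toy log-structured store: hypothetical, for discussion only.
    class SketchStore {
        // Writes land in an in-memory, sorted "memtable" first.
        private NavigableMap<String, String> memtable =
                new TreeMap<String, String>();
        // Each flush appends a new immutable "data file"; existing
        // files are never edited in place.
        private final List<NavigableMap<String, String>> dataFiles =
                new ArrayList<NavigableMap<String, String>>();
        private static final int FLUSH_THRESHOLD = 4; // tiny, for demo

        void put(String key, String value) {
            memtable.put(key, value);
            if (memtable.size() >= FLUSH_THRESHOLD) {
                flush();
            }
        }

        // Flush writes only the current memtable contents as one new
        // file - cost proportional to the memtable, not the whole
        // data set.
        void flush() {
            dataFiles.add(memtable);
            memtable = new TreeMap<String, String>();
        }

        // Compaction merges the existing files into one, keeping the
        // newest value per key. THIS is the step whose cost grows
        // with the total data set size.
        void compact() {
            NavigableMap<String, String> merged =
                    new TreeMap<String, String>();
            for (NavigableMap<String, String> file : dataFiles) {
                merged.putAll(file); // later files win: they're newer
            }
            dataFiles.clear();
            dataFiles.add(merged);
        }

        // Reads check the memtable, then the files, newest first.
        String get(String key) {
            if (memtable.containsKey(key)) {
                return memtable.get(key);
            }
            for (int i = dataFiles.size() - 1; i >= 0; i--) {
                String v = dataFiles.get(i).get(key);
                if (v != null) {
                    return v;
                }
            }
            return null;
        }
    }

If that's roughly right, then my question boils down to: what bounds
the size of the single merged file that compact() produces?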