cassandra-user mailing list archives

From "Takayuki Tsunakawa" <>
Subject Re: [Q] MapReduce behavior and Cassandra's scalability for petabytes of data
Date Mon, 25 Oct 2010 02:09:08 GMT
Hello, Jonathan,

Thank you for your kind reply. Could you give me some more information?

From: "Jonathan Ellis" <>
> (b) Cassandra generates input splits from the sampling of keys each
> node has in memory.  So if a node does end up with no data for a
> keyspace (because of bad OOP balancing for instance) it will have no
> splits generated or mapped.

I understand you are referring to StorageService.getSplits(). This
seems to filter out the Cassandra nodes that have no data for the
target (keyspace, column family) pair.

(Q1) I also understand that ColumnFamilyInputFormat sends this
split-generation (and node-filtering) request to all nodes in the
cluster. Is this correct?
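To make sure I understand the mechanism being discussed, here is a rough sketch (not Cassandra's actual code; the function and data shapes are my own illustration) of how splits could be derived from each node's in-memory key samples, so that a node with no data for the column family contributes no splits:

```python
# Hypothetical illustration of sample-based split generation.
# token_ranges: list of (start_token, end_token, node) tuples.
# key_samples_by_node: node -> sorted key samples held in memory
#                      for the target column family.
def generate_splits(token_ranges, key_samples_by_node):
    splits = []
    for start, end, node in token_ranges:
        samples = key_samples_by_node.get(node, [])
        if not samples:
            # Node holds no data for this column family
            # (e.g. bad OOP balancing): no splits generated.
            continue
        # Split boundaries come from the sampled keys.
        boundaries = [start] + samples + [end]
        for lo, hi in zip(boundaries, boundaries[1:]):
            splits.append((lo, hi, node))
    return splits
```

For example, with two token ranges where only node "a" has sampled keys, only node "a" yields splits and node "b" is skipped entirely.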

If the answer to Q1 is yes, then more nodes mean a higher MapReduce
job startup cost (for executing InputFormat.getSplits()). Do you have
any performance numbers for this startup time? I'd like to know how
long it takes when the cluster consists of hundreds of nodes.

Going back to my first mail, I'm wondering whether the present
Cassandra is suitable for analyzing petabytes of data.
How much data is the 400-node cluster Riptano is planning aimed at?
If each node has 4 TB of disk and the replication factor is 3, a
simple calculation gives 4 TB * 400 / 3 = 533 TB (ignoring the commit
log, OS areas, etc.).
With the current architecture, roughly how many nodes, and how much
data, is the practical limit?
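For reference, the back-of-the-envelope capacity calculation above, written out (the 4 TB per node, 400 nodes, and replication factor 3 are the assumptions stated in this mail; commit log and OS overhead are ignored):

```python
# Usable capacity estimate for the hypothetical cluster above.
disk_per_node_tb = 4      # assumed disk per node
nodes = 400               # planned cluster size mentioned above
replication_factor = 3    # each row stored on 3 nodes

usable_tb = disk_per_node_tb * nodes / replication_factor
# -> approximately 533 TB, i.e. well short of a petabyte
```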

Takayuki Tsunakawa
