cassandra-user mailing list archives

From: Jonathan Ellis
Subject: Re: [Q] MapReduce behavior and Cassandra's scalability for petabytes of data
Date: Fri, 22 Oct 2010 13:48:01 GMT
On Fri, Oct 22, 2010 at 3:30 AM, Takayuki Tsunakawa wrote:
> Yes, I meant one map task would be sent to each task tracker, resulting in
> 1,000 concurrent map tasks in the cluster. ColumnFamilyInputFormat cannot
> identify the nodes that actually hold some data, so the job tracker will
> send the map tasks to all of the 1,000 nodes. This is wasteful and
> time-consuming if only 200 nodes hold some data for a keyspace.

(a) Normally the data in each keyspace is spread across every node in
the cluster.  This is what you want for best parallelism.

(b) Cassandra generates input splits from the sample of keys that each
node holds in memory.  So if a node does end up with no data for a
keyspace (because of bad OPP, i.e. OrderPreservingPartitioner,
balancing for instance), no splits will be generated for it and no map
tasks will be sent to it.
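
To make (b) concrete, here is a toy model of the idea, not the actual
ColumnFamilyInputFormat code.  The function name generate_splits, the
keys_per_split parameter, and the node names are all made up for
illustration; the only point is that splits are derived from each
node's sampled keys, so a node with an empty sample yields no splits.

```python
from typing import Dict, List, Tuple

# (node, first_key, last_key): a hypothetical, simplified split record.
Split = Tuple[str, str, str]


def generate_splits(sampled_keys: Dict[str, List[str]],
                    keys_per_split: int = 2) -> List[Split]:
    """Build splits from each node's in-memory key sample (toy model)."""
    splits: List[Split] = []
    for node, keys in sorted(sampled_keys.items()):
        ordered = sorted(keys)
        # A node with no sampled keys never enters this loop,
        # so it contributes no splits and gets no map tasks.
        for i in range(0, len(ordered), keys_per_split):
            chunk = ordered[i:i + keys_per_split]
            splits.append((node, chunk[0], chunk[-1]))
    return splits


# Assumed sample data: node-b holds nothing for this keyspace.
samples = {
    "node-a": ["apple", "banana", "cherry"],
    "node-b": [],
    "node-c": ["kiwi"],
}
splits = generate_splits(samples)
for split in splits:
    print(split)
```

In this sketch node-b never appears in the output, which is the
behavior described above: map tasks follow the splits, not the full
list of nodes in the cluster.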

Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
