Thank you for all the info (especially the
pointers, which look interesting).
> So you would not have 1,000 tasks
> sent to each of the 1,000 cassandra nodes.
Yes, I meant that one map task would be sent to
each task tracker, resulting in 1,000 concurrent map tasks in the cluster.
ColumnFamilyInputFormat cannot identify which nodes actually hold the data,
so the job tracker will send map tasks to all 1,000 nodes. This is
wasteful and time-consuming if, say, only 200 nodes hold data for a given job.
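To make the concern concrete, here is a toy sketch (not the real Cassandra/Hadoop API; all names are hypothetical) of how locality hints on input splits change which nodes the scheduler must target:

```python
# Conceptual sketch: if an InputFormat cannot report which nodes hold a
# split's data, the scheduler must consider every node in the cluster;
# with location hints it can target only the nodes that actually matter.

def candidate_nodes(splits, all_nodes):
    """Return the set of nodes the job tracker must schedule tasks on."""
    targets = set()
    for split in splits:
        hints = split.get("locations")
        # No locality hint: every node is a candidate for this split.
        targets.update(hints if hints else all_nodes)
    return targets

all_nodes = [f"node{i}" for i in range(1000)]
# Suppose only 200 nodes actually hold data for this job.
splits_with_hints = [{"locations": [f"node{i}"]} for i in range(200)]
splits_no_hints = [{"locations": None} for _ in range(200)]

print(len(candidate_nodes(splits_with_hints, all_nodes)))  # 200
print(len(candidate_nodes(splits_no_hints, all_nodes)))    # 1000
```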
> When the task runs on the cassandra
> node it will iterate through all of the rows in the specified ColumnFamily with
> keys in the Token range the Node is responsible for.
I hope ColumnFamilyInputFormat will
allow us to set a KeyRange to select which rows are passed to map.
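The behaviour hoped for above could look something like this sketch (hypothetical; `key_range_scan` is not a real Cassandra call, and key-ordered filtering like this assumes an order-preserving partitioner):

```python
# Sketch: restrict the rows handed to map() to a key range, instead of
# scanning the entire ColumnFamily.

def key_range_scan(rows, start_key, end_key):
    """Yield only rows whose key falls in [start_key, end_key]."""
    for key, columns in sorted(rows.items()):
        if start_key <= key <= end_key:
            yield key, columns

# Toy data: one row per day, keyed by date string.
rows = {f"2010-10-{d:02d}": {"hits": d} for d in range(1, 31)}

# Only last week's rows would reach the mappers.
selected = list(key_range_scan(rows, "2010-10-15", "2010-10-21"))
print(len(selected))  # 7
```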
I'll read the web pages you gave me. Thanks.
All, any other advice and comments are appreciated.
----- Original Message -----
Friday, October 22, 2010 4:05 PM
Subject: Re: [Q] MapReduce behavior and
Cassandra's scalability for petabytes of data
I'll try to answer your questions; others, please jump in if I'm wrong.
1. Data in a keyspace will be distributed to all nodes in the cassandra
cluster. AFAIK the Job Tracker should only send one task to each task tracker,
and normally you would have a task tracker running on each cassandra node. The
task tracker can then throttle how many concurrent tasks can run. So you would
not have 1,000 tasks sent to each of the 1,000 cassandra nodes.
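The per-tracker throttle mentioned above is a standard Hadoop setting (for the Hadoop 0.20/1.x era); for example, in mapred-site.xml:

```xml
<!-- mapred-site.xml: cap concurrent map tasks on each task tracker -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>
```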
When the task runs on the cassandra node it will iterate through all of
the rows in the specified ColumnFamily with keys in the Token range the Node is
responsible for. If cassandra is using the RandomPartitioner, data will be spread
around the cluster. So, for example, a Map-Reduce job that only wants to read
the last week's data may have to read from every node. Obviously this depends on
how the data is broken up between rows / columns.
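A quick sketch of why RandomPartitioner scatters time-ordered keys (the node-placement logic below is a toy model, not Cassandra's actual ring logic; RandomPartitioner does use an MD5-derived token):

```python
import hashlib

def token(key: str) -> int:
    """MD5-derived token for a row key, as RandomPartitioner does."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def owner(key: str, num_nodes: int) -> int:
    # Toy placement: token modulo node count (not the real ring assignment).
    return token(key) % num_nodes

# Row keys for last week's data, one per day.
days = [f"2010-10-{d:02d}" for d in range(16, 23)]
owners = {day: owner(day, 1000) for day in days}

# Consecutive days land on unrelated nodes, so a "last week" job
# cannot be confined to a contiguous slice of the ring.
print(sorted(set(owners.values())))
```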
Hope that helps.