I may be wrong about which nodes the task is sent to.  

Others here know more about Hadoop integration.

Aaron
  

On 22 Oct 2010, at 21:30, Takayuki Tsunakawa <tsunakawa.takay@jp.fujitsu.com> wrote:

Hello, Aaron,
 
Thank you for much info (especially pointers that seem interesting).
 
> So you would not have 1,000 tasks sent to each of the 1,000 cassandra nodes.
 
Yes, I meant one map task would be sent to each task tracker, resulting in 1,000 concurrent map tasks across the cluster. ColumnFamilyInputFormat cannot identify which nodes actually hold data, so the job tracker will send map tasks to all 1,000 nodes. This is wasteful and time-consuming if only 200 of the nodes hold data for the keyspace.
 
> When the task runs on the cassandra node it will iterate through all of the rows in the specified ColumnFamily with keys in the Token range the Node is responsible for.
 
I hope the ColumnFamilyInputFormat will allow us to set KeyRange to select rows passed to map.
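The kind of server-side row selection being hoped for above might look something like this sketch (plain Python; the function and parameter names are hypothetical illustrations, not the real ColumnFamilyInputFormat API):

```python
# Illustrative sketch only: shows how an input format that accepted a
# key range could skip rows before handing them to the map function.
# All names here are hypothetical, not the actual Cassandra/Hadoop API.

def rows_in_key_range(all_rows, start_key, end_key):
    """Yield only rows whose key falls inside [start_key, end_key)."""
    for key, columns in all_rows:
        if start_key <= key < end_key:
            yield key, columns

rows = [("2010-10-15", {"hits": 10}),
        ("2010-10-22", {"hits": 7}),
        ("2010-09-01", {"hits": 3})]

# Only the October rows would reach the map function.
selected = list(rows_in_key_range(rows, "2010-10-01", "2010-10-31"))
```

With an OrderPreservingPartitioner this kind of range selection is cheap because keys are stored in order; with the RandomPartitioner a key range does not correspond to a contiguous token range, which is part of why the filtering is not straightforward.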
 
I'll read the web pages you gave me. Thank you.
All, any other advice and comment is appreciated.
 
Regards,
Takayuki Tsunakawa
 
----- Original Message -----
From: aaron morton
To: user@cassandra.apache.org
Sent: Friday, October 22, 2010 4:05 PM
Subject: Re: [Q] MapReduce behavior and Cassandra's scalability for petabytes of data
 

For plain old log analysis the Cloudera Hadoop distribution may be a better match. Flume is designed to help with streaming data into HDFS, the LZO compression extensions would help with the data size, and Pig would make the analysis easier (IMHO).
http://www.cloudera.com/hadoop/
http://www.cloudera.com/blog/2010/09/using-flume-to-collect-apache-2-web-server-logs/
http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/
 

I'll try to answer your questions, others please jump in if I'm wrong.
 

1. Data in a keyspace will be distributed to all nodes in the Cassandra cluster. AFAIK the job tracker should only send one task to each task tracker, and normally you would have a task tracker running on each Cassandra node. The task tracker can then throttle how many concurrent tasks can run. So you would not have 1,000 tasks sent to each of the 1,000 Cassandra nodes.
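A quick back-of-the-envelope sketch of the point above (plain Python; assumes one token range and one task tracker per node, and uses the Hadoop 0.20 slot-limit property name for illustration):

```python
# Sketch: with one input split per token range and one task tracker per
# node, the job tracker schedules N map tasks cluster-wide in total,
# not N tasks per node. The per-tracker slot limit (e.g. the Hadoop
# property mapred.tasktracker.map.tasks.maximum) throttles how many of
# a node's assigned tasks run concurrently.

nodes = 1000
splits_per_node = 1        # one token range per node
map_slots_per_tracker = 2  # per-tracker concurrency limit

total_map_tasks = nodes * splits_per_node
concurrent_per_node = min(splits_per_node, map_slots_per_tracker)
```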
 

When the task runs on the Cassandra node it will iterate through all of the rows in the specified ColumnFamily with keys in the token range the node is responsible for. If Cassandra is using the RandomPartitioner, data will be spread around the cluster. So, for example, a MapReduce job that only wants to read the last week's data may have to read from every node. Obviously this depends on how the data is broken up between rows / columns.
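To illustrate why time-ordered keys end up on every node under the RandomPartitioner, here is a small Python sketch (token assignment simplified to a modulus over a handful of nodes; the real partitioner assigns each node a contiguous MD5 token range, but the effect on date-like keys is the same):

```python
import hashlib

# Sketch: RandomPartitioner derives a token from an MD5 hash of the row
# key, so consecutive dates hash to unrelated tokens and land on
# different nodes. A job reading "last week" then touches most nodes.

def token(key):
    """Token of a row key: its MD5 digest as a big integer."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def owner(key, num_nodes):
    # Simplified ownership rule: node i owns tokens with
    # token % num_nodes == i (real clusters use token ranges).
    return token(key) % num_nodes

week = ["2010-10-%02d" % d for d in range(16, 23)]
owners = {owner(day, 4) for day in week}  # which of 4 nodes hold the week
```

The same reasoning in reverse is why an OrderPreservingPartitioner keeps a date range on few nodes, at the cost of hot spots.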
 
 
 

2. Some of the other people from riptano.com or Rackspace may be able to help with Cassandra's outer limits. There is a 400-node cluster planned: http://www.riptano.com/blog/riptano-and-digital-reasoning-form-partnership
 

Hope that helps.
Aaron