cassandra-user mailing list archives

From aaron morton
Subject Re: [Q] MapReduce behavior and Cassandra's scalability for petabytes of data
Date Fri, 22 Oct 2010 07:05:14 GMT
For plain old log analysis the Cloudera Hadoop distribution may be a better match. Flume is
designed to help with streaming data into HDFS, the LZO compression extensions would help
with the data size, and Pig would make the analysis easier (IMHO).

I'll try to answer your questions, others please jump in if I'm wrong.

1. Data in a keyspace will be distributed to all nodes in the Cassandra cluster. AFAIK the
Job Tracker should only send one task to each task tracker, and normally you would have a
task tracker running on each Cassandra node. The task tracker can then throttle how many concurrent
tasks can run. So you would not have 1,000 tasks sent to each of the 1,000 Cassandra nodes.
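
To put rough numbers on the throttling, here is a small Python sketch. The task count, tracker
count and slots per tracker are made-up figures for illustration, and `map_waves` is a
hypothetical helper, not part of Hadoop. In Hadoop of this era the per-tracker limit is normally
set with the `mapred.tasktracker.map.tasks.maximum` property.

```python
import math

def map_waves(total_tasks, task_trackers, slots_per_tracker):
    """Return (max concurrent map tasks, number of waves needed).

    Assumes every task tracker has the same number of map slots and
    that all tasks take about the same time (both are simplifications).
    """
    concurrent = task_trackers * slots_per_tracker
    waves = math.ceil(total_tasks / concurrent)
    return concurrent, waves

# Assumed figures: 5,000 map tasks, 1,000 task trackers, 2 map slots each.
concurrent, waves = map_waves(5000, 1000, 2)
print(concurrent, waves)
```

However many map tasks the job has, each node only runs as many at once as it has slots; the
rest wait for a later wave.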

When the task runs on the Cassandra node it will iterate through all of the rows in the specified
ColumnFamily with keys in the token range the node is responsible for. If Cassandra is using
the RandomPartitioner, data will be spread around the cluster. So, for example, a Map-Reduce
job that only wants to read the last week's data may have to read from every node. Obviously
this depends on how the data is broken up between rows / columns.
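
A minimal Python sketch of why that happens, assuming one row per hour keyed by a timestamp
string. The RandomPartitioner derives a token from an MD5 hash of the row key; the 8-node ring
with evenly spaced tokens, the key format and the helper names are all assumptions for
illustration, not Cassandra code.

```python
import hashlib
from bisect import bisect_left
from datetime import datetime, timedelta

NUM_NODES = 8                      # assumed cluster size
RING_SIZE = 2 ** 127               # RandomPartitioner token space
# Assumed: evenly spaced tokens; node i owns (token[i-1], token[i]].
NODE_TOKENS = [i * RING_SIZE // NUM_NODES for i in range(NUM_NODES)]

def token_for(key):
    """MD5 the key and take the absolute value of the signed 128-bit result."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return abs(int.from_bytes(digest, "big", signed=True))

def owner_of(token):
    """Index of the first node whose token is >= the key's token (wraps to 0)."""
    return bisect_left(NODE_TOKENS, token) % NUM_NODES

# One hypothetical row key per hour for a single week of logs.
start = datetime(2010, 10, 15)
week_keys = [(start + timedelta(hours=h)).strftime("%Y-%m-%dT%H")
             for h in range(7 * 24)]

nodes_touched = {owner_of(token_for(k)) for k in week_keys}
print(sorted(nodes_touched))
```

Although the keys are contiguous in time, hashing scatters them over the ring, so the set of
owning nodes should cover the whole cluster.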

2. Some of the other people from or Rackspace may be able to help with Cassandra's
outer limits. There is a 400-node cluster planned.
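
For the node count in the original question, a back-of-the-envelope check in Python. The
1 PB total and the replication factor of 3 are my assumptions (the question only says
"petabytes" and 2 to 4 TB of disk per server), and `nodes_needed` is a hypothetical helper.
It ignores compaction headroom and any compression.

```python
import math

TB_PER_PB = 1000  # decimal units

def nodes_needed(raw_pb, disk_tb_per_node, replication_factor):
    """Nodes required to hold raw_pb of data, counting every replica."""
    total_tb = raw_pb * TB_PER_PB * replication_factor
    return math.ceil(total_tb / disk_tb_per_node)

# Assumed: 1 PB of raw logs, replication factor 3.
for disk_tb in (2, 4):
    print(disk_tb, "TB/node ->", nodes_needed(1, disk_tb, 3), "nodes")
```

Under those assumptions the result is in the same ballpark as the "about 1,000 nodes or more"
estimate in the question.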

Hope that helps. 

On 22 Oct 2010, at 15:45, Takayuki Tsunakawa wrote:

> Hello,
> I'm evaluating whether Cassandra fits a certain customer well. The
> customer will collect petabytes of logs and analyze them. Could you
> tell me if my understanding is correct and/or give me your opinions?
> I'm sorry that the analysis requirement is not clear yet.
> 1. MapReduce behavior
> I read the source code of Cassandra 0.6.x and understood that
> jobtracker submits the map tasks to all Cassandra nodes, regardless of
> whether the target keyspace's data reside there. That is, if there are
> 1,000 nodes in the Cassandra cluster, jobtracker sends more than 1,000
> map tasks to all of the 1,000 nodes in parallel. If this is correct,
> I'm afraid the startup time of a MapReduce job gets longer as more
> nodes join the Cassandra cluster.
> Is this correct?
> With HBase, jobtracker submits map tasks only to the region servers
> that hold the target data. This behavior is desirable because no
> wasteful task submission is done. Can you suggest the cases where
> Cassandra+MapReduce is better than HBase+MapReduce for log/sensor
> analysis? (Please excuse me for my not presenting the analysis
> requirement).
> 2. Data capacity
> The excerpt from the paper about Amazon Dynamo says that the cluster
> can scale to hundreds of nodes, not thousands. I understand Cassandra
> is similar. Assuming that the recent commodity servers have 2 to 4 TB
> of disks, we need about 1,000 nodes or more to store petabytes of
> data.
> Is the present Cassandra suitable for petabytes of data? If not, is
> any development in progress to increase the scalability?
> --------------------------------------------------
> Finally, Dynamo adopts a full membership model where each node is
> aware of the data hosted by its peers. To do this, each node actively
> gossips the full routing table with other nodes in the system. This
> model works well for a system that contains couple of hundreds of
> nodes. However, scaling such a design to run with tens of thousands of
> nodes is not trivial because the overhead in maintaining the routing
> table increases with the system size. This limitation might be
> overcome by introducing hierarchical extensions to Dynamo. Also, note
> that this problem is actively addressed by O(1) DHT systems(e.g.,
> [14]).
> --------------------------------------------------
> Regards,
> Takayuki Tsunakawa
