From: AJ
Date: Thu, 09 Jun 2011 02:23:02 -0600
To: user@cassandra.apache.org
Subject: Ideas for Big Data Support

[Please feel free to correct me on anything or suggest other workarounds that could be employed now to help.]

Hello,

This is purely theoretical, as I don't have a big working cluster yet and am still in the planning stages. But from what I understand, while Cass scales well horizontally, EACH node will not handle a data store in the terabyte range well, for understandable reasons such as simple hardware and bandwidth limitations. Looking forward and pushing the envelope, though, I think there might be ways to at least manage these issues until broadband speeds, disk, and memory technology catch up.

The biggest issues with big-data clusters that I am currently aware of are:

> Disk I/O problems during major compactions and repairs.
> Bandwidth limitations during new-node commissioning.

Here are a few ideas I've thought of:

1.) Load-balancing: During a major compaction, repair, or other similarly severe performance-impacting process, allow the node to broadcast that it is temporarily unavailable so requests for data can be sent to other nodes in the cluster. The node could still "wake up" and pause or cancel its compaction in the case of a failed node, whereby there are no other nodes that can provide the requested data. The node could be considered "degraded" by the other nodes, rather than down.
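To make this a little more concrete, here is a minimal client-side sketch of the routing policy. The "degraded" flag and the self-reported load numbers are hypothetical, not an existing Cassandra API; treat it as an illustration of the idea, nothing more.

# Hypothetical sketch of idea #1: route reads away from replicas that have
# announced they are "degraded" (e.g. mid-compaction or mid-repair), falling
# back to them only when no healthy replica is left.  Names are made up.

import random
from dataclasses import dataclass

@dataclass
class NodeStatus:
    address: str
    degraded: bool   # node broadcast "I'm compacting/repairing, avoid me if possible"
    load: float      # hypothetical self-reported load, 0.0 (idle) to 1.0 (saturated)

def pick_replica(replicas):
    """Healthy nodes first, least loaded wins; a degraded node is used only
    as a last resort (still better than failing the request)."""
    healthy = [r for r in replicas if not r.degraded]
    candidates = healthy or replicas       # fall back to degraded nodes if that's all we have
    return min(candidates, key=lambda r: (r.load, random.random()))

# Example: two healthy replicas and one in the middle of a major compaction.
replicas = [
    NodeStatus("10.0.0.1", degraded=True,  load=0.9),
    NodeStatus("10.0.0.2", degraded=False, load=0.4),
    NodeStatus("10.0.0.3", degraded=False, load=0.7),
]
print(pick_replica(replicas).address)      # -> 10.0.0.2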
(As a matter of fact, a general load-balancing scheme could be devised if each node broadcast its current load level and maybe even the hop count between data centers.)

2.) Data Transplants: Since commissioning a new node that is due to receive data in the TB range could take days or weeks of data transfer, it would be much more efficient to just courier the data. Perhaps the SSTables (maybe from a snapshot) could be transplanted from one production node into a new node to help jump-start the bootstrap process. The new node could sort things out during the bootstrapping phase so that it ends up balanced correctly, as if it had started out with no data as usual. If this could cut the bandwidth used in half, that would be a great benefit. However, this would work well mostly if the transplanted data came from a keyspace that used a random partitioner; coming from an ordered partitioner may not be so helpful, since the rows in the transplanted data might never be used on the new node.

3.) Strategic Partitioning: Of course, there are surely other issues to contend with, such as RAM requirements for caching purposes. That may be managed by a partitioning strategy that allows certain nodes to specialize in a certain subset of the data, such as geographically or whatever the designer chooses. Replication would still be done as usual, but this may help the cache be better utilized by allowing it to focus on the subset of data that comprises the majority of the node's data, versus a random sampling of the entire cluster. IOW, while a node may specialize in a certain subset and also contain replicated rows from outside that subset, it will still only (mostly) be queried for data from within its subset, so the cache will hold mostly data from that special subset, which could increase the cache hit rate. This may not be a huge help for TB-sized data nodes, since even 32 GB of RAM would still be relatively tiny in comparison to the data size, but I include it in case it spurs other ideas. Also, I do not know how Cass decides which node to query for data in the first place... so maybe not the best idea.

4.) Compressed Columns: Some sort of data compression of certain columns could be very helpful, especially since text can be compressed to less than 50% of its original size if the conditions are right. Overall native disk compression will not help the bandwidth issue, since the data would be decompressed before transit. If the data were stored compressed, then Cass could even send it to the client compressed, so as to offload the decompression to the client. Likewise, during node commissioning, the data would never have to be decompressed, saving on CPU and BW. Alternately, a client could tell Cass to decompress the data before transmit if needed. This, combined with idea #2 (transplants), could help speed up new-node bootstrapping, but only when a large portion of the data consists of very large column values and thus compression is practical and efficient. Of course, the client could handle all the compression today without Cass even knowing about it, so building this into Cass would be just a convenience, but still nice to have, nonetheless.
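Since the client can already do this today, here is a minimal sketch of what client-side compression of a large text column might look like, assuming a pycassa-style ColumnFamily object (cf.insert / cf.get); the column and key names are made up for illustration.

# Client-side compression of large text columns (idea #4).  Cassandra just
# stores opaque bytes; the small MAGIC header lets a reader tell compressed
# values from plain ones.

import zlib

MAGIC = b"ZLIB1"

def compress_value(text, threshold=256):
    """Compress a text value if it is large enough to be worth it."""
    raw = text.encode("utf-8")
    if len(raw) < threshold:
        return raw                                     # tiny values: not worth the CPU
    packed = MAGIC + zlib.compress(raw, 6)
    return packed if len(packed) < len(raw) else raw   # keep whichever is smaller

def decompress_value(blob):
    """Reverse of compress_value(); plain values pass through untouched."""
    if blob.startswith(MAGIC):
        return zlib.decompress(blob[len(MAGIC):]).decode("utf-8")
    return blob.decode("utf-8")

# Hypothetical usage with a pycassa-style column family:
#   cf.insert(row_key, {"body": compress_value(big_text)})
#   big_text = decompress_value(cf.get(row_key)["body"])

The obvious trade-off is that the server only ever sees opaque bytes, so anything that would need to read the value server-side (e.g. a secondary index on that column) is off the table.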
5.) Postponed Major Compactions: The option to postpone auto-triggered major compactions until a pre-defined time of day or week, or until staff can run them manually.

6.?) Finally, some have suggested just using more nodes with less data storage per node, which may solve most if not all of these problems. But I'm still fuzzy on that. The trade-offs would be more infrastructure and maintenance costs, a higher chance that some server in the cluster will fail at any given time... maybe higher bandwidth between nodes due to the larger cluster??? I need more clarity on this alternative. Imagine a total data size of 100 TB and the choice between 200 nodes or 50. What is the cost of more nodes, all other things being equal?

Please contribute additional ideas and strategies/patterns for the benefit of all!

Thanks for listening and keep up the good work, guys!
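P.S. As a rough back-of-envelope on that 100 TB example (the 1 Gbit/s per-node link speed below is just an assumption, and replication overhead is ignored):

# Idea #6 back-of-envelope: 100 TB over 50 vs. 200 nodes, and roughly how long
# it takes to re-stream one node's share over a single assumed 1 Gbit/s link.

TOTAL_TB = 100.0
LINK_GBIT_PER_S = 1.0                        # assumed usable bandwidth per node

for nodes in (50, 200):
    per_node_tb = TOTAL_TB / nodes
    hours = (per_node_tb * 8e12) / (LINK_GBIT_PER_S * 1e9) / 3600
    print(f"{nodes:>3} nodes: {per_node_tb:.1f} TB/node, ~{hours:.1f} h raw transfer")

#  50 nodes: 2.0 TB/node, ~4.4 h of raw transfer (far longer in practice once
#            compaction, validation, and shared links are factored in)
# 200 nodes: 0.5 TB/node, ~1.1 h

So more, smaller nodes mostly buys a smaller per-node blast radius (less data to re-stream or repair when one box dies), at the price of more machines to buy and babysit.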