From: AJ
Date: Thu, 09 Jun 2011 02:23:02 -0600
To: user@cassandra.apache.org
Subject: Ideas for Big Data Support

[Please feel free to correct me on anything or suggest other workarounds that could be employed now to help.]

Hello,

This is purely theoretical, as I don't have a big working cluster yet and am still in the planning stages. But from what I understand, while Cass scales well horizontally, EACH node will not handle a data store in the terabyte range well, for understandable reasons such as simple hardware and bandwidth limitations. Looking forward and pushing the envelope, though, I think there might be ways to at least manage these issues until broadband speeds, disk, and memory technology catch up.

The biggest issues with big-data clusters that I am currently aware of are:

> Disk I/O problems during major compactions and repairs.
> Bandwidth limitations during new-node commissioning.

Here are a few ideas I've thought of:

1.) Load-balancing: During a major compaction, repair, or other similarly severe performance-impacting process, allow the node to broadcast that it is temporarily unavailable so requests for data can be sent to other nodes in the cluster. The node could still "wake up" and pause or cancel its compaction in the case of a failed node, whereby there are no other nodes that can provide the requested data. The node could be considered "degraded" by the other nodes, rather than down.
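To make this a little more concrete, here is a minimal client-side sketch of the routing policy. The "degraded" flag and the self-reported load numbers are hypothetical, not an existing Cassandra API; treat it as an illustration of the idea, nothing more.

# Hypothetical sketch of idea #1: route reads away from replicas that have
# announced they are "degraded" (e.g. mid-compaction or mid-repair), falling
# back to them only when no healthy replica is left.  Names are made up.

import random
from dataclasses import dataclass

@dataclass
class NodeStatus:
    address: str
    degraded: bool   # node broadcast "I'm compacting/repairing, avoid me if possible"
    load: float      # hypothetical self-reported load, 0.0 (idle) to 1.0 (saturated)

def pick_replica(replicas):
    """Healthy nodes first, least loaded wins; a degraded node is used only
    as a last resort (still better than failing the request)."""
    healthy = [r for r in replicas if not r.degraded]
    candidates = healthy or replicas       # fall back to degraded nodes if that's all we have
    return min(candidates, key=lambda r: (r.load, random.random()))

# Example: two healthy replicas and one in the middle of a major compaction.
replicas = [
    NodeStatus("10.0.0.1", degraded=True,  load=0.9),
    NodeStatus("10.0.0.2", degraded=False, load=0.4),
    NodeStatus("10.0.0.3", degraded=False, load=0.7),
]
print(pick_replica(replicas).address)      # -> 10.0.0.2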
(As a matter of fact, a general load-balancing scheme could be devised if each node broadcast its current load level and maybe even the hop count between data centers.)

2.) Data Transplants: Since commissioning a new node that is due to receive data in the TB range could take days or weeks of data transfer, it would be much more efficient to just courier the data. Perhaps the SSTables (maybe from a snapshot) could be transplanted from one production node into a new node to help jump-start the bootstrap process. The new node could sort things out during the bootstrapping phase so that it ends up balanced correctly, as if it had started out with no data as usual. If this could cut the bandwidth used in half, that would be a great benefit. However, this would work well mostly if the transplanted data came from a keyspace that used a random partitioner; coming from an ordered partitioner may not be so helpful, since the rows in the transplanted data might never be used on the new node.

3.) Strategic Partitioning: Of course, there are surely other issues to contend with, such as RAM requirements for caching purposes. That may be managed by a partitioning strategy that allows certain nodes to specialize in a certain subset of the data, such as geographically or whatever the designer chooses. Replication would still be done as usual, but this may help the cache be better utilized by allowing it to focus on the subset of data that comprises the majority of the node's data, versus a random sampling of the entire cluster. IOW, while a node may specialize in a certain subset and also contain replicated rows from outside that subset, it will still only (mostly) be queried for data from within its subset, so the cache will hold mostly data from that special subset, which could increase the cache hit rate. This may not be a huge help for TB-sized data nodes, since even 32 GB of RAM would still be relatively tiny in comparison to the data size, but I include it in case it spurs other ideas. Also, I do not know how Cass decides which node to query for data in the first place... so maybe not the best idea.

4.) Compressed Columns: Some sort of data compression of certain columns could be very helpful, especially since text can be compressed to less than 50% of its original size if the conditions are right. Overall native disk compression will not help the bandwidth issue, since the data would be decompressed before transit. If the data were stored compressed, then Cass could even send it to the client compressed, so as to offload the decompression to the client. Likewise, during node commissioning, the data would never have to be decompressed, saving on CPU and BW. Alternately, a client could tell Cass to decompress the data before transmit if needed. This, combined with idea #2 (transplants), could help speed up new-node bootstrapping, but only when a large portion of the data consists of very large column values and thus compression is practical and efficient. Of course, the client could handle all the compression today without Cass even knowing about it, so building this into Cass would be just a convenience, but still nice to have, nonetheless.
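Since the client can already do this today, here is a minimal sketch of what client-side compression of a large text column might look like, assuming a pycassa-style ColumnFamily object (cf.insert / cf.get); the column and key names are made up for illustration.

# Client-side compression of large text columns (idea #4).  Cassandra just
# stores opaque bytes; the small MAGIC header lets a reader tell compressed
# values from plain ones.

import zlib

MAGIC = b"ZLIB1"

def compress_value(text, threshold=256):
    """Compress a text value if it is large enough to be worth it."""
    raw = text.encode("utf-8")
    if len(raw) < threshold:
        return raw                                     # tiny values: not worth the CPU
    packed = MAGIC + zlib.compress(raw, 6)
    return packed if len(packed) < len(raw) else raw   # keep whichever is smaller

def decompress_value(blob):
    """Reverse of compress_value(); plain values pass through untouched."""
    if blob.startswith(MAGIC):
        return zlib.decompress(blob[len(MAGIC):]).decode("utf-8")
    return blob.decode("utf-8")

# Hypothetical usage with a pycassa-style column family:
#   cf.insert(row_key, {"body": compress_value(big_text)})
#   big_text = decompress_value(cf.get(row_key)["body"])

The obvious trade-off is that the server only ever sees opaque bytes, so anything that would need to read the value server-side (e.g. a secondary index on that column) is off the table.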
5.) Postponed Major Compactions: The option to postpone auto-triggered major compactions until a pre-defined time of day or week, or until staff can run them manually.

6.?) Finally, some have suggested just using more nodes with less data storage per node, which may solve most if not all of these problems. But I'm still fuzzy on that. The trade-offs would be more infrastructure and maintenance costs, a higher chance that some server in the cluster will fail at any given time... maybe higher bandwidth between nodes due to the larger cluster??? I need more clarity on this alternative. Imagine a total data size of 100 TB and the choice between 200 nodes or 50. What is the cost of more nodes, all other things being equal?

Please contribute additional ideas and strategies/patterns for the benefit of all!

Thanks for listening and keep up the good work, guys!
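P.S. As a rough back-of-envelope on that 100 TB example (the 1 Gbit/s per-node link speed below is just an assumption, and replication overhead is ignored):

# Idea #6 back-of-envelope: 100 TB over 50 vs. 200 nodes, and roughly how long
# it takes to re-stream one node's share over a single assumed 1 Gbit/s link.

TOTAL_TB = 100.0
LINK_GBIT_PER_S = 1.0                        # assumed usable bandwidth per node

for nodes in (50, 200):
    per_node_tb = TOTAL_TB / nodes
    hours = (per_node_tb * 8e12) / (LINK_GBIT_PER_S * 1e9) / 3600
    print(f"{nodes:>3} nodes: {per_node_tb:.1f} TB/node, ~{hours:.1f} h raw transfer")

#  50 nodes: 2.0 TB/node, ~4.4 h of raw transfer (far longer in practice once
#            compaction, validation, and shared links are factored in)
# 200 nodes: 0.5 TB/node, ~1.1 h

So more, smaller nodes mostly buys a smaller per-node blast radius (less data to re-stream or repair when one box dies), at the price of more machines to buy and babysit.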