Date: Sun, 19 Sep 2010 11:53:54 -0600
Subject: Re: a few generic questions
From: Scott Mann
To: user@cassandra.apache.org

Hi Mario,

I'll take a shot at answering a few of these, but mostly, at this point,
I'd recommend looking at the available documentation. Start at
http://wiki.apache.org/cassandra/FrontPage. More comments below.

> Removal of data:
> If I delete data from my cluster will there over time be nodes that
> will have more/less data than the average node?
> Will it lead to an imbalanced distribution of data or will Cassandra move
> some data between nodes to keep them evenly used?

To effectively answer this question, you must learn about the available
partitioning schemes in Cassandra. Start with
http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/.
This will give you some background on the way that Cassandra spreads its
data around the cluster.

After that, start looking through the articles and email archives for
"tombstone." Basically, when a row is deleted, it is not removed right
away; it is marked with a tombstone. After a period of time
(GCGraceSeconds in storage-conf.xml in v0.6.x - search for this, too),
the tombstones will be removed. Once you've got that down, you'll need to
learn about "compaction" and "repair" (I don't mean read-repair, in this
case, but you should learn about that also).
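To make the tombstone idea a bit more concrete, here is a toy sketch in Python. It is not Cassandra code: ToyStore, compact(), and GC_GRACE_SECONDS are names made up for illustration, and the grace period value is just an assumption for the example. The only behavior taken from the description above is that a delete writes a marker, and the marker is dropped only once the grace period has passed.

```python
# Toy model of tombstones; NOT Cassandra's implementation.
# ToyStore, compact() and GC_GRACE_SECONDS are hypothetical names.

GC_GRACE_SECONDS = 864000  # assumed grace period (10 days) for this sketch


class ToyStore:
    def __init__(self):
        self.rows = {}        # key -> value for live rows
        self.tombstones = {}  # key -> time at which the row was deleted

    def put(self, key, value):
        # Writing a row again clears any earlier deletion marker (toy rule).
        self.rows[key] = value
        self.tombstones.pop(key, None)

    def delete(self, key, now):
        # A delete does not simply forget the row; it records a marker.
        self.rows.pop(key, None)
        self.tombstones[key] = now

    def compact(self, now):
        # Drop only the markers that are at least GC_GRACE_SECONDS old.
        expired = [k for k, t in self.tombstones.items()
                   if now - t >= GC_GRACE_SECONDS]
        for k in expired:
            del self.tombstones[k]
        return expired


store = ToyStore()
store.put("row1", "a")
store.delete("row1", now=0)
early = store.compact(now=3600)             # grace period not over yet
late = store.compact(now=GC_GRACE_SECONDS)  # grace period has passed
print(early, late)
```

The first compact() call leaves the marker in place and the second one removes it, which is the reason a delete does not free space on a node immediately.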
Well, I've actually deviated a bit from your question, but you should
read as much as you can find about all those terms.

> -----
> Server-Load:
> If I have a small portion of data that is read very often which is
> unfortunately on the same node.
> Will this lead to an unbalanced Server-Load or will Cassandra distribute
> data also based on how often it is accessed?

In v0.6.x, I do not know of an automated way to manage this, but check
out "nodetool move." You should also come to an understanding of
"nodetool loadbalance," although I don't think it is what you want here.

> There is this comment on the auto_bootstrap documentation:
> (If no InitialToken is specified, they will pick one such that they will get
> half the range of the most-loaded node.)
> Does this mean the CPU Load or data load/storage?

This is talking about how tokens are assigned to a new node in a
Cassandra cluster - so data load/storage. These tokens are what Cassandra
uses to sort out which node holds which data. Basically, if you set
autobootstrap to true without an initial token, then depending upon the
partitioning scheme (discussed above), a token will be selected from
about the middle of the range owned by the node with the most data. This
may or may not do what you want and it may or may not load balance things
in an appropriate way. Tokens can always be assigned manually. See the
documentation around the nodetool arguments move, loadbalance,
removetoken, and ring.

> -----
> Node down:
> If I have a node that went down and took all its data with it.
> Will a new node with auto_bootstrap true replace it or do I need to
> specify the token of the lost node?

Autobootstrap doesn't have anything to do with replacing the data of a
lost node...see above and go read some more about it. What you are
interested in is "replication factor," referred to as ReplicationFactor
in storage-conf.xml.
Set your replication factor to the number of copies of the entire
dataset that you want the cluster to keep; that determines how much
redundancy you have. The other element associated with this is something
called "consistency level" (aka ConsistencyLevel). Search for these terms
as well during your research of Cassandra. Specifically, if you have
ReplicationFactor set to 1, and a node goes down, you lose the data on
that node. If you have ReplicationFactor set to 2, and a node goes down,
you still have a copy of the dead node's data in the cluster. See
http://wiki.apache.org/cassandra/Operations for more information about
data replication.

I realize that most of my comments have been "go look for this term" and
"go read about this," but I am speaking from experience when I say that
this is really the most effective way to learn about Cassandra. Oh, and
the other thing, forget everything you think you know about relational
databases - Cassandra offers a completely different model.

> Thank you in advance for your help,
>  Mario

Hope it helps.
-Scott
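
One rough way to picture the ReplicationFactor point above is a toy ring in Python. This is only a sketch under assumptions: it uses the simplest placement, where a key goes to the node that owns its token and then to the next nodes around the ring, and the node names, token values, and function names are all invented for illustration. None of it is Cassandra's API.

```python
# Toy ring to illustrate ReplicationFactor; NOT Cassandra's code.
# Node names, tokens, and function names are hypothetical.
import bisect
import zlib

RING = [(0, "node-a"), (64, "node-b"), (128, "node-c"), (192, "node-d")]
TOKENS = [token for token, _ in RING]


def token_for(key):
    # Map the key onto a small 0..255 token space (toy partitioner).
    return zlib.crc32(key.encode("utf-8")) % 256


def replicas(key, replication_factor):
    # First replica: the first node whose token is >= the key's token,
    # wrapping around to the start of the ring. Further replicas: the
    # next nodes in ring order.
    start = bisect.bisect_left(TOKENS, token_for(key)) % len(RING)
    return [RING[(start + i) % len(RING)][1]
            for i in range(replication_factor)]


def readable_after_failure(key, replication_factor, dead_node):
    # The key survives if at least one of its replicas is still up.
    return any(node != dead_node
               for node in replicas(key, replication_factor))


keys = ["user%d" % i for i in range(100)]
for rf in (1, 2):
    lost = [k for k in keys if not readable_after_failure(k, rf, "node-b")]
    print("ReplicationFactor=%d: %d of %d keys lost" % (rf, len(lost), len(keys)))
```

With a replication factor of 1, every key whose only replica is the dead node is gone; with 2, each key has a second copy on a different node, so losing a single node loses nothing in this toy model.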