Date: Sun, 19 Sep 2010 11:53:54 -0600
Message-ID: <AANLkTi=ERZCgz5s7BK2VgfnCkWJ6Exr3CXTSwwB0Rs5s@mail.gmail.com>
Subject: Re: a few generic questions
From: Scott Mann <sdmnix@gmail.com>
To: user@cassandra.apache.org

Hi Mario,

I'll take a shot at answering a few of these, but mostly at this
point, I'd recommend looking at the available documentation. Start at
http://wiki.apache.org/cassandra/FrontPage.

More comments below.

>
> Removal of data:
> If I delete data from my cluster, will there over time be nodes that
> will have more/less data than the average node?
> Will it lead to an imbalanced distribution of data or will Cassandra move
> some data between nodes to keep them evenly used?

To effectively answer this question, you must learn about the
available partitioning schemes in Cassandra. Start with
http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/.
This will give you some background on the way that Cassandra spreads
its data around the cluster. After that, start looking through the
articles and email archives for "tombstone." Basically, when a row is
deleted, it is not removed right away; it is specially marked with a
tombstone. After a period of time (GCGraceSeconds in storage-conf.xml
in v0.6.x - search for this, too), the tombstones become eligible for
removal at the next compaction. Once you've got that down, you'll need
to learn about "compaction" and "repair" (I don't mean read-repair, in
this case, but you should learn about that also). Well, I've actually
deviated a bit from your question, but you should read as much as you
can find about all those terms.
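
To make the tombstone timing a little more concrete, here is a minimal
Python sketch of the rule I described. It is only my own illustration,
not Cassandra code: the function name and timestamps are made up, and
the 864000-second (10 day) value is what I believe the v0.6.x default
for GCGraceSeconds to be, so treat it as an assumption.

```python
# Toy model of tombstone purging; an illustration, not Cassandra code.
GC_GRACE_SECONDS = 864000  # assumed value: 10 days, in seconds


def tombstone_purgeable(deleted_at, now, gc_grace_seconds=GC_GRACE_SECONDS):
    """Return True once a tombstone written at deleted_at (seconds) is
    older than the grace period, so a compaction may drop it."""
    return now - deleted_at > gc_grace_seconds


one_day = 24 * 60 * 60
print(tombstone_purgeable(deleted_at=0, now=one_day))       # inside the grace period
print(tombstone_purgeable(deleted_at=0, now=11 * one_day))  # past the grace period
```

The practical upshot for your question: deleted data keeps taking up
disk space until a compaction runs after the grace period, so a node's
disk usage does not drop right after a delete.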

> -----
> Server-Load:
> If I have a small portion of data that is read very often and is
> unfortunately all on the same node,
> will this lead to an unbalanced Server-Load or will Cassandra distribute
> data also based on how often it is accessed?

In v0.6.x, I do not know of an automated way to manage this, but
check out "nodetool move." You should also come to an understanding of
"nodetool loadbalance," although I don't think it is what you want
here.

> There is this comment on the auto_bootstrap documentation:
> (If no InitialToken is specified, they will pick one such that they will get
> half the range of the most-loaded node.)
> Does this mean the CPU Load or data load/storage?

This is talking about how tokens are assigned to a new node in a
Cassandra cluster - so data load/storage. These tokens are what
Cassandra uses to sort out which node holds which data. Basically, if
you set autobootstrap to true without an initial token, then depending
upon the partitioning scheme (discussed above), a token will be
selected from about the middle of the range owned by the node with
the most data. This may or may not do what you want and it may or may
not load balance things in an appropriate way. Tokens can always be
assigned manually.
See the documentation around nodetool arguments move, loadbalance,
removetoken, and ring.
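
If you do assign tokens by hand, the usual approach with
RandomPartitioner is to space them evenly around the ring. The Python
sketch below shows the arithmetic. It assumes RandomPartitioner's token
space of 0 to 2**127, and the function names are mine, purely for
illustration. The second function is a simplified picture of "half the
range of the most-loaded node": it just takes the numeric midpoint of a
range, whereas the real bootstrap chooses a token based on the data the
node actually holds.

```python
RING_SIZE = 2 ** 127  # RandomPartitioner token space (assumption noted above)


def balanced_tokens(node_count):
    """Evenly spaced initial tokens for node_count nodes."""
    return [i * RING_SIZE // node_count for i in range(node_count)]


def range_midpoint(previous_token, node_token):
    """Numeric midpoint of the range (previous_token, node_token].

    Handles ranges that wrap past the end of the ring; if both tokens
    are equal, the range is treated as the whole ring."""
    width = (node_token - previous_token) % RING_SIZE or RING_SIZE
    return (previous_token + width // 2) % RING_SIZE


# Candidate InitialToken values for a hypothetical four-node cluster.
for token in balanced_tokens(4):
    print(token)
```

Each printed value is a candidate InitialToken for one node; an existing
node can be shifted to a new token with "nodetool move."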

> -----
> Node down:
> If I have a node that went down and took all its data with it,
> will a new node with auto_bootstrap set to true replace it or do I need to
> specify the token of the lost node?

Autobootstrap doesn't have anything to do with replacing the data of a
lost node...see above and go read some more about it. What you are
interested in is "replication factor," referred to as
ReplicationFactor in storage-conf.xml. Set the replication factor to
the number of copies of your data that you want in the cluster; that
determines how much redundancy you have. The other element
associated with this is something called "consistency level" (aka
ConsistencyLevel). Search for these terms as well during your
research of Cassandra.

Specifically, if you have ReplicationFactor set to 1, and a node goes
down, you lose the data on that node. If you have ReplicationFactor
set to 2, and a node goes down, you still have a copy of the dead
node's data in the cluster. See
http://wiki.apache.org/cassandra/Operations for more information about
data replication.
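
Here is a small Python sketch of why ReplicationFactor matters when a
node dies. It assumes the simplest placement, where the first replica
goes to the node that owns the key's token and the remaining replicas
go to the next nodes clockwise around the ring (that is my
understanding of RackUnawareStrategy in v0.6.x). The node names,
tokens, and function names are made up for the example.

```python
import bisect


def replicas_for(key_token, ring, replication_factor):
    """Return the nodes holding a key under simple clockwise placement.

    ring is a list of (token, node) pairs sorted by token. The owner is
    the first node whose token is >= key_token, wrapping to the start of
    the ring; the other replicas are the next nodes clockwise."""
    tokens = [token for token, _ in ring]
    start = bisect.bisect_left(tokens, key_token) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


def survivors(key_token, ring, replication_factor, dead_node):
    """Replicas of a key that are still alive after dead_node is lost."""
    replicas = replicas_for(key_token, ring, replication_factor)
    return [node for node in replicas if node != dead_node]


# Hypothetical four-node ring with made-up tokens.
ring = [(0, "node-a"), (25, "node-b"), (50, "node-c"), (75, "node-d")]

print(survivors(30, ring, 1, "node-c"))  # ReplicationFactor 1: nothing left
print(survivors(30, ring, 2, "node-c"))  # ReplicationFactor 2: one copy remains
```

With ReplicationFactor 2, the surviving copy is what lets the cluster
keep serving that data and rebuild the lost node's share.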

I realize that most of my comments have been "go look for this term"
and "go read about this," but I am speaking from experience when I say
that this is really the most effective way to learn about Cassandra.
Oh, and the other thing, forget everything you think you know about
relational databases - Cassandra offers a completely different model.
>
> Thank you in advance for your help,
>  Mario

Hope it helps.
-Scott