cassandra-user mailing list archives

From Scott Mann <sdm...@gmail.com>
Subject Re: a few generic questions
Date Sun, 19 Sep 2010 19:40:18 GMT
On Sun, Sep 19, 2010 at 11:53 AM, Scott Mann <sdmnix@gmail.com> wrote:
> Hi Mario,
>
> I'll take a shot at answering a few of these, but mostly at this
> point, I'd recommend looking at the available documentation. Start at
> http://wiki.apache.org/cassandra/FrontPage.
>
> More comments below.
>
>>
>> Removal of data:
>> If I delete data from my cluster, will there over time be nodes that
>> will have more/less data than the average node?
>> Will it lead to an imbalanced distribution of data or will Cassandra move
>> some data between nodes to keep them evenly used?
>
> To effectively answer this question, you must learn about the
> available partitioning schemes in Cassandra. Start with
> http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/.
> This will give you some background on the way that Cassandra spreads
> its data around the cluster. After that, start looking through the
> articles and email archives for "tombstone." Basically, when a row is
> deleted, it is not removed right away but marked with a tombstone.
> After a period of time (GCGraceSeconds in storage-conf.xml in v0.6.x -
> search for this, too), the tombstones become eligible for removal
> during compaction. Once you've got that down, you'll need
> to learn about "compaction" and "repair" (I don't mean read-repair, in
> this case, but you should learn about that also). Well, I've actually
> deviated a bit from your question, but you should read as much as you
> can find about all those terms.
>
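To make the tombstone idea concrete, here is a rough Python sketch. It is
not Cassandra code: the dict-based store, the function names, and the
864000-second grace value are all assumptions made up for illustration.
It only shows the general shape: a delete writes a marker, and a later
compaction pass drops markers older than the grace period.

```python
import time

# Assumed value (10 days) for illustration; check your own storage-conf.xml.
GC_GRACE_SECONDS = 864000


def delete_row(store, key, now=None):
    """Mark a row as deleted by writing a tombstone instead of removing it."""
    now = time.time() if now is None else now
    store[key] = {"tombstone": True, "deleted_at": now}


def compact(store, now=None, gc_grace=GC_GRACE_SECONDS):
    """Return a new store without tombstones older than the grace period."""
    now = time.time() if now is None else now
    return {
        key: row
        for key, row in store.items()
        if not (row.get("tombstone") and now - row["deleted_at"] > gc_grace)
    }
```

Note that the deleted row still takes up space until a compaction runs
after the grace period, which is why deletes do not shrink a node right
away.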
>> -----
>> Server-Load:
>> Suppose I have a small portion of data that is read very often and
>> that unfortunately sits on a single node.
>> Will this lead to an unbalanced server load, or will Cassandra also
>> distribute data based on how often it is accessed?
>
> In v0.6.x, I do not know of an automated way to manage this, but
> check out "nodetool move." You should also come to an understanding of
> "nodetool loadbalance," although I don't think it is what you want
> here.
>
>> There is this comment in the auto_bootstrap documentation:
>> (If no InitialToken is specified, they will pick one such that they will get
>> half the range of the most-loaded node.)
>> Does this mean the CPU Load or data load/storage?
>
> This is talking about how tokens are assigned to a new node in a
> Cassandra cluster - so data load/storage. These tokens are what
> Cassandra uses to sort out which node holds which data. Basically, if
> you set autobootstrap to true without an initial token, then depending
> upon the partitioning scheme (discussed above), a token will be
> selected from about the middle of the range owned by the node with the
> most data. This may or may not do what you want and it may or may not
> load balance things in an appropriate way. Tokens can always be
> assigned manually.
> See the documentation around nodetool arguments move, loadbalance,
> removetoken, and ring.
>
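A simplified Python sketch of the "half the range of the most-loaded
node" idea follows. Everything in it is an assumption for illustration:
the function name, the dict inputs, and the choice of the plain midpoint
of the token range. The real bootstrap logic may pick the token
differently, so treat this as a picture of the concept only.

```python
# Illustrative sketch only; not Cassandra's actual bootstrap code.
RING_SIZE = 2**127  # token space assumed for RandomPartitioner


def pick_bootstrap_token(ring, load):
    """Pick a token that halves the range of the most-loaded node.

    ring: dict of node name -> token
    load: dict of node name -> amount of data stored
    """
    # Sort nodes by token so each node's predecessor on the ring is known.
    ordered = sorted(ring.items(), key=lambda item: item[1])
    busiest = max(load, key=load.get)
    index = [name for name, _ in ordered].index(busiest)
    end = ordered[index][1]
    start = ordered[index - 1][1]  # index -1 wraps around to the last node
    # Width of the range (start, end], allowing for wrap-around past zero.
    width = (end - start) % RING_SIZE
    if width == 0:
        width = RING_SIZE  # a single node owns the whole ring
    return (start + width // 2) % RING_SIZE
```

For example, with tokens {"a": 0, "b": 40, "c": 100} and "b" holding the
most data, the new node would land at token 20 and take over the range
(0, 20] from "b".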
>> -----
>> Node down:
>> Suppose a node went down and took all its data with it.
>> Will a new node with auto_bootstrap set to true replace it, or do I need to
>> specify the token of the lost node?
>
> Autobootstrap doesn't have anything to do with replacing the data of a
> lost node...see above and go read some more about it. What you are
> interested in is "replication factor," referred to as
> ReplicationFactor in storage-conf.xml. Set your replication factor to
> the number of copies of the dataset you want the cluster to keep; that
> determines how much redundancy you have. The other element
> associated with this is something called "consistency level" (a.k.a.
> ConsistencyLevel). Search for these terms as well during your
> research of Cassandra.
>
> Specifically, if you have ReplicationFactor set to 1, and a node goes
> down, you lose the data on that node. If you have ReplicationFactor
> set to 2, and a node goes down, you still have a copy of the dead
> node's data in the cluster. See
> http://wiki.apache.org/cassandra/Operations for more information about
> data replication.
>
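The ReplicationFactor 1 versus 2 case can be sketched in a few lines of
Python. This assumes the simplest placement rule, where the extra copies
go to the next nodes around the ring; the function names and the small
integer tokens are hypothetical, and a real cluster's placement depends
on the configured replication strategy.

```python
from bisect import bisect_left


def replicas_for(token, ring_tokens, replication_factor):
    """Return the node tokens assumed to hold a key with the given token."""
    ordered = sorted(ring_tokens)
    # The first node whose token is >= the key's token owns it; wrap to start.
    first = bisect_left(ordered, token) % len(ordered)
    count = min(replication_factor, len(ordered))
    return [ordered[(first + i) % len(ordered)] for i in range(count)]


def survives(token, ring_tokens, replication_factor, dead):
    """True if at least one replica of the key sits on a live node."""
    holders = replicas_for(token, ring_tokens, replication_factor)
    return any(node not in dead for node in holders)
```

With nodes at tokens 0, 40, and 100, a key with token 30 lives on node 40
only when the factor is 1, so losing that node loses the key. With a
factor of 2 it also lives on node 100 and is still readable.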
> I realize that most of my comments have been "go look for this term"
> and "go read about this," but I am speaking from experience when I say
> that this is really the most effective way to learn about Cassandra.
> Oh, and the other thing, forget everything you think you know about
> relational databases - Cassandra offers a completely different model.

Oh, and I forgot to mention that you need to learn about replication
strategy and snitches. :)
These also have to do with how the data ends up where it ends up...

>>
>> Thank you in advance for your help,
>>  Mario
>
> Hope it helps.
> -Scott
>



-- 
-Scott
