From: Daniel Kluesing <dk@bluekai.com>
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wed, 28 Jul 2010 16:43:03 -0700
Subject: RE: Evaluating Cassandra for our use case
Message-ID: 
 <33FDEB0CE2F65F41A4CF8769247BB3668DE13E14E5@EXVMBX016-3.exch016.msoutlookonline.net>
References: <AANLkTi=j4-coc2xT8ximLBy3qukRwY9fe-gt+MbHM2vz@mail.gmail.com>
In-Reply-To: <AANLkTi=j4-coc2xT8ximLBy3qukRwY9fe-gt+MbHM2vz@mail.gmail.com>

>Is it possible to configure Cassandra in such a way that a
>node only ever asks itself for the data, and if so what sort of
>effect will that have on read performance?

Check out the RingCache class, which lets you make your clients smart enough
to ask the right server. (Also, if all nodes have all the data, as you mention
below, and you have your read consistency set to ONE, the node you ask can
answer from its own copy instead of waiting on other nodes over the network.)
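
To make "smart enough to ask the right server" concrete, here is a minimal
sketch of the idea in Python. It is a toy model, not the RingCache API: the
names ToyRingCache, token_for, owner and owner_of_token are all made up for
illustration, and MD5 is only a stand-in for whatever partitioner the cluster
uses. The assumption is that each node owns the range of tokens ending at its
own token, so the client looks up the first node at or after the key's token.

```python
import bisect
import hashlib
from typing import Dict


def token_for(key: str) -> int:
    """Hash a key onto the ring (MD5 here is an illustrative stand-in)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class ToyRingCache:
    """Hypothetical client-side cache of which node owns which token range."""

    def __init__(self, node_tokens: Dict[str, int]) -> None:
        # Keep (token, node) pairs sorted so lookups can binary-search.
        self._ring = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._ring]

    def owner_of_token(self, token: int) -> str:
        # First node whose token is >= the given token; past the largest
        # token, the modulo wraps around to the first node on the ring.
        index = bisect.bisect_left(self._tokens, token)
        return self._ring[index % len(self._ring)][1]

    def owner(self, key: str) -> str:
        """Node a client would contact first for this key."""
        return self.owner_of_token(token_for(key))
```

A client would build the cache once from the cluster's ring description and
then call owner(key) before each request, so the request lands on a node that
already holds the row.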

>I have also read that Cassandra will distribute data between different
>nodes, while we want all to have a full copy of all data. Is it
>possible to configure Cassandra in this way?

If you set the replication factor to the number of nodes, then every node
will have a full copy. (That might get sticky if you add new servers, since
I don't think you can change the replication factor once set.)
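
As a rough illustration of why that works, here is a toy Python model of
simple ring-based replica placement. It assumes replicas go to the node that
owns the token and then to the next nodes clockwise around the ring; the
function name replicas_for and the node names are hypothetical, and this is
a sketch of the idea rather than Cassandra's own code.

```python
from typing import Dict, List


def replicas_for(token: int, node_tokens: Dict[str, int],
                 replication_factor: int) -> List[str]:
    """Return the nodes that would hold a copy of the row at this token."""
    # Order nodes by their position on the ring.
    ring = sorted(node_tokens.items(), key=lambda item: item[1])
    # Never ask for more replicas than there are nodes.
    count = min(replication_factor, len(ring))
    # The primary is the first node whose token is >= the row's token;
    # if there is none, wrap around to the first node on the ring.
    start = next((i for i, (_, t) in enumerate(ring) if t >= token), 0)
    # Remaining replicas are the following nodes, walking clockwise.
    return [ring[(start + step) % len(ring)][0] for step in range(count)]
```

With three nodes and a replication factor of 3, replicas_for returns all
three nodes for any token, which is the "every node has a full copy" case.
Add a fourth node without raising the factor and each row lands on only
three of the four, which is where it could get sticky.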

-----Original Message-----
From: Russ Brown [mailto:pickscrape@gmail.com]
Sent: Wednesday, July 28, 2010 4:11 PM
To: user@cassandra.apache.org
Subject: Evaluating Cassandra for our use case

Hi,

I'm currently looking at NoSQL solutions to replace a bespoke system
that we currently have in place. Currently I think the best fit is
Cassandra, but I would like to get some feedback from those who know
it better before spending more time on it.

Our current system is geared to allowing our web servers to operate
very quickly and completely independently (for most pages) of other
servers. This is accomplished by keeping chunks of data about "things"
on each machine's disk with a file per entity. The key in this is
effectively the filename, with the value being the file's content. A
central server handles the initial generation (and subsequent updates)
of these files, and distribution to the web servers is carried out by
a combination of network share mounting and shell scripts.

The system *does* work: the servers are very fast and they do work
fine when the servers behind them disappear. However, the storage and
transport mechanisms are cumbersome, and we would like to see if there
are suitable alternatives available.

The idea is to replace the disk-based storage on each server with a
NoSQL solution using replication to handle the transport automatically
for us. What we need is:

 * One "master", though being able to have a backup for it that we
could quickly bring into play would be advantageous
 * Each "slave" must have a full copy of the data
 * It does not matter if the slaves do not get updates immediately or
at exactly the same time, as long as they get there quickly
 * Reads must be fast (though understandably it will probably be
slower than reading a system-cached file direct from disk)
 * It would be a bonus if the slaves could be written to too, with the
writes making their way to the other nodes. This is probably a given,
but I thought I'd mention it anyway.

Now, I have read a few things about Cassandra's read performance that
have got me a bit worried. However, I have also read quite a
bit about its flexibility in terms of topology, and that the read
performance is very much dependent on how things are set up. For
example, a lot of what I've read describes how when querying a node it
will ask other nodes for information, which it then collates and
returns. Is it possible to configure Cassandra in such a way that a
node only ever asks itself for the data, and if so what sort of
effect will that have on read performance? Our current solution is
designed to avoid having to hit the network, so doing the same here
would be advantageous.

I have also read that Cassandra will distribute data between different
nodes, while we want all to have a full copy of all data. Is it
possible to configure Cassandra in this way?

If this will work, it will be a heck of a lot cleaner and easier to
maintain than the current solution, so we're quite hopeful. :)

Feedback appreciated,

-- 

Russ