cassandra-user mailing list archives

From aaron morton <>
Subject Re: emptying my cluster
Date Wed, 04 Jan 2012 20:54:01 GMT
Some thoughts on the plan:

* You are monkeying around with things, do not be surprised when surprising things happen.

* Deliberately unbalancing the cluster may lead to Bad Things happening. 
* In the design discussed it is perfectly reasonable for data not to be on the archive node.

* Truncate is a cluster wide operation and all nodes must be online before it will start.

* Truncate takes a snapshot before deleting data; you could use this snapshot.
* The TTL on a column applies to that column on every node that holds it; it cannot differ between replicas.
* IMHO Cassandra data files (sstables or JSON dumps) are not a good format for a historical
archive, nothing against Cassandra. You need the lowest common format. 

If you have the resources for a second cluster could you put the two together and just have
one cluster with a very large retention policy? One cluster is easier than two.  

Assuming there is no business case for this, consider either:

* Dumping the historical data into a Hadoop (with or without HDFS) cluster with high compression.
If needed you could then run Hive / Pig to fill a companion Cassandra cluster with data on
demand. Or just query using Hadoop.
* Dumping the historical data to files with high compression and using a roll-your-own solution
to fill a cluster.
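
As a rough illustration of the second option, here is a minimal Python sketch that writes rows to a gzip-compressed JSON-lines file and reads them back. The file layout, the `dump_rows` / `load_rows` names and the `(row_key, columns)` shape are all assumptions made for the example; nothing here talks to Cassandra, so reading rows out of the cluster and writing them back in would still have to be supplied by your own client code.

```python
import gzip
import json


def dump_rows(rows, path):
    """Write (row_key, columns) pairs to a gzip-compressed JSON-lines file.

    `columns` is assumed to be a list of [timestamp, value] pairs.
    Returns the number of rows written.
    """
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for row_key, columns in rows:
            fh.write(json.dumps({"key": row_key, "columns": columns}) + "\n")
            count += 1
    return count


def load_rows(path):
    """Yield (row_key, columns) pairs back from a file written by dump_rows."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            yield record["key"], record["columns"]
```

One JSON object per line keeps the archive readable by almost any tool and lets a loader stream the file instead of holding it in memory.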

Also consider talking to DataStax about DSE.

Aaron Morton
Freelance Developer

On 5/01/2012, at 1:41 AM, Alexandru Sicoe wrote:

> Hi,
> On Tue, Jan 3, 2012 at 8:19 PM, aaron morton <> wrote:
> Running a time based rolling window of data can be done using the TTL. Backing up the
nodes for disaster recovery can be done using snapshots. Restoring any point in time will be
tricky because you may restore columns where the TTL has expired.
> Yeah, that's the thing...if I want to use the system as I explain further below, I cannot
do backing up of data (for later restoration) if I'm using TTLs. 
>> Will I get a single copy of the data in the remote storage or will it be twice the
data (data + replica)?
> You will get RF copies of the data. (By the way, there is no original copy.)
> Well, if I organize the cluster as I mentioned in the first email, I will get one copy
of each row at a certain point in time on node2 if I take it offline, perform a major compaction
and GC, won't I? I don't want to send duplicated data to the mass storage!
> Can you share a bit more about the use case? How much data and what sort of read patterns?
> I have several applications that feed into Cassandra about 2 million different variables
(each representing a different monitoring value/channel). The system receives updates for
each of these monitoring values at different rates. For each new update, the timestamp and
value are recorded in a Cassandra name-value pair. The schema of Cassandra is built using
one CF for data and 4 other CFs for metadata (metadata CFs are static - don't grow almost
at all once they've been loaded). The data CF uses a row for each variable. Each row acts
as a 4 hour time bin. I achieve this by creating the row key as a concatenation of the first
6 digits of the timestamp at which the data is inserted + the unique ID of the variable. After
the time bin expires, a new row will be created for the same variable ID.
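
The row key scheme described above can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes a 10-digit Unix timestamp in seconds, which the message does not specify. Under that assumption, keeping the first 6 digits drops the last 4, so the key prefix changes every 10,000 seconds.

```python
def row_key(timestamp, variable_id, prefix_digits=6):
    """Build a row key from the leading digits of the timestamp plus the variable ID.

    All updates whose timestamps share the same leading digits land in the same row.
    """
    return str(timestamp)[:prefix_digits] + str(variable_id)


def bin_width_seconds(timestamp_digits=10, prefix_digits=6):
    """Width of one time bin, given how many trailing digits are dropped.

    Assumes the timestamp is expressed in seconds.
    """
    return 10 ** (timestamp_digits - prefix_digits)
```

For example, `row_key(1325710441, "var42")` and `row_key(1325719999, "var42")` share a row, while `row_key(1325720000, "var42")` starts a new one.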
> The system can currently sustain the insertion load. Now I'm looking into organizing
the flow of data out of the cluster and retrieval performance for random queries:
> Why do I need to organize the data out? Well, my requirement is to keep all the data
coming into the system at the highest granularity for long term (several years). The 3 node
cluster I mentioned is the online cluster which is supposed to be able to absorb the input
load for a relatively short period of time, a few weeks (I am constrained to do this). After
this period the data has to be shipped out of the cluster in a mass storage facility and the
cluster needs to be emptied to make room for more data. Also, the online cluster will serve
reads while it takes in data. For older data I am planning to have another cluster that gets
loaded with data from the storage facility on demand and will serve reads from there.
> Why random queries? There is no specific use case about them, that's why I want to rely
only on the built-in Cassandra indexes for now. Generally the client will ask for sets of
values within a time range up to 8-10 hours in the past. Apart from some sets of variables
that will be almost always asked together, any combination is possible because this system
will feed into a web dashboard which will be used for debugging purposes - to correlate and
aggregate streams of variables. Depending on the problem, different variable combinations
could be investigated. 
> Can you split the data stream into a permanent log record and also into Cassandra for
a rolling window of queryable data?
> In the end, essentially that's what I've been meaning to do with organizing the cluster
in a 2 DC setup: I wanted to have 2 nodes in DC1 taking the data and reads (the rolling window)
and replicating to the node in DC2 (the permanent log - of a single copy of the data). I was
thinking of implementing the rolling window by emptying the nodes in DC1 using truncate instead
of what you propose now with the rolling window using TTL. 
> Ok, so I can do what you are saying easily if Cassandra allows me to have a TTL only
on the first copy of the data and have the second replica without a TTL. Is this possible?
I think it would solve my problem, as long as I can backup and empty the node in DC2 before
the TTLs expire in the other 2 nodes.
> Cheers,
> Alex
> Cheers
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> On 3/01/2012, at 11:41 PM, Alexandru Sicoe wrote:
>> Hi,
>> I need to build a system that stores data for years, so yes, I am backing up data
in another mass storage system from where it could be accessed later. The data that I successfully
back up has to be deleted from my cluster to make space for new data coming in.
>> I was aware of the snapshotting which I will use for getting the data out of node2:
it creates hard links to the SSTables of a CF and then I can copy over those files pointed
to by the hard links into another location. After that I get rid of the snapshot (hard links)
and then I can truncate my CFs. It's clear that snapshotting will give me a single copy of
the data in case I have a unique copy of the data on one node. It's not clear to me what happens
if I have let's say a cluster with 3 nodes and RF=2 and I do a snapshot of every node and
copy those snapshots to remote storage. Will I get a single copy of the data in the remote
storage or will it be twice the data (data + replica)?
>> I've started reading about TTL and I think I can use it but it's not clear to me
how it would work in conjunction with the snapshotting/backing up I need to do. I mean, it
will impose a deadline by which I need to perform a backup in order not to miss any data.
Also, I might duplicate the data if some columns don't expire fully between 2 backups. Any
clarifications on this?
>> Cheers,
>> Alex
>> On Tue, Jan 3, 2012 at 9:44 AM, aaron morton <> wrote:
>> That sounds a little complicated. 
>> Do you want to get the data out for an off node backup or is it for processing in
another system ? 
>> You may get by using:
>> * TTL to expire data via compaction
>> * snapshots for backups
>> Cheers
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> On 3/01/2012, at 11:00 AM, Alexandru Sicoe wrote:
>>> Hi everyone and Happy New Year!
>>> I need advice for organizing data flow outside of my 3 node Cassandra 0.8.6 cluster.
I am configuring my keyspace to use the NetworkTopologyStrategy. I have 2 data centers each
with a replication factor of 1 (i.e. DC1:1; DC2:1). The configuration of the PropertyFileSnitch is:
>>>     ip_node1=DC1:RAC1
>>> I assign tokens like this:
>>>     node1 = 0
>>>     node2 = 1
>>>     node3 = 85070591730234615865843651857942052864
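
The token values above can be reproduced with a short Python sketch. It assumes the RandomPartitioner, whose token space runs from 0 to 2**127 (the message does not name the partitioner), and it assumes that node1 and node3 form one data center while node2 sits alone in the other with its token offset by 1 to avoid a collision. The `balanced_tokens` helper is a hypothetical name for illustration.

```python
RING_SIZE = 2 ** 127  # token space of the RandomPartitioner (assumed here)


def balanced_tokens(node_count, offset=0):
    """Evenly spaced tokens for the nodes of one data center.

    `offset` shifts every token so that a second data center
    does not reuse a token already taken in the first.
    """
    return [i * RING_SIZE // node_count + offset for i in range(node_count)]


dc1_tokens = balanced_tokens(2)            # two nodes: node1 and node3
dc2_tokens = balanced_tokens(1, offset=1)  # one node: node2
```

Here `dc1_tokens` is `[0, 85070591730234615865843651857942052864]` and `dc2_tokens` is `[1]`, matching the assignment listed above.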
>>> My write consistency level is ANY.
>>> My data sources are only inserting data in node1 & node3. Essentially what
happens is that a replica of every input value will end up on node2. Node 2 thus has a copy
of the entire data written to the cluster. When Node2 starts getting full, I want to have
a script which pulls it off-line and does a sequence of operations (compaction/snapshotting/exporting/truncating
the CFs) in order to back up the data in a remote place and to free it up so that it can take
more data. When it comes back on-line it will take hints from the other 2 nodes.
>>> This is how I plan on shipping data out of my cluster without any downtime or
any major performance penalty. The problem is when I want to also truncate the CFs in node1
& node3 to also free them up of data. I don't know whether I can do this without any downtime
or without any serious performance penalties. Is anyone using truncate to free up CFs of data?
How efficient is this?
>>> Any observations or suggestions are much appreciated!
>>> Cheers,
>>> Alex