incubator-cassandra-user mailing list archives

From Phil Stanhope <pstanh...@wimba.com>
Subject Re: Finding new Cassandra data
Date Tue, 22 Jun 2010 15:57:37 GMT
I can envision two fundamentally different approaches:

1. A CF with CompareWith set to LongType ... use microsecond timestamps as your keys ... then you
can filter by time ranges.

This implies that you are willing to do a double write (once for the original data and then
again for the logging), plus a third operation: a range_slice read (which will most likely
require pagination) to determine what to then push into your other system.
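
To make the cost of approach 1 concrete, here is a rough sketch of the double write and the
paginated time-range read. It is an in-memory stand-in, not Cassandra client code: the names
TimeIndexedLog, insert_row and scan_range are made up for illustration, and a sorted Python
list plays the role of the timestamp-ordered CF.

```python
import bisect


class TimeIndexedLog:
    """In-memory stand-in (an assumption) for a CF ordered by long timestamps."""

    def __init__(self):
        self._entries = []  # kept sorted as (timestamp_us, row_key)

    def record(self, timestamp_us, row_key):
        bisect.insort(self._entries, (timestamp_us, row_key))

    def page(self, start_us, end_us, count, after=None):
        """Return up to `count` entries with start_us <= ts <= end_us.

        `after` is the last entry of the previous page, or None for the first page.
        """
        if after is None:
            # (start_us,) sorts before every (start_us, key) tuple.
            lo = bisect.bisect_left(self._entries, (start_us,))
        else:
            lo = bisect.bisect_right(self._entries, after)
        out = []
        for entry in self._entries[lo:lo + count]:
            if entry[0] > end_us:
                break
            out.append(entry)
        return out


def insert_row(data_store, log, row_key, columns, timestamp_us):
    """The double write: once for the data, once for the log."""
    data_store[row_key] = columns      # write 1: the real row (dict stand-in)
    log.record(timestamp_us, row_key)  # write 2: the time-ordered log entry


def scan_range(log, start_us, end_us, page_size=100):
    """Walk a time range page by page, resuming after the last entry seen."""
    cursor = None
    while True:
        page = log.page(start_us, end_us, page_size, after=cursor)
        if not page:
            return
        yield from page
        cursor = page[-1]
```

The cursor is the full (timestamp, key) pair rather than just the timestamp, so two rows
written in the same microsecond are not lost at a page boundary.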

Which raises a question ... if you are the one inserting and generating keys ... and you know
the key name ... why not simply push the key into a queue (non-Cassandra) and do the processing
against that? So ...

2. Don't store new row keys in a CF ... at the point of using the thrift API simply build
a log of new keys and process that log asynchronously.
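
A minimal sketch of approach 2, assuming a single writer process. The dict stands in for
the Thrift insert and the list stands in for the other data store; write_row, export_worker
and run_demo are hypothetical names, and an in-process queue.Queue is used only to show the
shape (a real deployment would more likely want a durable queue).

```python
import queue
import threading

_STOP = object()  # sentinel that tells the worker to exit


def write_row(store, new_keys, row_key, columns):
    """Write the row, then log its key for asynchronous processing."""
    store[row_key] = columns  # stand-in for the Thrift insert
    new_keys.put(row_key)     # the "log of new keys"


def export_worker(store, new_keys, exported):
    """Drain the key log and push each new row to the other system."""
    while True:
        row_key = new_keys.get()
        if row_key is _STOP:
            break
        exported.append((row_key, store[row_key]))  # stand-in for the export


def run_demo():
    store, exported = {}, []
    new_keys = queue.Queue()
    worker = threading.Thread(target=export_worker,
                              args=(store, new_keys, exported))
    worker.start()
    for i in range(3):
        write_row(store, new_keys, f"user{i}", {"name": f"n{i}"})
    new_keys.put(_STOP)
    worker.join()
    return exported
```

The row is written before its key is queued, so the worker never sees a key whose data is
not there yet.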

This approach causes you to ask yourself another question: of the nodes in my cluster, am
I willing to declare that some of those nodes are only available for write-thru processing?
It's not Cassandra's job to make these decisions for you ... it's an application's decision.
If you allow all nodes to perform writes, then you'll either have to consolidate logs or introduce
some form of common queue for coordination of the async updates to non-Cassandra data stores.
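
If you go the log consolidation route, the merge step could look something like this. It
assumes each writer keeps its own log as a list of (timestamp_us, row_key) pairs already
sorted by timestamp, and that a key appearing in more than one log only needs to be pushed
once; consolidate is a hypothetical helper, not part of any Cassandra API.

```python
import heapq


def consolidate(node_logs):
    """Merge per-node key logs into one time-ordered stream of new keys.

    Each log must already be sorted by timestamp. A key that shows up
    more than once is emitted only at its earliest timestamp.
    """
    seen = set()
    for timestamp_us, row_key in heapq.merge(*node_logs):
        if row_key not in seen:
            seen.add(row_key)
            yield timestamp_us, row_key
```

heapq.merge reads the logs lazily, so nothing has to be loaded in full before the merge
starts, though the set of seen keys does grow with the number of distinct keys.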

-phil

On Jun 22, 2010, at 11:18 AM, Gary Dusbabek wrote:

> On Tue, Jun 22, 2010 at 09:59, David Boxenhorn <david@lookin2.com> wrote:
>> In my system, I have a Cassandra front end, and an Oracle back end. Some
>> information is created in the back end, and pushed out to the front end, and
>> some information is created in the front end and pulled into the back end.
>> 
>> Question: How do I locate new rows that have been created in Cassandra, for
>> import into Oracle?
>> 
>> I'm thinking of having a special column family "newRows" that contains only
>> the keys of the new rows. The offline process would look there to see what's
>> new, then delete those rows. The "newRows" CF would have no data! (The data
>> would be in the "real" CF.)
> 
> I've never tried an empty row, but I'm pretty sure you need at least one column.
> 
>> 
>> Is this a good solution? It seems weird to have a CF with rows but no data.
>> But I can't think of a better way.
>> 
>> Any thoughts?
> 
> Another approach would be to have a CF with a single row whose column
> names refer to the new row ids.  This would allow you efficient
> slicing.  The downside is that you'd need to make sure the row doesn't
> get too wide.  So depending on your throughput and application
> behavior, this may or may not work.
> 
> Gary.

