incubator-cassandra-user mailing list archives

From Clint Kelly <clint.ke...@gmail.com>
Subject Re: Getting the most-recent version from time-series data
Date Wed, 26 Feb 2014 02:00:21 GMT
Hi Jonathan,

Thanks for the suggestion!  I see a couple of problems with this approach:

1. I do not know a priori all of the family names (so I still would not
know what value to use for LIMIT).

2. The "versions" here are similar to timestamps, so one "family" may get
updated far more often than the other.  Hence, if I order all of my data by
version, then the first 1000 rows in version order could all be from the
same family.  I want to get just the most recent value (or the N most
recent values) for each unique family.

I don't think there is a way to do this without performing some client-side
filtering, but I thought I'd see if anyone has any ideas.  I'm translating
a framework that was originally designed on top of HBase, so offering this
kind of functionality (by using HBase's "timestamp dimension") was
previously easy.  :)
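
A minimal sketch of the client-side filtering mentioned above, not a definitive implementation. It assumes the rows have already been fetched as plain (key, family, version, val) tuples and arrive in the table's clustering order (family ASC, version DESC within each key). The function name latest_per_family and the tuple layout are illustrative assumptions, not part of any driver API.

```python
from itertools import groupby, islice
from operator import itemgetter


def latest_per_family(rows, n=1):
    """Return the first n rows of every (key, family) group.

    Assumption: rows are (key, family, version, val) tuples, contiguous per
    (key, family) and sorted by version descending inside each group, which
    matches CLUSTERING ORDER BY (family ASC, version DESC).
    """
    result = []
    # groupby only merges adjacent rows, so it relies on the ordering above.
    for _group_key, group in groupby(rows, key=itemgetter(0, 1)):
        # Versions are descending, so the first n rows are the n most recent.
        result.extend(islice(group, n))
    return result


# Sample data copied from the cqlsh output quoted below.
rows = [
    ("monday", "revenue", 3, "$$$$$$"),
    ("monday", "revenue", 2, "$$$"),
    ("monday", "revenue", 1, "$$"),
    ("monday", "revenue", 0, "$"),
    ("monday", "traffic", 2, "medium"),
    ("monday", "traffic", 1, "light"),
    ("monday", "traffic", 0, "heavy"),
]

for row in latest_per_family(rows, n=1):
    print(row)
```

This needs no advance knowledge of the family names or of how many there are, but it still reads every row of the partition and discards most of them on the client, so the cost grows with the total number of versions stored.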

Best regards,
Clint




On Tue, Feb 25, 2014 at 4:51 PM, Jonathan Lacefield
<jlacefield@datastax.com> wrote:

> Clint
>
>    One approach would be to create a copy of this table and switch the
> clustering columns around so version precedes family.  This way you
> could easily grab the 1st, 2nd, N version rows.  Would this help you
> in your situation?
>
> Jonathan
>
> > On Feb 25, 2014, at 7:49 PM, Clint Kelly <clint.kelly@gmail.com> wrote:
> >
> > Hi everyone,
> >
> > Let's say that I have a table that looks like the following:
> >
> > CREATE TABLE time_series_stuff (
> >   key text,
> >   family text,
> >   version int,
> >   val text,
> >   PRIMARY KEY (key, family, version)
> > ) WITH CLUSTERING ORDER BY (family ASC, version DESC) AND
> >   bloom_filter_fp_chance=0.010000 AND
> >   caching='KEYS_ONLY' AND
> >   comment='' AND
> >   dclocal_read_repair_chance=0.000000 AND
> >   gc_grace_seconds=864000 AND
> >   index_interval=128 AND
> >   read_repair_chance=0.100000 AND
> >   replicate_on_write='true' AND
> >   populate_io_cache_on_flush='false' AND
> >   default_time_to_live=0 AND
> >   speculative_retry='99.0PERCENTILE' AND
> >   memtable_flush_period_in_ms=0 AND
> >   compaction={'class': 'SizeTieredCompactionStrategy'} AND
> >   compression={'sstable_compression': 'LZ4Compressor'};
> >
> > cqlsh:fiddle> select * from time_series_stuff ;
> >
> >  key    | family  | version | val
> > --------+---------+---------+--------
> >  monday | revenue |       3 | $$$$$$
> >  monday | revenue |       2 |    $$$
> >  monday | revenue |       1 |     $$
> >  monday | revenue |       0 |      $
> >  monday | traffic |       2 | medium
> >  monday | traffic |       1 |  light
> >  monday | traffic |       0 |  heavy
> >
> > (7 rows)
> >
> > Now let's say that I'd like to perform a query that gets me the most
> > recent N versions of "revenue" and "traffic."
> >
> > Is there a CQL query to do this?  Let's say that N=1.  Then I know that
> > I can do:
> >
> > cqlsh:fiddle> select * from time_series_stuff where key='monday' and
> >   family='revenue' limit 1;
> >
> >  key    | family  | version | val
> > --------+---------+---------+--------
> >  monday | revenue |       3 | $$$$$$
> >
> > (1 rows)
> >
> > cqlsh:fiddle> select * from time_series_stuff where key='monday' and
> >   family='traffic' limit 1;
> >
> >  key    | family  | version | val
> > --------+---------+---------+--------
> >  monday | traffic |       2 | medium
> >
> > (1 rows)
> >
> > But what if I have lots of "families" and I want to get the most recent
> > N versions of all of them in a single CQL statement?  Is that possible?
> > Unfortunately I am working on something where the family names and the
> > number of most-recent versions are not known a priori (I am porting some
> > code that was designed for HBase).
> >
> > Best regards,
> > Clint
>
