incubator-cassandra-user mailing list archives

From Tupshin Harper <tups...@tupshin.com>
Subject Re: Getting the most-recent version from time-series data
Date Wed, 26 Feb 2014 03:15:11 GMT
Hi Clint,

What you are describing could actually be accomplished with the Thrift API
and a multiget_slice with a slicerange having a count of 1. Initially I was
thinking that this was an important feature gap between Thrift and CQL, and
was going to suggest that it should be implemented (possible syntax is in
https://issues.apache.org/jira/browse/CASSANDRA-6167 which is almost a
superset of this feature).

But then some colleagues convinced me that, with a modern token-aware CQL
driver, you are actually better off (in terms of latency, throughput, and
reliability) doing each query separately on the client.

The reasoning is that if you did this with a single query, it would
necessarily be sent to a coordinator that wouldn't own most of the data
that you are looking for. That coordinator would then need to fan out the
read to all the nodes owning the partitions you are looking for.

Far better to just do it directly on the client. The token aware client
will send each request for a row straight to a node that owns it. With a
separate connection open to each node, this is done in parallel from the
get-go. Fewer hops. Less load on the coordinator. No bottlenecks. And with
a prepared statement, there is very little additional overhead for the
client, server, or network.
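
As a rough illustration of the client-side approach, here is a minimal
sketch using the DataStax Python driver. It assumes the table from Clint's
message and that the caller already knows the family names; the helper
names connect and latest_versions are hypothetical. It issues one prepared
query per family asynchronously, so a token-aware load balancing policy can
route each request to a replica that owns the partition.

```python
from typing import Any, Dict, Iterable, List

# One partition-and-family lookup; rows come back newest first because the
# table clusters version in descending order.
QUERY = (
    "SELECT key, family, version, val FROM time_series_stuff "
    "WHERE key = ? AND family = ? LIMIT ?"
)


def connect(contact_points: List[str], keyspace: str):
    """Open a session that uses token-aware routing (assumed cluster setup)."""
    # Imported here so the rest of the sketch does not require the driver
    # to be installed just to be read or exercised with a stand-in session.
    from cassandra.cluster import Cluster
    from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

    cluster = Cluster(
        contact_points,
        load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy()),
    )
    return cluster.connect(keyspace)


def latest_versions(
    session: Any, key: str, families: Iterable[str], n: int = 1
) -> Dict[str, List[Any]]:
    """Fetch the n most recent versions of each family for one key."""
    # Preparing the statement lets the driver work out the routing key.
    prepared = session.prepare(QUERY)
    # Start every request before waiting on any of them.
    futures = {
        family: session.execute_async(prepared, (key, family, n))
        for family in families
    }
    # Collect the results; result() blocks until that request completes.
    return {family: list(future.result()) for family, future in futures.items()}
```

With a session from connect(["127.0.0.1"], "fiddle"), a call such as
latest_versions(session, "monday", ["revenue", "traffic"], n=1) would issue
the two queries from Clint's message in parallel. The contact point here is
only a placeholder.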

-Tupshin


On Tue, Feb 25, 2014 at 7:48 PM, Clint Kelly <clint.kelly@gmail.com> wrote:

> Hi everyone,
>
> Let's say that I have a table that looks like the following:
>
> CREATE TABLE time_series_stuff (
>   key text,
>   family text,
>   version int,
>   val text,
>   PRIMARY KEY (key, family, version)
> ) WITH CLUSTERING ORDER BY (family ASC, version DESC) AND
>   bloom_filter_fp_chance=0.010000 AND
>   caching='KEYS_ONLY' AND
>   comment='' AND
>   dclocal_read_repair_chance=0.000000 AND
>   gc_grace_seconds=864000 AND
>   index_interval=128 AND
>   read_repair_chance=0.100000 AND
>   replicate_on_write='true' AND
>   populate_io_cache_on_flush='false' AND
>   default_time_to_live=0 AND
>   speculative_retry='99.0PERCENTILE' AND
>   memtable_flush_period_in_ms=0 AND
>   compaction={'class': 'SizeTieredCompactionStrategy'} AND
>   compression={'sstable_compression': 'LZ4Compressor'};
>
> cqlsh:fiddle> select * from time_series_stuff ;
>
>  key    | family  | version | val
> --------+---------+---------+--------
>  monday | revenue |       3 | $$$$$$
>  monday | revenue |       2 |    $$$
>  monday | revenue |       1 |     $$
>  monday | revenue |       0 |      $
>  monday | traffic |       2 | medium
>  monday | traffic |       1 |  light
>  monday | traffic |       0 |  heavy
>
> (7 rows)
>
> Now let's say that I'd like to perform a query that gets me the most
> recent N versions of "revenue" and "traffic."
>
> Is there a CQL query to do this?  Let's say that N=1.  Then I know that I
> can do:
>
> cqlsh:fiddle> select * from time_series_stuff where key='monday' and
> family='revenue' limit 1;
>
>  key    | family  | version | val
> --------+---------+---------+--------
>  monday | revenue |       3 | $$$$$$
>
> (1 rows)
>
> cqlsh:fiddle> select * from time_series_stuff where key='monday' and
> family='traffic' limit 1;
>
>  key    | family  | version | val
> --------+---------+---------+--------
>  monday | traffic |       2 | medium
>
> (1 rows)
>
> But what if I have lots of "families" and I want to get the most recent N
> versions of all of them in a single CQL statement?  Is that possible?
> Unfortunately I am working on something where the family names and the
> number of most-recent versions are not known a priori (I am porting some
> code that was designed for HBase).
>
> Best regards,
> Clint
>
