incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Clint Kelly <clint.ke...@gmail.com>
Subject Re: Getting the most-recent version from time-series data
Date Fri, 28 Feb 2014 17:47:17 GMT
Hi Tupshin,

Thanks for your help once again, I really appreciate it.  Quick question
regarding the issue of token-aware routing, etc.  Let's say that I am using
the table described earlier:

CREATE TABLE time_series_stuff (
  key text,
  family text,
  version int,
  val text,
  PRIMARY KEY (key, family, version)
) WITH CLUSTERING ORDER BY (family ASC, version DESC)

I want to retrieve values for the most-recent version of every family for a
given key, doing something like:

    SELECT * from time_series_stuff where key='mykey'

but getting only one version per family.

All of this data should live on the same node (or set of replica nodes),
correct?  I am specifying the partition key here, and I thought that only
the partition key determined on what physical nodes data exists.
Therefore, I would think that all of the results from this query would come
from a single replica node (or set of replica nodes, if the consistency
level is greater than 1).

Would you mind clarifying?  Thanks a lot!

Best regards,
Clint






On Wed, Feb 26, 2014 at 4:56 AM, Tupshin Harper <tupshin@tupshin.com> wrote:

> And one last clarification. Where I said "stored procedure" earlier, I
> meant "prepared statement". Sorry for the confusion. Too much typing while
> tired.
>
> -Tupshin
>
>
> On Tue, Feb 25, 2014 at 10:36 PM, Tupshin Harper <tupshin@tupshin.com>wrote:
>
>> I failed to address the matter of not knowing the families in advance.
>>
>> I can't really recommend any solution to that other than storing the list
>> of families in another structure that is readily queryable. I don't know
>> how many families you are thinking, but if it is in the millions or more,
>> You might consider constructing another table such as:
>> CREATE TABLE families (
>>   key int,
>>   family text,
>>   PRIMARY KEY (key, family)
>> );
>>
>>
>> store your families there, with a knowable set of keys (I suggest
>> something like the last 3 digits of the md5 hash of the family). So then
>> you could retrieve your families in nice sized batches
>> SELECT family FROM id WHERE key=0;
>> and then do the fan-out selects that I described previously.
>>
>> -Tupshin
>>
>>
>> On Tue, Feb 25, 2014 at 10:15 PM, Tupshin Harper <tupshin@tupshin.com>wrote:
>>
>>> Hi Clint,
>>>
>>> What you are describing could actually be accomplished with the Thrift
>>> API and a multiget_slice with a slicerange having a count of 1. Initially I
>>> was thinking that this was an important feature gap between Thrift and CQL,
>>> and was going to suggest that it should be implemented (possible syntax is
>>> in https://issues.apache.org/jira/browse/CASSANDRA-6167 which is almost
>>> a superset of this feature).
>>>
>>> But then I was convinced by some colleagues, that with a modern CQL
>>> driver that is token aware, you are actually better off (in terms of
>>> latency, throughput, and reliability), by doing each query separately on
>>> the client.
>>>
>>> The reasoning is that if you did this with a single query, it would
>>> necessarily be sent to a coordinator that wouldn't own most of the data
>>> that you are looking for. That coordinator would then need to fan out the
>>> read to all the nodes owning the partitions you are looking for.
>>>
>>> Far better to just do it directly on the client. The token aware client
>>> will send each request for a row straight to a node that owns it. With a
>>> separate connection open to each node, this is done in parallel from the
>>> get-go. Fewer hops. Less load on the coordinator. No bottlenecks. And with
>>> a stored procedure, very very little additional overhead to the client,
>>> server, or network.
>>>
>>> -Tupshin
>>>
>>>
>>> On Tue, Feb 25, 2014 at 7:48 PM, Clint Kelly <clint.kelly@gmail.com>wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> Let's say that I have a table that looks like the following:
>>>>
>>>> CREATE TABLE time_series_stuff (
>>>>   key text,
>>>>   family text,
>>>>   version int,
>>>>   val text,
>>>>   PRIMARY KEY (key, family, version)
>>>> ) WITH CLUSTERING ORDER BY (family ASC, version DESC) AND
>>>>   bloom_filter_fp_chance=0.010000 AND
>>>>   caching='KEYS_ONLY' AND
>>>>   comment='' AND
>>>>   dclocal_read_repair_chance=0.000000 AND
>>>>   gc_grace_seconds=864000 AND
>>>>   index_interval=128 AND
>>>>   read_repair_chance=0.100000 AND
>>>>   replicate_on_write='true' AND
>>>>   populate_io_cache_on_flush='false' AND
>>>>   default_time_to_live=0 AND
>>>>   speculative_retry='99.0PERCENTILE' AND
>>>>   memtable_flush_period_in_ms=0 AND
>>>>   compaction={'class': 'SizeTieredCompactionStrategy'} AND
>>>>   compression={'sstable_compression': 'LZ4Compressor'};
>>>>
>>>> cqlsh:fiddle> select * from time_series_stuff ;
>>>>
>>>>  key    | family  | version | val
>>>> --------+---------+---------+--------
>>>>  monday | revenue |       3 | $$$$$$
>>>>  monday | revenue |       2 |    $$$
>>>>  monday | revenue |       1 |     $$
>>>>  monday | revenue |       0 |      $
>>>>  monday | traffic |       2 | medium
>>>>  monday | traffic |       1 |  light
>>>>  monday | traffic |       0 |  heavy
>>>>
>>>> (7 rows)
>>>>
>>>> Now let's say that I'd like to perform a query that gets me the most
>>>> recent N versions of "revenue" and "traffic."
>>>>
>>>> Is there a CQL query to do this?  Let's say that N=1.  Then I know that
>>>> I can do:
>>>>
>>>> cqlsh:fiddle> select * from time_series_stuff where key='monday' and
>>>> family='revenue' limit 1;
>>>>
>>>>  key    | family  | version | val
>>>> --------+---------+---------+--------
>>>>  monday | revenue |       3 | $$$$$$
>>>>
>>>> (1 rows)
>>>>
>>>> cqlsh:fiddle> select * from time_series_stuff where key='monday' and
>>>> family='traffic' limit 1;
>>>>
>>>>  key    | family  | version | val
>>>> --------+---------+---------+--------
>>>>  monday | traffic |       2 | medium
>>>>
>>>> (1 rows)
>>>>
>>>> But what if I have lots of "families" and I want to get the most recent
>>>> N versions of all of them in a single CQL statement.  Is that possible?
>>>> Unfortunately I am working on something where the family names and the
>>>> number of most-recent versions are not known a priori (I am porting some
>>>> code that was designed for HBase).
>>>>
>>>> Best regards,
>>>> Clint
>>>>
>>>
>>>
>>
>

Mime
View raw message