cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Peter Lin <wool...@gmail.com>
Subject Re: Storing bi-temporal data in Cassandra
Date Fri, 20 Feb 2015 22:02:57 GMT
I think I get the basics of what you want to achieve. Side note, the sample
insert seems to have a typo for the transaction time

For the first query, I would store the data using weatherstation _id as the
key. The create table statement might look like this.

CREATE TABLE weatherstation (
weatherstation_id text,
latest_temperature int,
PRIMARY KEY (weatherstation_id)
)
Each new sensor reading would be a new dynamic column and it would update
the value for latest_temperature

For the second query, I get the impression the transactions are done in
batches. Is that interpretation correct?

Assuming inserts are in batches, I would store the same sensor data in a
second table. It might look like this.
CREATE TABLE readings_by_transaction (
transaction_time timestamp,
PRIMARY_KEY (transaction_time)
)

You could add a static column and use collection type to store the values,
or you can use dynamic columns. To get all values for a given transaction
time, you'd simply select all columns where the row_key is equal to the
timestamp. With the new collections, it's possible to static column that
uses map<weatherstation_id,temperature>. Each new sensor reading would be
an entry in the map.

Taking a step back and looking at the larger problem space. If we use a
loose definition of bi-temporal database, the use case fits, even if it
isn't like classic bi-temporal examples. In this specific case, the
temperature readings are valid only for the instance in time the value was
read. It's not possible to go back in time and re-read the temperature
value, so it's purely write once and read many.

One of the classic bi-temporal query examples is "get me all valid
addresses for [person x] between 2010 and 2012". My guess is your use case
will never need this type of query, given it's temperature data. The
approach I tend to go with is to store the raw data with an unique
immutable key and then have one or more summary tables. It does increase
the storage needs, but there are ways to handle that.

hope that helps




On Fri, Feb 20, 2015 at 10:20 AM, Raj N <raj.cassandra@gmail.com> wrote:

> Thanks for the response Peter. I used the temperature table because its
> the most common example on CQL timeseries and I thought I would reuse it.
> From some of the responses, looks like I was wrong.
>
> event_time is the time the event happened. So yes it is valid time. I was
> trying to see if I can get away with not having valid_from and valid_to in
> Cassandra.
> transaction_time is the time the database record was written.
>
> Let's take an example -
>
> INSERT INTO
> temperature(weatherstation_id,event_time,transaction_time,temperature)
> VALUES (’1234ABCD’,’2015-02-18 07:01:00′,’2015-02-18 07:01:00′,’72F’);
>
> INSERT INTO
> temperature(weatherstation_id,event_time,transaction_time,temperature)
> VALUES (’1234ABCD’,’2015-02-18 08:01:00′,’2015-02-18 08:01:00′,’72F’);
>
> And I get an update for the first record tomorrow, I want to keep both
> versions. So I would have -
>
> INSERT INTO
> temperature(weatherstation_id,event_time,transaction_time,temperature)
> VALUES (’1234ABCD’,’2015-02-18 07:01:00′,’2015-02-*19* 07:01:00′,’72F’);
>
> I fundamentally need to execute 2 types of queries -
>
> 1. select the latest values for the weatherstation for a given event time
> period which should ideally just return the first and third record.
> 2. select the values as of particular transaction time(say ’2015-02-18
> 08:01:00′), in which case I would expect to return first and second
> record.
>
> About your comment on having valid_time in the keys, do I have a choice in
> Cassandra, unless you are suggesting to use secondary indexes.
>
> I am new to bi-temporal data modeling. So please advise if you think
> building this on top of Cassandra is a stupid idea.
>
> -Rajesh
>
>
> On Sun, Feb 15, 2015 at 10:03 AM, Peter Lin <woolfel@gmail.com> wrote:
>
>>
>> I've built several different bi-temporal databases over the year for a
>> variety of applications, so I have to ask "why are you modeling it this
>> way?"
>>
>> Having a temperatures table doesn't make sense to me. Normally a
>> bi-temporal database has transaction time and valid time. The transaction
>> time is the timestamp of when the data is saved and valid time is usually
>> expressed as effective & expiration time.
>>
>> In the example, is event time suppose to be valid time and what is the
>> granularity (seconds, hours, or day)?
>> Which kind of queries do you need to query most of the time?
>> Why is event_time and transaction_time part of the key?
>>
>> If you take time to study temporal databases, having valid time in the
>> keys will cause a lot of headaches. Basically, it's an anti-pattern for
>> temporal databases. The key should only be the unique identifier and it
>> shouldn't have transaction or valid time. The use case you describe looks
>> more like regular time series data and not really bi-temporal.
>>
>> I would suggest take time to understand the kinds of queries you need to
>> run and then change the table.
>>
>> peter
>>
>>
>> On Sun, Feb 15, 2015 at 9:00 AM, Jack Krupansky <jack.krupansky@gmail.com
>> > wrote:
>>
>>> I had forgotten, but there is a new tuple notation to iterate over more
>>> than one clustering column in C* 2.0.6:
>>>
>>> https://issues.apache.org/jira/browse/CASSANDRA-4851
>>>
>>> For example,
>>>
>>> SELECT ... WHERE (c1, c2) > (1, 0)
>>>
>>> There's an example in the CQL spec:
>>> https://cassandra.apache.org/doc/cql3/CQL.html
>>>
>>>
>>> -- Jack Krupansky
>>>
>>> On Sat, Feb 14, 2015 at 6:29 PM, Dave Brosius <dbrosius@mebigfatguy.com>
>>> wrote:
>>>
>>>>  As you point out, there's not really a node-based problem with your
>>>> query from a performance point of view. This is a limitation of CQL in
>>>> that, cql wants to slice one section of a partition's row (no matter how
>>>> big the section is). In your case, you are asking to slice multiple
>>>> sections of a partition's row, which currently isn't supported.
>>>>
>>>> It seems silly perhaps that this is the case, as certainly in your
>>>> example it would seem useful, and not to difficult, but the problem is that
>>>> you can wind up with n-depth slicing of that partitioned row given an
>>>> arbitrary query syntax if range queries on clustering keys was allowed
>>>> anywhere.
>>>>
>>>> At present, you can either duplicate the data using the other
>>>> clustering key (transaction_time) as primary clusterer for this use
>>>> case, or omit the 3rd criterion (transaction_time = 'xxxx')in the
>>>> query and get all the range query results and filter on the client.
>>>>
>>>> hth,
>>>> dave
>>>>
>>>>
>>>>
>>>> On 02/14/2015 06:05 PM, Raj N wrote:
>>>>
>>>> I don't think thats solves my problem. The question really is why can't
>>>> we use ranges for both time columns when they are part of the primary key.
>>>> They are on 1 row after all. Is this just a CQL limitation?
>>>>
>>>>  -Raj
>>>>
>>>> On Sat, Feb 14, 2015 at 3:35 AM, DuyHai Doan <doanduyhai@gmail.com>
>>>> wrote:
>>>>
>>>>>  "I am trying to get the state as of a particular transaction_time"
>>>>>
>>>>>   --> In that case you should probably define your primary key in
>>>>> another order for clustering columns
>>>>>
>>>>>  PRIMARY KEY (weatherstation_id,transaction_time,event_time)
>>>>>
>>>>>  Then, select * from temperatures where weatherstation_id = 'foo' and
>>>>> event_time >= '2015-01-01 00:00:00' and event_time < '2015-01-02
>>>>> 00:00:00' and transaction_time = 'xxxx'
>>>>>
>>>>>
>>>>>
>>>>>  On Sat, Feb 14, 2015 at 3:06 AM, Raj N <raj.cassandra@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Has anyone designed a bi-temporal table in Cassandra? Doesn't look
>>>>>> like I can do this using CQL for now. Taking the time series example
from
>>>>>> well known modeling tutorials in Cassandra -
>>>>>>
>>>>>>  CREATE TABLE temperatures (
>>>>>> weatherstation_id text,
>>>>>> event_time timestamp,
>>>>>> temperature text,
>>>>>> PRIMARY KEY (weatherstation_id,event_time),
>>>>>> ) WITH CLUSTERING ORDER BY (event_time DESC);
>>>>>>
>>>>>>  If I add another column transaction_time
>>>>>>
>>>>>>  CREATE TABLE temperatures (
>>>>>> weatherstation_id text,
>>>>>> event_time timestamp,
>>>>>> transaction_time timestamp,
>>>>>>  temperature text,
>>>>>> PRIMARY KEY (weatherstation_id,event_time,transaction_time),
>>>>>> ) WITH CLUSTERING ORDER BY (event_time DESC, transaction_time DESC);
>>>>>>
>>>>>>  If I try to run a query using the following CQL, it throws an error
>>>>>> -
>>>>>>
>>>>>>  select * from temperatures where weatherstation_id = 'foo' and
>>>>>> event_time >= '2015-01-01 00:00:00' and event_time < '2015-01-02
>>>>>> 00:00:00' and transaction_time < '2015-01-02 00:00:00'
>>>>>>
>>>>>>  It works if I use an equals clause for the event_time. I am trying
>>>>>> to get the state as of a particular transaction_time
>>>>>>
>>>>>>  -Raj
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>

Mime
View raw message