incubator-cassandra-user mailing list archives

From Aaron Morton <aa...@thelastpickle.com>
Subject Re: SV: Using Cassandra for storing measurement data
Date Tue, 03 Aug 2010 09:44:34 GMT
As Justus said, you need to consider the way you want to get the data back and then denormalise
to suit. Do you need to support ad-hoc queries or will you know how you want to query ahead
of time?

Some different approaches might be:

Standard CF to hold the measurements taken, grouped by day
{
device_id/20100810 : { date_and_time : value,
                       date_and_time : value
                     }
}
- this spreads the writes for different devices (and days) around the cluster, but all writes
for one device on one day go to the same nodes.
- you can read all the measurements for one device for one day in one get
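
A minimal sketch of the row layout above, using a plain Python dict as a stand-in for the column
family rather than a real Cassandra client. The helper names (row_key, record, read_day) and the
ISO-formatted column names are assumptions made for illustration only.

```python
from datetime import datetime

def row_key(device_id: str, when: datetime) -> str:
    """Build a 'device_id/YYYYMMDD' row key for one device and one day."""
    return f"{device_id}/{when:%Y%m%d}"

# In-memory stand-in for a standard CF: row key -> {column name: value}.
measurements = {}

def record(device_id: str, when: datetime, value) -> None:
    """Store one measurement as a column in that device's row for that day."""
    row = measurements.setdefault(row_key(device_id, when), {})
    row[when.isoformat()] = value

def read_day(device_id: str, day: datetime):
    """Return all (time, value) pairs for one device and one day, sorted by time."""
    row = measurements.get(row_key(device_id, day), {})
    return sorted(row.items())

record("device1", datetime(2010, 8, 10, 9, 0), 10)
record("device1", datetime(2010, 8, 10, 9, 1), 15)
record("device1", datetime(2010, 8, 11, 9, 0), 12)
print(read_day("device1", datetime(2010, 8, 10)))
```

Because the day is part of the row key, each row stays bounded at one day of columns, however long
the device keeps reporting.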

Super CF to hold all the measures for a day, with super columns for the device
{
20100810 : {
    device_id : {
        date_and_time : value
    }
}
}
- this concentrates the write load for a single day on the same nodes for all devices. 
- may not be practicable if you have a lot of devices 
- you can read all the measurements for all devices for a single day in one get 
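
The super CF layout above can be pictured as nested maps. This is only an in-memory sketch with
hypothetical helper names (record_super, read_all_devices), not a client API.

```python
from collections import defaultdict

# Stand-in for a super CF:
# row key (day) -> super column (device_id) -> column (date_and_time) -> value.
by_day = defaultdict(lambda: defaultdict(dict))

def record_super(day: str, device_id: str, date_and_time: str, value) -> None:
    """Write one measurement under the day's row and the device's super column."""
    by_day[day][device_id][date_and_time] = value

def read_all_devices(day: str) -> dict:
    """One lookup on the day key returns every device's measurements for that day."""
    return {device: dict(cols) for device, cols in by_day.get(day, {}).items()}

record_super("20100810", "device1", "2010-08-10T09:00:00", 10)
record_super("20100810", "device2", "2010-08-10T09:00:00", 20)
print(read_all_devices("20100810"))
```

Every write for a given day lands under the same row key, which is the write concentration
described in the first bullet.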

Standard CF to store each measurement as a row by itself.
{
device/date_and_time : {
    "timestamp" : date_and_time, 
    "measurement" : "the value"
    }
}
- this spreads every write around the cluster for every device and day
- you can then also write the values into aggregate CFs, say grouped by day or device as
above. If you ever want to build new aggregates, you can use the raw data in this CF.
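
As a rough illustration of rebuilding an aggregate from the raw rows, here is an in-memory sketch.
The function names are hypothetical, and it assumes ISO-formatted timestamps and device ids that
contain no "/" character.

```python
# Stand-in for the raw CF: row key "device/date_and_time" -> columns.
raw = {}

def store_raw(device_id: str, date_and_time: str, value) -> None:
    """Store each measurement as its own row."""
    raw[f"{device_id}/{date_and_time}"] = {
        "timestamp": date_and_time,
        "measurement": value,
    }

def build_daily_aggregate(raw_rows: dict) -> dict:
    """Rebuild a 'device_id/YYYYMMDD' aggregate from the raw rows."""
    aggregate = {}
    for key, cols in raw_rows.items():
        device_id = key.partition("/")[0]
        # "2010-08-10T09:00:00" -> "20100810"
        day = cols["timestamp"][:10].replace("-", "")
        row = aggregate.setdefault(f"{device_id}/{day}", {})
        row[cols["timestamp"]] = cols["measurement"]
    return aggregate

store_raw("device1", "2010-08-10T09:00:00", 10)
store_raw("device1", "2010-08-10T09:01:00", 15)
store_raw("device2", "2010-08-11T09:00:00", 20)
print(build_daily_aggregate(raw))
```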

Try out some different ideas and see how easy it is to do your reporting. 
 
This post from Cloud Kick may help 
https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/

Aaron

On 03 Aug, 2010, at 07:37 PM, Thorvaldsson Justus <justus.thorvaldsson@svenskaspel.se>
wrote:

It sounds to me like using Cassandra is a good idea in your case. I figured I would help you,
as we Europeans need to cooperate, even though I have only worked with Cassandra for a month.
=)

1:
What is the query you want to use when charting the data? Use it to decide how to store
and sort your data.
2:
Where is your row? You must model it correctly; I added my explanation here: http://x0613.orbbox.com/blog/662/8567/
(http://www.justus.st/)
SCF-ROW-SC-C
Or
CF-ROW-C
3:
There are some limitations:
2GB of data in a row in 0.6, 2 billion columns in 0.7.
And
A row must fit on a node.
4:
> For my range-selections - I think I need the OrderPreservingPartitioner. Right?
I don't think you have to; just sort the columns by the time of measurement. You do not need it
because an entire row always lives on the same node, and the OrderPreservingPartitioner is only
about keeping row keys in order.
You will have to check again how columns and supercolumns are sorted. I haven't added my bookmarks
to the blog yet, but http://www.sodeso.nl/?p=421
was a good source of information, I think. There is more on the same blog as well.
5:
There are always alternative designs. You should not give up too early, as this is the most
important decision.
6:
Have a nice day Stefan

/Justus
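
To illustrate point 4: if the columns inside one row are kept sorted by measurement time, a time
range is just a contiguous slice of that row, whatever order the row keys are stored in. A
minimal in-memory sketch, with a hypothetical helper name (slice_columns) and the sample
timestamps from the original post:

```python
from bisect import bisect_left, bisect_right

def slice_columns(sorted_names: list, start: int, finish: int) -> list:
    """Return the column names in [start, finish] from an already sorted list."""
    lo = bisect_left(sorted_names, start)
    hi = bisect_right(sorted_names, finish)
    return sorted_names[lo:hi]

# Column names of one device's row, sorted by measurement time.
names = [1280819205, 1280819305, 1280819405]
print(slice_columns(names, 1280819300, 1280819500))
```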



-----Original message-----
From: Stefan Kaufmann [mailto:staeff@gmail.com]
Sent: 3 August 2010 09:21
To: user@cassandra.apache.org
Subject: Using Cassandra for storing measurement data

Dear Cassandra Users,

I'm quite new to Cassandra and I'm still trying to figure out whether I'm
on the right path for my requirements.
I'd like to explain my Cassandra design and hope to receive feedback on
whether this would work.

I'd like to use Cassandra to store measurement data from several
devices. Each device reports every minute, so there will be about 500 000
entries per device every year.
The following data has to be stored:
- device ID
- measurement Time (of course different to the Cassandra time-stamp)
- measurement value
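
The "about 500 000" figure follows from one measurement per minute. A quick check of the
arithmetic, which also compares a single ever-growing row per device against the 2 billion
columns per row figure quoted for 0.7 elsewhere in this thread (taken from the thread, not
verified here):

```python
MINUTES_PER_DAY = 60 * 24
ENTRIES_PER_YEAR = MINUTES_PER_DAY * 365

# Per-row column limit as quoted in this thread for 0.7 (an assumption, not verified).
QUOTED_COLUMN_LIMIT = 2_000_000_000

# Whole years of one-per-minute data that would fit in a single row under that limit.
years_in_one_row = QUOTED_COLUMN_LIMIT // ENTRIES_PER_YEAR

print(MINUTES_PER_DAY, ENTRIES_PER_YEAR, years_in_one_row)
```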

Later, the data should be charted - so I need to select time-ranges
from a device.



My current solution is a super column:
{
  name: "device1",
  value: {
    // measurement timestamps..
    1280819205: {name: "value", value: "10", timestamp: 123456789},
    1280819305: {name: "value", value: "15", timestamp: 123456789},
    1280819405: {name: "value", value: "10", timestamp: 123456789},
    // there will be millions of entries
  }
},
{
  name: "device2",
  value: {
    // measurement timestamps..
    1280819205: {name: "value", value: "20", timestamp: 123456789},
    1280819305: {name: "value", value: "15", timestamp: 123456789},
    1280819405: {name: "value", value: "20", timestamp: 123456789},
    // there will be millions of entries
  }
}

My questions:
My main concern is the huge number of subcolumns I'm using. All the
examples of Cassandra I have seen on the web used them to store only a few
columns (like a user profile).
So would this work with millions of entries?

For my range-selections - I think I need the OrderPreservingPartitioner. Right?

Are there alternative designs? Maybe one without a Super-column? I
can't think of one..

I'm looking forward to some answers,
Thanks in advance,
Stefan
