cassandra-user mailing list archives

From Denis Haskin <de...@haskinferguson.net>
Subject Re: Appropriate use for Cassandra?
Date Thu, 06 May 2010 01:24:45 GMT
Hmm... I was actually thinking of the inverse of that: 20K rows (one
per entity), with one supercolumn per time-series sample... that would
be something like 700,000 supercolumns per row (1.5 years, to start
with), growing to maybe 2,400,000 supercolumns.

That may be an issue for our access-path needs, however... and it may
not even be possible at all: I seem to recall reading that Cassandra
needs to hold an entire supercolumn in memory at once for
deserialization?
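Sketching what I mean with plain Python dicts (nothing
Cassandra-specific; entity IDs and field names are made up):

```python
# One row per entity; one supercolumn per time-series sample.
# Supercolumn name = the sample timestamp; subcolumns = the ~18 fields.
samples = {
    "2010-05-04T12:00:00": {"lat": "42.36", "lon": "-71.06", "speed": "31"},
    "2010-05-04T12:05:00": {"lat": "42.37", "lon": "-71.05", "speed": "28"},
}
rows = {"entity-00017": samples}  # ~20K rows shaped like this one

# Each supercolumn here is tiny (~18 subcolumns, ~120-150 bytes), so
# deserializing any one supercolumn is cheap; the open question is
# whether hundreds of thousands of supercolumns in a single row
# behaves well.
```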

Thanks,

dwh


On Wed, May 5, 2010 at 7:47 AM, David Strauss <david@fourkitchens.com> wrote:
> Given that your current schema has ~18 small columns per row, adding a
> level by using supercolumns may make sense for you, because the
> limitation of deserializing a whole supercolumn at once isn't going to
> be a problem for you.
>
> 20K supercolumns per row with ~18 small subcolumns each is completely
> reasonable. The (super)columns within each row will be ordered, and you
> can use the much-easier-to-administer RandomPartitioner.
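To illustrate the distinction (a sketch, not Cassandra's actual code):
RandomPartitioner places each row on the ring by an MD5 hash of its
key, while (super)column names within a row stay in comparator order:

```python
import hashlib

def md5_token(row_key: bytes) -> int:
    # RandomPartitioner derives a row's ring position from an MD5 hash
    # of its key, so rows spread evenly regardless of key order.
    return int(hashlib.md5(row_key).hexdigest(), 16)

t1 = md5_token(b"entity-00001")
t2 = md5_token(b"entity-00002")  # adjacent keys, far-apart tokens

# Supercolumn names *within* a row remain sorted by the comparator,
# so time-range slices inside one entity's row still work:
names = sorted(["2010-05-04T12:05:00", "2010-05-04T12:00:00"])
```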
>
> On 2010-05-05 11:22, Denis Haskin wrote:
>> David -- thanks for the thoughts.
>>
>> In re: your question
>>> Does the random partitioner support what you need?
>>
>> I guess my answer is "I'm not sure yet", but my initial thought was
>> that we'd use an OrderPreservingPartitioner so that we could do range
>> scans and so that rows for a given entity would be co-located (if I'm
>> understanding Cassandra's storage architecture properly).  But that
>> may be a naive approach.
>>
>> In our core data set, we have maybe 20,000 entities about which we are
>> storing time-series data (and it's fairly evenly distributed across
>> these entities).  It occurs to me it's also possible to store an
>> entity per row, with the time-series data as (or in?) supercolumns
>> (and maybe it would make sense to break those out into column families
>> by date range).  I'd have to think through a little more what that
>> might mean for our secondary indexing needs.
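One way to sketch that "break out by date range" idea (a hypothetical
key convention, not anything decided in this thread):

```python
def bucketed_row_key(entity_id: str, iso_ts: str) -> str:
    # Hypothetical convention: one row per entity per month, so no
    # single row grows without bound.  "2010-05-04T12:00:00"[:7]
    # yields the "2010-05" bucket.
    return f"{entity_id}:{iso_ts[:7]}"

key = bucketed_row_key("entity-00017", "2010-05-04T12:00:00")
```

A reader then touches only the buckets overlapping the requested time
range, at the cost of reassembling results across bucket rows.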
>>
>> Thanks,
>>
>> dwh
>>
>>
>>
>> On Wed, May 5, 2010 at 1:16 AM, David Strauss <david@fourkitchens.com> wrote:
>>> On 2010-05-05 04:50, Denis Haskin wrote:
>>>> I've been reading everything I can get my hands on about Cassandra and
>>>> it sounds like a possibly very good framework for our data needs; I'm
>>>> about to take the plunge and do some prototyping, but I thought I'd
>>>> see if I can get a reality check here on whether it makes sense.
>>>>
>>>> Our schema should be fairly simple; we may only keep our original data
>>>> in Cassandra, and the rollups and analyzed results in a relational db
>>>> (although this is still open for discussion).
>>>
>>> This is what we do on some projects. This is a particularly nice
>>> strategy if the raw : aggregated ratio is really high or the raw data is
>>> bursty or highly volatile.
>>>
>>> Consider Hadoop integration for your aggregation needs.
>>>
>>>> We have fairly small records: 120-150 bytes, in maybe 18 columns.
>>>> Data is additive only; we would rarely, if ever, be deleting data.
>>>
>>> Cassandra loves you.
>>>
>>>> Our core data set will accumulate at somewhere between 14 and 27
>>>> million rows per day; we'll be starting with about a year and a half
>>>> of data (7.5 - 15 billion rows) and eventually would like to keep 5
>>>> years online (25 to 50 billion rows).  (So that's maybe 1.3TB or so
>>>> per year, data only.  Not sure about the overhead yet.)
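The arithmetic behind those figures works out roughly as follows
(a quick data-only check, before replication or storage overhead):

```python
low_rows_day, high_rows_day = 14e6, 27e6   # rows/day
row_bytes = 150                            # upper end of 120-150 bytes

start_low = low_rows_day * 365 * 1.5       # ~7.7e9 rows
start_high = high_rows_day * 365 * 1.5     # ~14.8e9 rows ("7.5-15 billion")

five_yr_low = low_rows_day * 365 * 5       # ~25.6e9 rows
five_yr_high = high_rows_day * 365 * 5     # ~49.3e9 rows ("25 to 50 billion")

tb_per_year = high_rows_day * row_bytes * 365 / 1e12   # ~1.48 TB/yr
```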
>>>>
>>>> Ideally we'd like to also have a cluster with our complete data set,
>>>> which is maybe 38 billion rows per year (we could live with less than
>>>> 5 years of that).
>>>>
>>>> I haven't really thought through what the schema's going to be; our
>>>> primary key is an entity's ID plus a timestamp.  But there are 2 or 3
>>>> other retrieval paths we'll need to support as well.
>>>
>>> Generally, you do multiple retrieval paths through denormalization in
>>> Cassandra.
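A sketch of what denormalized retrieval paths look like in practice
(plain dicts standing in for column families; the names and the
"region" path are hypothetical, and this assumes no server-side
secondary indexes):

```python
# One hypothetical "column family" per retrieval path.
by_entity_time = {}   # primary path: entity id, then timestamp
by_region_time = {}   # secondary path: region, then (timestamp, entity)

def store_sample(entity_id, region, ts, fields):
    # Fan each write out to every path that must serve reads;
    # reads then never join or scan across rows.
    by_entity_time.setdefault(entity_id, {})[ts] = fields
    by_region_time.setdefault(region, {})[(ts, entity_id)] = fields

store_sample("entity-00017", "us-east", "2010-05-04T12:00:00",
             {"speed": "31"})
```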
>>>
>>>> Thoughts?  Pitfalls?  Gotchas? Are we completely whacked?
>>>
>>> Does the random partitioner support what you need?
>>>
>>> --
>>> David Strauss
>>>   | david@fourkitchens.com
>>> Four Kitchens
>>>   | http://fourkitchens.com
>>>   | +1 512 454 6659 [office]
>>>   | +1 512 870 8453 [direct]
>>>
>>>
>
>
> --
> David Strauss
>   | david@fourkitchens.com
>   | +1 512 577 5827 [mobile]
> Four Kitchens
>   | http://fourkitchens.com
>   | +1 512 454 6659 [office]
>   | +1 512 870 8453 [direct]
>
>



-- 
dwh
