cassandra-user mailing list archives

From John Thomas <jthom...@gmail.com>
Subject Re: Interesting use case
Date Thu, 09 Jun 2016 10:12:29 GMT
The example I gave was for N=1; if we need to save more values, I planned
to just add more columns.

On Thu, Jun 9, 2016 at 12:51 AM, kurt Greaves <kurt@instaclustr.com> wrote:

> I would say it's probably due to a significantly larger number of
> partitions when using the overwrite method - but really you should be
> seeing similar performance unless one of the schemas ends up generating a
> lot more disk IO.
> If you're planning to read the last N values for an event at the same
> time, the widerow schema would be better; otherwise, reading N events
> using the overwrite schema will result in you hitting N partitions. You
> really need to take into account how you're going to read the data when
> you design a schema, not only how many writes you can push through.
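>
> A minimal sketch of the single-partition read described above, assuming
> the eventvalue_widerow table quoted below and a hypothetical N of 10 (the
> ? bind markers stand for values supplied through a prepared statement):
>
> ```cql
> -- Rows within a partition are stored newest-first because of
> -- CLUSTERING ORDER BY (event_time DESC), so LIMIT returns the
> -- most recent values for one (system_name, event_name) pair.
> SELECT event_time, event_value
> FROM eventvalue_widerow
> WHERE system_name = ? AND event_name = ?
> LIMIT 10;
> ```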
>
> On 8 June 2016 at 19:02, John Thomas <jthom874@gmail.com> wrote:
>
>> We have a use case where we are storing event data for a given system and
>> only want to retain the last N values. Storing extra values for some time,
>> as long as it isn’t too long, is fine, but we must never retain fewer than
>> N. We can't use TTLs to delete the data because we can't be sure how
>> frequently events will arrive and could end up losing everything. Is there
>> any built-in mechanism to accomplish this or a known pattern that we can
>> follow? The events will be read and written at a pretty high frequency, so
>> the solution would have to be performant and not fragile under stress.
>>
>> We’ve played with a schema that just has N distinct columns with one
>> value in each, but have found that overwrites seem to perform much worse
>> than wide rows. The use case we tested only required that we store the
>> most recent value:
>>
>> CREATE TABLE eventvalue_overwrite (
>>     system_name text,
>>     event_name text,
>>     event_time timestamp,
>>     event_value blob,
>>     PRIMARY KEY (system_name, event_name)
>> );
>>
>> CREATE TABLE eventvalue_widerow (
>>     system_name text,
>>     event_name text,
>>     event_time timestamp,
>>     event_value blob,
>>     PRIMARY KEY ((system_name, event_name), event_time)
>> ) WITH CLUSTERING ORDER BY (event_time DESC);
>>
>> We tested it against the DataStax AMI on EC2 with 6 nodes, replication
>> factor 3, write consistency 2, and default settings, using a write-only
>> workload, and got 190K writes/s for wide row and 150K writes/s for
>> overwrite. Thinking through the write path, it seems the performance
>> should be pretty similar, with probably smaller SSTables for the overwrite
>> schema. Can anyone explain the big difference?
>>
>> The wide row solution is more complex in that it requires a separate
>> cleanup thread that will handle deleting the extra values. If that’s the
>> path we have to follow, we’re thinking we’d add a bucket of some sort so
>> that we can delete an entire partition at a time after copying some values
>> forward, on the assumption that deleting the whole partition is much better
>> than deleting some slice of the partition. Is that true? Also, is there
>> any difference between setting a really short TTL and doing a delete?
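>>
>> For concreteness, the operations being compared could be sketched as
>> follows. This is only an illustration: the eventvalue_bucketed table, its
>> bucket column, and the 60-second TTL are assumptions, not something we
>> have settled on, and the ? bind markers stand for prepared-statement
>> values.
>>
>> ```cql
>> -- Hypothetical bucketed variant of eventvalue_widerow; the table name
>> -- and the "bucket" column are assumptions made for this sketch.
>> CREATE TABLE eventvalue_bucketed (
>>     system_name text,
>>     event_name text,
>>     bucket int,
>>     event_time timestamp,
>>     event_value blob,
>>     PRIMARY KEY ((system_name, event_name, bucket), event_time)
>> ) WITH CLUSTERING ORDER BY (event_time DESC);
>>
>> -- Whole-partition delete: removes one bucket with a single
>> -- partition-level tombstone.
>> DELETE FROM eventvalue_bucketed
>> WHERE system_name = ? AND event_name = ? AND bucket = ?;
>>
>> -- Slice delete on the unbucketed table: removes everything older than
>> -- a cutoff within one partition. To my knowledge, range deletes on a
>> -- clustering column require Cassandra 3.0 or later.
>> DELETE FROM eventvalue_widerow
>> WHERE system_name = ? AND event_name = ? AND event_time < ?;
>>
>> -- Short TTL instead of an explicit delete: the written cells expire
>> -- on their own after 60 seconds (an arbitrary value).
>> INSERT INTO eventvalue_widerow
>>     (system_name, event_name, event_time, event_value)
>> VALUES (?, ?, ?, ?) USING TTL 60;
>> ```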
>>
>> I know there are a lot of questions in there but we’ve been going back
>> and forth on this for a while and I’d really appreciate any help you could
>> give.
>>
>> Thanks,
>> John
>
> --
> Kurt Greaves
> kurt@instaclustr.com
> www.instaclustr.com
>
