incubator-cassandra-user mailing list archives

From Boris Solovyov <boris.solov...@gmail.com>
Subject Seeking suggestions for a use case
Date Tue, 12 Feb 2013 10:55:28 GMT
Hello list!

I have an application with the following characteristics:

   - data is time series: tens of millions of series at 1-second granularity,
   like stock ticker data
   - each value is a timestamp plus an integer (uint64)
   - data is append-only, never updated
   - writes never go far into the past; occasionally a point may be written 10
   seconds late, but not more
   - the workload is write-mostly, roughly 99.9% writes I think
   - most reads will be of recent data, always over a range of timestamps
   - data needs to be purged after some time, e.g. 1 week

I am considering using Cassandra. No other existing database (HBase, Riak,
etc.) seems well suited for this.
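
To make this concrete, here is a rough sketch of the kind of CQL 3 data model
I have in mind. The table and column names and the per-day bucketing are just
my own illustration, nothing I have settled on (and bigint is signed, so it is
only a stand-in for uint64):

    -- Sketch only: one partition per (series, day), clustered by timestamp,
    -- so reads of recent data are a single contiguous slice.
    CREATE TABLE ticks (
        series_id bigint,     -- which of the tens of millions of series
        day       int,        -- day bucket, e.g. 20130212, to bound partition size
        ts        timestamp,  -- sample time at 1-second granularity
        value     bigint,     -- the counter value (signed stand-in for uint64)
        PRIMARY KEY ((series_id, day), ts)
    );

    -- Typical read: a short, recent timestamp range for one series.
    SELECT ts, value FROM ticks
     WHERE series_id = 42 AND day = 20130212
       AND ts >= '2013-02-12 10:00:00' AND ts < '2013-02-12 10:05:00';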

Questions:

   - Did I miss some other database that could work? Please suggest one if you
   know of any.
   - What are the benefits or drawbacks of leveled compaction for this
   workload?
   - Setting a column TTL seems like a bad choice due to the extra storage.
   Agree? Is it efficient to run a routine batch job to purge the oldest data
   instead? Will there be any gotcha with that (like a full scan of something
   instead of just the oldest data, maybe)? See the purge sketch after this
   list for what I mean.
   - Will a column index be beneficial? If reads are range scans, does it
   matter, or is it just extra work and storage space to maintain, without
   much benefit, especially since reads are rare?
   - How does gc_grace_seconds impact operations in this workload? Will purges
   of old data leave SSTables mostly obsolete, rather than sparsely obsolete?
   I think they will, so after a purge the tombstones can be GC'd shortly,
   with no need for the default 10-day grace period. BUT, I read in the docs
   that if gc_grace_seconds is short, then nodetool repair needs to run quite
   often. Is that true? Why would that be needed in my use case?
   - Related question: is it sensible to set tombstone_threshold to 1.0 but
   tombstone_compaction_interval to something short, like 1 hour (as in the
   sketch after this list)? I suppose this depends on whether I am correct
   that SSTables will be deleted entirely, instead of just getting sparse.
   - Should I disable the row cache (row_cache_provider)? It invalidates a
   whole row on every update, right? I will be updating rows constantly, so it
   seems not beneficial. (The caching line in the sketch after this list is
   what I had in mind.)
   - The docs say "compaction_throughput_mb_per_sec" is for the "entire
   system." Does that mean per NODE, or per ENTIRE CLUSTER? Will this cause
   trouble with periodic deletions of expired columns? Do I need to make sure
   my purges of old data are trickled out over time to avoid a huge compaction
   overhead? But in that case, SSTables will become sparsely deleted, right?
   And then re-compacted, which seems wasteful if the remaining data will soon
   be purged again, causing yet another re-compaction. This is partially why I
   asked about tombstone_threshold and the compaction interval -- I think it
   is best if I can purge data in such a way that Cassandra never recompacts
   SSTables, but just realizes "oh, the whole thing is dead, I can delete it,
   no work needed." But I am not sure whether the settings I am considering
   will have unintended consequences.
   - Finally, with the proposed workload, will there be trouble with
   flush_largest_memtables_at, reduce_cache_capacity_to, and
   reduce_cache_sizes_at? These are described as "emergency measures" in the
   docs. If my workload is an edge case that could trigger bad
   emergency-measure behavior, I hope you can tell me that :-)
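
For reference, here is the kind of purge and table tuning I was imagining,
again just a sketch against the hypothetical ticks table above. The tombstone
subproperty values are the ones I am asking about, not a recommendation, and
the caching line relates to my row cache question:

    -- Purge: delete whole expired (series, day) partitions, so an SSTable
    -- covering an old day becomes entirely obsolete rather than sparse.
    -- (One such DELETE per series, issued from a routine batch job.)
    DELETE FROM ticks WHERE series_id = 42 AND day = 20130205;

    -- The settings I am asking about, expressed as table properties
    -- (tombstone_compaction_interval is in seconds, so 3600 = 1 hour):
    ALTER TABLE ticks
      WITH caching = 'keys_only'
       AND compaction = {
             'class': 'SizeTieredCompactionStrategy',
             'tombstone_threshold': '1.0',
             'tombstone_compaction_interval': '3600'
           };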

Many thanks!

Boris
