From: Boris Solovyov
To: user@cassandra.apache.org
Date: Tue, 12 Feb 2013 15:08:21 -0500
Subject: Re: Seeking suggestions for a use case

Thanks. So in your use case, you actually keep parts of the same series in
different rows, to keep the rows from getting too wide? I thought Cassandra
worked OK with millions of columns per row. If I don't have to split a row
into parts, that keeps the data model simpler for me. (Otherwise, if I have
to split rows and reassemble them in client code, I could just as well use
an RDBMS :-)
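For concreteness, here is roughly what I have in mind, with and without
splitting a series across rows. Only a sketch, assuming CQL3 on Cassandra
1.2; the table and column names are made up for illustration:

    -- One row per series: the partition key is just the series id, and
    -- every 1-second sample becomes one clustered column.
    CREATE TABLE samples_single_row (
        series_id bigint,
        ts        timestamp,
        value     bigint,
        PRIMARY KEY (series_id, ts)
    ) WITH CLUSTERING ORDER BY (ts DESC);

    -- One row per series per day: a day bucket in the partition key caps
    -- row width at 86400 columns and spreads a series over the cluster.
    CREATE TABLE samples_bucketed (
        series_id bigint,
        day       int,        -- e.g. days since epoch
        ts        timestamp,
        value     bigint,
        PRIMARY KEY ((series_id, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC);

With the bucketed version, reading recent data means slicing at most one or
two day partitions per series, which still seems manageable from the client.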
On Tue, Feb 12, 2013 at 12:07 PM, Hiller, Dean wrote:

> We are using Cassandra for time series as well, with PlayOrm. A guess is
> we will be doing equal reads and writes on all the data going back 10
> years (currently in production we are write-heavy right now). We have
> 60,000 virtual tables (one table per sensor we read from, and yes, we
> have that many sensors). We partition with PlayOrm partitioning, one
> month's worth for each of the virtual tables. This gives us a wide row
> index into each partition that PlayOrm creates, and the rest of the data
> varies between very narrow tables (one column) and tables with around 20
> columns. It seems to be working extremely well so far, and we run it on
> 6 Cassandra nodes as well.
>
> Anyways, thought I would share, as perhaps it helps you understand your
> use case.
>
> Later,
> Dean
>
> On 2/12/13 8:08 AM, "Edward Capriolo" wrote:
>
> >Your use case is 100% on the money for Cassandra. But let me take a
> >chance to slam the other NoSQLs (not really slam, but you know).
> >
> >Riak is a key-value store. It is not a column family store where a
> >row key has a map of sorted values. This makes time series more
> >awkward, as the series has to span many rows rather than one large row.
> >
> >HBase has similar problems with time series. On one hand, if your row
> >keys are series you get hotspots; if your columns are time series you
> >run into two subtle issues. Last I checked, HBase's on-disk format
> >repeats the key each time (somewhat wasteful):
> >
> >key,column,value
> >key,column,value
> >key,column,value
> >
> >There are also issues with really big rows, although they are dealt
> >with in a similar way to really wide rows in Cassandra: just use time
> >as part of the row key and the rows will not get that large.
> >
> >I do not think you need leveled compaction for an append-only
> >workload, although it might be helpful depending on how long you want
> >to keep these rows. If you are not keeping them very long, leveled
> >compaction could keep the on-disk size smaller.
> >
> >Column TTLs in Cassandra do not require extra storage. They are a very
> >efficient way to do this. Otherwise you have to scan through your data
> >with some offline process and delete.
> >
> >Do not worry about gc_grace too much. The moral is that, because of
> >distributed deletes, some data lives on disk for a while after it is
> >deleted. All this means is that you need "some" more storage than just
> >the space for your live data.
> >
> >Don't use the row cache with wide rows. REPEAT: don't use the row
> >cache with wide rows.
> >
> >Compaction throughput is metered on each node (again, not a setting to
> >worry about).
> >
> >If you are hitting flush_largest_memtables_at and
> >reduce_cache_capacity_to, it basically means you have over-tuned or
> >you do not have enough hardware. These are mostly emergency valves,
> >and if you are set up well they are not a factor. They are only around
> >to relieve memory pressure and prevent the node from hitting a cycle
> >where it is in GC more than it is in serving mode.
> >
> >Whew!
> >
> >Anyway, nice to see that you are trying to understand the knobs before
> >kicking the tires.
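One follow-up on the TTL point, so I am sure I understand: you mean a
per-write TTL like the one below, not something I have to store separately?
Just a sketch against the hypothetical bucketed table from above, assuming
CQL3 (604800 = one week in seconds):

    -- Each sample expires on its own after one week; Cassandra tombstones
    -- it and compaction eventually drops it, so no separate purge job.
    INSERT INTO samples_bucketed (series_id, day, ts, value)
    VALUES (42, 15748, '2013-02-12 15:08:21', 1001)
    USING TTL 604800;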
> >On Tue, Feb 12, 2013 at 5:55 AM, Boris Solovyov wrote:
> >> Hello list!
> >>
> >> I have an application with the following characteristics:
> >>
> >> data is time series, tens of millions of series at 1-sec granularity,
> >> like stock ticker data
> >> values are timestamp, integer (uint64)
> >> data is append only, never updated
> >> data is never written far in the past, maybe sometimes 10 sec ago but
> >> not more
> >> data is write-mostly, like 99.9% writes I think
> >> most reads will be of recent data, always over a range of timestamps
> >> data needs purging after some time, e.g. 1 week
> >>
> >> I am considering Cassandra. No other existing database (HBase, Riak,
> >> etc.) seems well suited for this.
> >>
> >> Questions:
> >>
> >> Did I miss some other database that could work? Please suggest one if
> >> you know it.
> >> What are the benefits or drawbacks of leveled compaction for this
> >> workload?
> >> Setting a column TTL seems a bad choice due to extra storage. Agree?
> >> Is it efficient to run a routine batch job to purge the oldest data?
> >> Will there be any gotcha with that (like a full scan of something
> >> instead of just the oldest data, maybe)?
> >> Will a column index be beneficial? If reads are scans, does it matter,
> >> or is it just extra work and storage space to maintain, without much
> >> benefit, especially since reads are rare?
> >> How does gc_grace_seconds impact operations in this workload? Will
> >> purges of old data leave SSTables mostly obsolete, rather than
> >> sparsely obsolete? I think they will. So, after a purge, tombstones
> >> can be GCed shortly, with no need for the default 10 days grace
> >> period. BUT, I read in the docs that if gc_grace_seconds is short,
> >> then nodetool repair needs to run quite often. Is that true? Why would
> >> that be needed in my use case?
> >> Related question: is it sensible to set tombstone_threshold to 1.0 but
> >> tombstone_compaction_interval to something short, like 1 hour? I
> >> suppose this depends on whether I am correct that SSTables will be
> >> deleted entirely, instead of just getting sparse.
> >> Should I disable row_cache_provider? It invalidates every row on
> >> update, right? I will be updating rows constantly, so it seems not
> >> beneficial.
> >> Docs say "compaction_throughput_mb_per_sec" is per "entire system."
> >> Does that mean per NODE, or per ENTIRE CLUSTER? Will this cause
> >> trouble with periodic deletions of expired columns? Do I need to make
> >> sure my purges of old data are trickled out over time to avoid huge
> >> compaction overhead? But in that case, SSTables will become sparsely
> >> deleted, right? And then re-compacted, which seems wasteful if the
> >> remaining data will soon be purged again and there will be another
> >> re-compaction. This is partially why I asked about tombstone_threshold
> >> and the compaction interval -- I think it is best if I can purge data
> >> in such a way that Cassandra never recompacts SSTables, but just
> >> realizes "oh, the whole thing is dead, I can delete it, no work
> >> needed." But I am not sure if my considered settings will have
> >> unintended consequences.
> >> Finally, with the proposed workload, will there be trouble with
> >> flush_largest_memtables_at, reduce_cache_capacity_to, and
> >> reduce_cache_sizes_at? These are described as "emergency measures" in
> >> the docs. If my workload is an edge case that could trigger bad
> >> emergency-measure behavior, I hope you can tell me that :-)
> >>
> >> Many thanks!
> >>
> >> Boris
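P.S. To make my tombstone question concrete, these are the table settings I
was thinking of. Again only a sketch on the hypothetical bucketed table; I
am assuming tombstone_threshold and tombstone_compaction_interval are
accepted as compaction sub-options in 1.2, and the one-hour values are just
examples:

    -- Short gc_grace_seconds because purges drop whole partitions, plus
    -- compaction sub-options intended to let mostly-dead SSTables be
    -- discarded rather than rewritten.
    ALTER TABLE samples_bucketed
    WITH gc_grace_seconds = 3600
    AND compaction = {
        'class': 'SizeTieredCompactionStrategy',
        'tombstone_threshold': '1.0',
        'tombstone_compaction_interval': '3600'
    };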