From: Boris Solovyov
To: user@cassandra.apache.org
Date: Tue, 12 Feb 2013 15:08:21 -0500
Subject: Re: Seeking suggestions for a use case

Thanks. So in your use case, you actually keep parts of the same series in
different rows, to keep the rows from getting too wide? I thought Cassandra
worked OK with millions of columns per row. If I don't have to split a row
into parts, that keeps the data model simpler for me. (Otherwise, if I have
to split rows and reassemble them in client code, I could just as well use
an RDBMS :-)
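For concreteness, here is roughly what I have in mind, with and without
splitting a series across rows. Only a sketch, assuming CQL3 on Cassandra
1.2; the table and column names are made up for illustration:

    -- One row per series: the partition key is just the series id, and
    -- every 1-second sample becomes one clustered column.
    CREATE TABLE samples_single_row (
        series_id bigint,
        ts        timestamp,
        value     bigint,
        PRIMARY KEY (series_id, ts)
    ) WITH CLUSTERING ORDER BY (ts DESC);

    -- One row per series per day: a day bucket in the partition key caps
    -- row width at 86400 columns and spreads a series over the cluster.
    CREATE TABLE samples_bucketed (
        series_id bigint,
        day       int,        -- e.g. days since epoch
        ts        timestamp,
        value     bigint,
        PRIMARY KEY ((series_id, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC);

With the bucketed version, reading recent data means slicing at most one or
two day partitions per series, which still seems manageable from the client.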
On Tue, Feb 12, 2013 at 12:07 PM, Hiller, Dean wrote:

> We are using Cassandra for time series as well, with PlayOrm. A guess is
> we will be doing equal reads and writes on all the data going back 10
> years (currently in production we are write-heavy right now). We have
> 60,000 virtual tables (one table per sensor we read from, and yes, we
> have that many sensors). We partition with PlayOrm partitioning, one
> month's worth for each of the virtual tables. This gives us a wide row
> index into each partition that PlayOrm creates, and the rest of the data
> varies between very narrow tables (one column) and tables with around 20
> columns. It seems to be working extremely well so far, and we run it on
> 6 Cassandra nodes as well.
>
> Anyways, thought I would share, as perhaps it helps you understand your
> use case.
>
> Later,
> Dean
>
> On 2/12/13 8:08 AM, "Edward Capriolo" wrote:
>
> >Your use case is 100% on the money for Cassandra. But let me take a
> >chance to slam the other NoSQLs (not really slam, but you know).
> >
> >Riak is a key-value store. It is not a column family store where a
> >row key has a map of sorted values. This makes time series more
> >awkward, as the series has to span many rows rather than one large row.
> >
> >HBase has similar problems with time series. On one hand, if your row
> >keys are series you get hotspots; if your columns are time series you
> >run into two subtle issues. Last I checked, HBase's on-disk format
> >repeats the key each time (somewhat wasteful):
> >
> >key,column,value
> >key,column,value
> >key,column,value
> >
> >There are also issues with really big rows, although they are dealt
> >with in a similar way to really wide rows in Cassandra: just use time
> >as part of the row key and the rows will not get that large.
> >
> >I do not think you need leveled compaction for an append-only
> >workload, although it might be helpful depending on how long you want
> >to keep these rows. If you are not keeping them very long, leveled
> >compaction could keep the on-disk size smaller.
> >
> >Column TTLs in Cassandra do not require extra storage. They are a very
> >efficient way to do this. Otherwise you have to scan through your data
> >with some offline process and delete.
> >
> >Do not worry about gc_grace too much. The moral is that, because of
> >distributed deletes, some data lives on disk for a while after it is
> >deleted. All this means is that you need "some" more storage than just
> >the space for your live data.
> >
> >Don't use the row cache with wide rows. REPEAT: don't use the row
> >cache with wide rows.
> >
> >Compaction throughput is metered on each node (again, not a setting to
> >worry about).
> >
> >If you are hitting flush_largest_memtables_at and
> >reduce_cache_capacity_to, it basically means you have over-tuned or
> >you do not have enough hardware. These are mostly emergency valves,
> >and if you are set up well they are not a factor. They are only around
> >to relieve memory pressure and prevent the node from hitting a cycle
> >where it is in GC more than it is in serving mode.
> >
> >Whew!
> >
> >Anyway, nice to see that you are trying to understand the knobs before
> >kicking the tires.
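One follow-up on the TTL point, so I am sure I understand: you mean a
per-write TTL like the one below, not something I have to store separately?
Just a sketch against the hypothetical bucketed table from above, assuming
CQL3 (604800 = one week in seconds):

    -- Each sample expires on its own after one week; Cassandra tombstones
    -- it and compaction eventually drops it, so no separate purge job.
    INSERT INTO samples_bucketed (series_id, day, ts, value)
    VALUES (42, 15748, '2013-02-12 15:08:21', 1001)
    USING TTL 604800;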
> >On Tue, Feb 12, 2013 at 5:55 AM, Boris Solovyov wrote:
> >> Hello list!
> >>
> >> I have an application with the following characteristics:
> >>
> >> data is time series, tens of millions of series at 1-sec granularity,
> >> like stock ticker data
> >> values are timestamp, integer (uint64)
> >> data is append only, never updated
> >> data is never written far in the past, maybe sometimes 10 sec ago but
> >> not more
> >> data is write-mostly, like 99.9% writes I think
> >> most reads will be of recent data, always over a range of timestamps
> >> data needs purging after some time, e.g. 1 week
> >>
> >> I am considering Cassandra. No other existing database (HBase, Riak,
> >> etc.) seems well suited for this.
> >>
> >> Questions:
> >>
> >> Did I miss some other database that could work? Please suggest one if
> >> you know it.
> >> What are the benefits or drawbacks of leveled compaction for this
> >> workload?
> >> Setting a column TTL seems a bad choice due to extra storage. Agree?
> >> Is it efficient to run a routine batch job to purge the oldest data?
> >> Will there be any gotcha with that (like a full scan of something
> >> instead of just the oldest data, maybe)?
> >> Will a column index be beneficial? If reads are scans, does it matter,
> >> or is it just extra work and storage space to maintain, without much
> >> benefit, especially since reads are rare?
> >> How does gc_grace_seconds impact operations in this workload? Will
> >> purges of old data leave SSTables mostly obsolete, rather than
> >> sparsely obsolete? I think they will. So, after a purge, tombstones
> >> can be GCed shortly, with no need for the default 10 days grace
> >> period. BUT, I read in the docs that if gc_grace_seconds is short,
> >> then nodetool repair needs to run quite often. Is that true? Why would
> >> that be needed in my use case?
> >> Related question: is it sensible to set tombstone_threshold to 1.0 but
> >> tombstone_compaction_interval to something short, like 1 hour? I
> >> suppose this depends on whether I am correct that SSTables will be
> >> deleted entirely, instead of just getting sparse.
> >> Should I disable row_cache_provider? It invalidates every row on
> >> update, right? I will be updating rows constantly, so it seems not
> >> beneficial.
> >> Docs say "compaction_throughput_mb_per_sec" is per "entire system."
> >> Does that mean per NODE, or per ENTIRE CLUSTER? Will this cause
> >> trouble with periodic deletions of expired columns? Do I need to make
> >> sure my purges of old data are trickled out over time to avoid huge
> >> compaction overhead? But in that case, SSTables will become sparsely
> >> deleted, right? And then re-compacted, which seems wasteful if the
> >> remaining data will soon be purged again and there will be another
> >> re-compaction. This is partially why I asked about tombstone_threshold
> >> and the compaction interval -- I think it is best if I can purge data
> >> in such a way that Cassandra never recompacts SSTables, but just
> >> realizes "oh, the whole thing is dead, I can delete it, no work
> >> needed." But I am not sure if my considered settings will have
> >> unintended consequences.
> >> Finally, with the proposed workload, will there be trouble with
> >> flush_largest_memtables_at, reduce_cache_capacity_to, and
> >> reduce_cache_sizes_at? These are described as "emergency measures" in
> >> the docs. If my workload is an edge case that could trigger bad
> >> emergency-measure behavior, I hope you can tell me that :-)
> >>
> >> Many thanks!
> >>
> >> Boris
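P.S. To make my tombstone question concrete, these are the table settings I
was thinking of. Again only a sketch on the hypothetical bucketed table; I
am assuming tombstone_threshold and tombstone_compaction_interval are
accepted as compaction sub-options in 1.2, and the one-hour values are just
examples:

    -- Short gc_grace_seconds because purges drop whole partitions, plus
    -- compaction sub-options intended to let mostly-dead SSTables be
    -- discarded rather than rewritten.
    ALTER TABLE samples_bucketed
    WITH gc_grace_seconds = 3600
    AND compaction = {
        'class': 'SizeTieredCompactionStrategy',
        'tombstone_threshold': '1.0',
        'tombstone_compaction_interval': '3600'
    };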