Subject: Re: High disk I/O during reads
From: Jon Scarborough
To: user@cassandra.apache.org
Date: Fri, 22 Mar 2013 09:42:34 -0700

Key distribution across SSTables probably varies a lot from row to row in our
case. Most reads would probably only need to look at a few SSTables; a few
might need to look at more.

I don't yet have a deep understanding of C* internals, but I would imagine
even the more expensive use cases would involve something like this:

1) Check the index for each SSTable to determine whether part of the row is
   there.
2) Look at the endpoints of the slice to determine whether the data in a
   particular SSTable is relevant to the query.
3) Read the chunks of those SSTables, working backwards from the end of the
   slice until enough columns have been read to satisfy the limit clause in
   the query.

So I would have guessed that even the more expensive queries on wide rows
typically wouldn't need to read more than a few hundred KB from disk to do
all that. Seems like I'm missing something major.
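For what it's worth, the back-of-envelope arithmetic behind that guess is
sketched below. The SSTable count per read, the chunks touched per SSTable,
and the 64 kB compression chunk size are my assumptions, not measurements
from the cluster:

    public class ExpectedReadSize {
        public static void main(String[] args) {
            // Assumed figures -- none of these are measured from the cluster.
            int sstablesWithRowData = 3;  // SSTables that actually hold part of the row
            int chunksPerSstable    = 2;  // chunks a 50-column slice might span per SSTable
            int chunkSizeKb         = 64; // compression chunk size, assuming the default

            int kbPerQuery = sstablesWithRowData * chunksPerSstable * chunkSizeKb;
            System.out.println("Expected disk read per query: ~" + kbPerQuery + " kB");
            // Prints ~384 kB; even touching all 15 SSTables would only be ~1920 kB.
        }
    }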
Here's the complete CF definition, including compression settings:

CREATE COLUMNFAMILY conversation_text_message (
  conversation_key bigint PRIMARY KEY
) WITH
  comment='' AND
  comparator='CompositeType(org.apache.cassandra.db.marshal.DateType,org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.AsciiType,org.apache.cassandra.db.marshal.AsciiType)' AND
  read_repair_chance=0.100000 AND
  gc_grace_seconds=864000 AND
  default_validation=text AND
  min_compaction_threshold=4 AND
  max_compaction_threshold=32 AND
  replicate_on_write=True AND
  compaction_strategy_class='SizeTieredCompactionStrategy' AND
  compression_parameters:sstable_compression='org.apache.cassandra.io.compress.SnappyCompressor';
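In case it's useful context, the composite column key on the Astyanax side
corresponds to that comparator roughly as sketched below. The field names are
stand-ins; only the component types and their order come from the comparator:

    import java.util.Date;

    import com.netflix.astyanax.annotations.Component;
    import com.netflix.astyanax.model.ColumnFamily;
    import com.netflix.astyanax.serializers.AnnotatedCompositeSerializer;
    import com.netflix.astyanax.serializers.LongSerializer;

    // Mirrors CompositeType(DateType, LongType, AsciiType, AsciiType);
    // the field names here are illustrative guesses.
    public class ConversationTextMessageKey {
        @Component(ordinal = 0) public Date   messageDate;
        @Component(ordinal = 1) public Long   messageId;
        @Component(ordinal = 2) public String senderId;    // AsciiType component
        @Component(ordinal = 3) public String messageType; // AsciiType component
    }

    // Serializer and column family handle that the query further down would use.
    class ConversationTextMessageCf {
        static final AnnotatedCompositeSerializer<ConversationTextMessageKey> textMessageSerializer =
                new AnnotatedCompositeSerializer<ConversationTextMessageKey>(ConversationTextMessageKey.class);

        static final ColumnFamily<Long, ConversationTextMessageKey> CF_CONVERSATION_TEXT_MESSAGE =
                ColumnFamily.newColumnFamily(
                        "conversation_text_message", // row key is the bigint conversation_key
                        LongSerializer.get(),
                        textMessageSerializer);
    }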
Much thanks for any additional ideas.

-Jon

On Fri, Mar 22, 2013 at 8:15 AM, Hiller, Dean <Dean.Hiller@nrel.gov> wrote:
> Did you mean to ask "are 'all' your keys spread across all SSTables"? I am
> guessing at your intention.
>
> I mean I would very well hope my keys are spread across all sstables, or
> otherwise that sstable should not be there since it has no keys in it ;).
>
> And I know we had HUGE disk usage from the duplication in our sstables on
> size-tiered compaction... We never ran a major compaction, but after we
> switched to LCS we went from 300G to something like 120G, which was nice.
> We only have 300 data-point posts per second, so not an extreme write load
> on 6 nodes, though those posts do cause reads to check authorization and
> such in our system.
>
> Dean
>
> From: Kanwar Sangha <kanwar@mavenir.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Friday, March 22, 2013 8:38 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: RE: High disk I/O during reads
>
> Are your keys spread across all SSTables? That will cause every SSTable to
> be read, which will increase the I/O.
>
> What compaction are you using?
>
> From: zodiak@fifth-aeon.net [mailto:zodiak@fifth-aeon.net] On Behalf Of Jon Scarborough
> Sent: 21 March 2013 23:00
> To: user@cassandra.apache.org
> Subject: High disk I/O during reads
>
> Hello,
>
> We've had a 5-node C* cluster (version 1.1.0) running for several months.
> Up until now we've mostly been writing data, but now we're starting to
> service more read traffic. We're seeing far more disk I/O to service these
> reads than I would have anticipated.
>
> The CF being queried consists of chat messages. Each row represents a
> conversation between two people. Each column represents a message. The
> column key is composite, consisting of the message date and a few other
> bits of information. The CF is using compression.
>
> The query is looking for a maximum of 50 messages between two dates, in
> reverse order. Usually the two dates used as endpoints are 30 days ago and
> the current time. The query in Astyanax looks like this:
>
>     ColumnList<ConversationTextMessageKey> result = keyspace.prepareQuery(CF_CONVERSATION_TEXT_MESSAGE)
>             .setConsistencyLevel(ConsistencyLevel.CL_QUORUM)
>             .getKey(conversationKey)
>             .withColumnRange(
>                     textMessageSerializer.makeEndpoint(endDate, Equality.LESS_THAN).toBytes(),
>                     textMessageSerializer.makeEndpoint(startDate, Equality.GREATER_THAN_EQUALS).toBytes(),
>                     true,
>                     maxMessages)
>             .execute()
>             .getResult();
>
> We're currently servicing around 30 of these queries per second.
>
> Here's what the cfstats for the CF look like:
>
>         Column Family: conversation_text_message
>         SSTable count: 15
>         Space used (live): 211762982685
>         Space used (total): 211762982685
>         Number of Keys (estimate): 330118528
>         Memtable Columns Count: 68063
>         Memtable Data Size: 53093938
>         Memtable Switch Count: 9743
>         Read Count: 4313344
>         Read Latency: 118.831 ms.
>         Write Count: 817876950
>         Write Latency: 0.023 ms.
>         Pending Tasks: 0
>         Bloom Filter False Positives: 6055
>         Bloom Filter False Ratio: 0.00260
>         Bloom Filter Space Used: 686266048
>         Compacted row minimum size: 87
>         Compacted row maximum size: 14530764
>         Compacted row mean size: 1186
>
> On the C* nodes, iostat output like this is typical, and can spike to be
> much worse:
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            1.91    0.00    2.08   30.66    0.50   64.84
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> xvdap1            0.13         0.00         1.07          0         16
> xvdb            474.20     13524.53        25.33     202868        380
> xvdc            469.87     13455.73        30.40     201836        456
> md0             972.13     26980.27        55.73     404704        836
>
> Any thoughts on what could be causing read I/O to the disk from these
> queries?
>
> Much thanks!
>
> -Jon
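P.S. Relating the iostat numbers above back to the query rate gives a rough
per-read figure that is well beyond my few-hundred-kB estimate. The sketch
below assumes RF=3 with QUORUM reads touching data on roughly two replicas
per query, and reads spread evenly across the five nodes; neither assumption
is confirmed:

    public class PerReadDiskEstimate {
        public static void main(String[] args) {
            // Figures quoted in the thread.
            double clusterQueriesPerSec = 30.0;     // application query rate
            double nodeReadKbPerSec     = 26980.27; // md0 kB_read/s from iostat
            int    nodes                = 5;

            // Assumption: RF=3 and CL_QUORUM, so each query hits ~2 replicas.
            int replicasTouchedPerQuery = 2;

            double replicaReadsPerNodePerSec =
                    clusterQueriesPerSec * replicasTouchedPerQuery / nodes; // ~12 per node
            double kbPerReplicaRead = nodeReadKbPerSec / replicaReadsPerNodePerSec;

            System.out.printf("~%.0f kB read from disk per replica read%n", kbPerReplicaRead);
            // Prints ~2248 kB -- roughly an order of magnitude above a few hundred kB.
        }
    }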
