From: Wei Zhu
Date: Fri, 22 Mar 2013 10:43:44 -0700 (PDT)
Subject: Re: High disk I/O during reads
To: user@cassandra.apache.org

According to your cfstats, read latency is over 100 ms, which is really, really slow. I am seeing reads of under 3 ms on my cluster, which is on SSDs.
Can you also check nodetool cfhistograms? It tells you more about the number of SSTables involved and the read/write latency distribution. Sometimes the average doesn't tell you the whole story.

Also check your nodetool tpstats: are there a lot of dropped reads?

-Wei

----- Original Message -----
From: "Jon Scarborough"
To: user@cassandra.apache.org
Sent: Friday, March 22, 2013 9:42:34 AM
Subject: Re: High disk I/O during reads

Key distribution across SSTables probably varies a lot from row to row in our case. Most reads would probably only need to look at a few SSTables; a few might need to look at more.

I don't yet have a deep understanding of C* internals, but I would imagine even the more expensive use cases would involve something like this:

1) Check the index for each SSTable to determine if part of the row is there.
2) Look at the endpoints of the slice to determine if the data in a particular SSTable is relevant to the query.
3) Read the chunks of those SSTables, working backwards from the end of the slice until enough columns have been read to satisfy the limit clause in the query.

So I would have guessed that even the more expensive queries on wide rows typically wouldn't need to read more than a few hundred KB from disk to do all that. It seems like I'm missing something major.
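The SSTable-filtering steps above could be sketched roughly like this. To be clear, every type and method name here is a hypothetical stand-in for illustration, not actual Cassandra internals:

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch of steps 1-2 of the read path described above: narrow the
// set of SSTables by row key, then by slice endpoints. Hypothetical code,
// not real Cassandra classes.
public class ReadPathSketch {
    static class SSTable {
        long minKey, maxKey;       // key range this table's index covers
        long minColumn, maxColumn; // min/max column (e.g. timestamp) stored

        boolean mayContainKey(long key) {
            return key >= minKey && key <= maxKey;
        }

        boolean overlapsSlice(long sliceStart, long sliceEnd) {
            return minColumn <= sliceEnd && maxColumn >= sliceStart;
        }
    }

    // Returns the SSTables a slice query would actually have to touch;
    // step 3 would then read chunks from these, newest data first.
    static List<SSTable> candidates(List<SSTable> all, long key,
                                    long sliceStart, long sliceEnd) {
        List<SSTable> hits = new ArrayList<>();
        for (SSTable t : all) {
            if (t.mayContainKey(key) && t.overlapsSlice(sliceStart, sliceEnd)) {
                hits.add(t);
            }
        }
        return hits;
    }
}
```

The point of the sketch: if most of the 15 SSTables pass both filters for a hot row, every query pays a disk seek and a chunk read per table, which multiplies I/O quickly.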
Here's the complete CF definition, including compression settings:

    CREATE COLUMNFAMILY conversation_text_message (
        conversation_key bigint PRIMARY KEY
    ) WITH
        comment='' AND
        comparator='CompositeType(org.apache.cassandra.db.marshal.DateType,org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.AsciiType,org.apache.cassandra.db.marshal.AsciiType)' AND
        read_repair_chance=0.100000 AND
        gc_grace_seconds=864000 AND
        default_validation=text AND
        min_compaction_threshold=4 AND
        max_compaction_threshold=32 AND
        replicate_on_write=True AND
        compaction_strategy_class='SizeTieredCompactionStrategy' AND
        compression_parameters:sstable_compression='org.apache.cassandra.io.compress.SnappyCompressor';

Much thanks for any additional ideas.

-Jon

On Fri, Mar 22, 2013 at 8:15 AM, Hiller, Dean < Dean.Hiller@nrel.gov > wrote:

Did you mean to ask "are 'all' your keys spread across all SSTables"? I am guessing at your intention.

I mean, I would very much hope my keys are spread across all SSTables; otherwise an SSTable with no keys in it shouldn't be there at all ;).

And I know we had HUGE disk usage from the duplication in our SSTables on size-tiered compaction. We never ran a major compaction, but after we switched to LCS we went from 300 GB to something like 120 GB, which was nice. We only have 300 data-point posts per second, so not an extreme write load on 6 nodes, though these posts do cause reads to check authorization and such in our system.

Dean

From: Kanwar Sangha < kanwar@mavenir.com >
Reply-To: " user@cassandra.apache.org " < user@cassandra.apache.org >
Date: Friday, March 22, 2013 8:38 AM
To: " user@cassandra.apache.org " < user@cassandra.apache.org >
Subject: RE: High disk I/O during reads

Are your keys spread across all SSTables? That will cause every SSTable to be read, which will increase the I/O.

What compaction are you using?

From: zodiak@fifth-aeon.net [mailto: zodiak@fifth-aeon.net ] On Behalf Of Jon Scarborough
Sent: 21 March 2013 23:00
To: user@cassandra.apache.org
Subject: High disk I/O during reads

Hello,

We've had a 5-node C* cluster (version 1.1.0) running for several months. Up until now we've mostly been writing data, but now we're starting to service more read traffic. We're seeing far more disk I/O to service these reads than I would have anticipated.

The CF being queried consists of chat messages. Each row represents a conversation between two people. Each column represents a message. The column key is composite, consisting of the message date and a few other bits of information. The CF is using compression.

The query looks for a maximum of 50 messages between two dates, in reverse order. Usually the two endpoints are 30 days ago and the current time. The query in Astyanax looks like this:

    ColumnList result = keyspace.prepareQuery(CF_CONVERSATION_TEXT_MESSAGE)
        .setConsistencyLevel(ConsistencyLevel.CL_QUORUM)
        .getKey(conversationKey)
        .withColumnRange(
            textMessageSerializer.makeEndpoint(endDate, Equality.LESS_THAN).toBytes(),
            textMessageSerializer.makeEndpoint(startDate, Equality.GREATER_THAN_EQUALS).toBytes(),
            true,
            maxMessages)
        .execute()
        .getResult();

We're currently servicing around 30 of these queries per second.

Here's what the cfstats for the CF look like:

    Column Family: conversation_text_message
    SSTable count: 15
    Space used (live): 211762982685
    Space used (total): 211762982685
    Number of Keys (estimate): 330118528
    Memtable Columns Count: 68063
    Memtable Data Size: 53093938
    Memtable Switch Count: 9743
    Read Count: 4313344
    Read Latency: 118.831 ms.
    Write Count: 817876950
    Write Latency: 0.023 ms.
    Pending Tasks: 0
    Bloom Filter False Positives: 6055
    Bloom Filter False Ratio: 0.00260
    Bloom Filter Space Used: 686266048
    Compacted row minimum size: 87
    Compacted row maximum size: 14530764
    Compacted row mean size: 1186

On the C* nodes, iostat output like this is typical, and can spike to be much worse:

    avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
               1.91   0.00     2.08    30.66    0.50  64.84

    Device:   tps     kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
    xvdap1    0.13    0.00       1.07       0        16
    xvdb      474.20  13524.53   25.33      202868   380
    xvdc      469.87  13455.73   30.40      201836   456
    md0       972.13  26980.27   55.73      404704   836

Any thoughts on what could be causing read I/O to the disk from these queries?

Much thanks!

-Jon
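A back-of-the-envelope calculation on the numbers in this thread, assuming the md0 read throughput is driven almost entirely by these queries, shows why the I/O looks high relative to Jon's "few hundred KB per query" expectation:

```java
// Rough read-amplification estimate from the iostat and query-rate figures
// quoted in this thread. Assumes md0's reads are all query-driven.
public class ReadAmplification {
    public static void main(String[] args) {
        double kbReadPerSec = 26980.27; // md0 kB_read/s from the iostat output
        double queriesPerSec = 30.0;    // query rate Jon reports
        double kbPerQuery = kbReadPerSec / queriesPerSec;
        // About 899 kB read from disk per query, several times the
        // "few hundred KB" expected, consistent with each read touching
        // many of the 15 SSTables.
        System.out.printf("%.0f kB read per query%n", kbPerQuery);
    }
}
```

Dividing that per-query figure by the 15 SSTables gives roughly 60 kB per table touched, which is on the order of one or two compressed chunks plus index reads per SSTable.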