From: Wei Zhu
Date: Fri, 22 Mar 2013 10:43:44 -0700 (PDT)
Subject: Re: High disk I/O during reads
To: user@cassandra.apache.org

According to your cfstats, read latency is over 100 ms, which is really, really slow. I am seeing reads of under 3 ms on my cluster, which is on SSDs.
Can you also check nodetool cfhistograms? It tells you more about the number of SSTables involved and the read/write latency distribution. Sometimes the average doesn't tell you the whole story.

Also check your nodetool tpstats: are there a lot of dropped reads?

-Wei

----- Original Message -----
From: "Jon Scarborough"
To: user@cassandra.apache.org
Sent: Friday, March 22, 2013 9:42:34 AM
Subject: Re: High disk I/O during reads

Key distribution across SSTables probably varies a lot from row to row in our case. Most reads would probably only need to look at a few SSTables; a few might need to look at more.

I don't yet have a deep understanding of C* internals, but I would imagine even the more expensive use cases would involve something like this:

1) Check the index for each SSTable to determine if part of the row is there.
2) Look at the endpoints of the slice to determine if the data in a particular SSTable is relevant to the query.
3) Read the chunks of those SSTables, working backwards from the end of the slice until enough columns have been read to satisfy the limit clause in the query.

So I would have guessed that even the more expensive queries on wide rows typically wouldn't need to read more than a few hundred KB from disk to do all that. It seems like I'm missing something major.
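The SSTable-filtering steps above could be sketched roughly like this. To be clear, every type and method name here is a hypothetical stand-in for illustration, not actual Cassandra internals:

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch of steps 1-2 of the read path described above: narrow the
// set of SSTables by row key, then by slice endpoints. Hypothetical code,
// not real Cassandra classes.
public class ReadPathSketch {
    static class SSTable {
        long minKey, maxKey;       // key range this table's index covers
        long minColumn, maxColumn; // min/max column (e.g. timestamp) stored

        boolean mayContainKey(long key) {
            return key >= minKey && key <= maxKey;
        }

        boolean overlapsSlice(long sliceStart, long sliceEnd) {
            return minColumn <= sliceEnd && maxColumn >= sliceStart;
        }
    }

    // Returns the SSTables a slice query would actually have to touch;
    // step 3 would then read chunks from these, newest data first.
    static List<SSTable> candidates(List<SSTable> all, long key,
                                    long sliceStart, long sliceEnd) {
        List<SSTable> hits = new ArrayList<>();
        for (SSTable t : all) {
            if (t.mayContainKey(key) && t.overlapsSlice(sliceStart, sliceEnd)) {
                hits.add(t);
            }
        }
        return hits;
    }
}
```

The point of the sketch: if most of the 15 SSTables pass both filters for a hot row, every query pays a disk seek and a chunk read per table, which multiplies I/O quickly.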
Here's the complete CF definition, including compression settings:

    CREATE COLUMNFAMILY conversation_text_message (
        conversation_key bigint PRIMARY KEY
    ) WITH
        comment='' AND
        comparator='CompositeType(org.apache.cassandra.db.marshal.DateType,org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.AsciiType,org.apache.cassandra.db.marshal.AsciiType)' AND
        read_repair_chance=0.100000 AND
        gc_grace_seconds=864000 AND
        default_validation=text AND
        min_compaction_threshold=4 AND
        max_compaction_threshold=32 AND
        replicate_on_write=True AND
        compaction_strategy_class='SizeTieredCompactionStrategy' AND
        compression_parameters:sstable_compression='org.apache.cassandra.io.compress.SnappyCompressor';

Much thanks for any additional ideas.

-Jon

On Fri, Mar 22, 2013 at 8:15 AM, Hiller, Dean < Dean.Hiller@nrel.gov > wrote:

Did you mean to ask "are 'all' your keys spread across all SSTables"? I am guessing at your intention.

I mean, I would very much hope my keys are spread across all SSTables; otherwise an SSTable with no keys in it shouldn't be there at all ;).

And I know we had HUGE disk usage from the duplication in our SSTables on size-tiered compaction. We never ran a major compaction, but after we switched to LCS we went from 300 GB to something like 120 GB, which was nice. We only have 300 data-point posts per second, so not an extreme write load on 6 nodes, though these posts do cause reads to check authorization and such in our system.

Dean

From: Kanwar Sangha < kanwar@mavenir.com >
Reply-To: " user@cassandra.apache.org " < user@cassandra.apache.org >
Date: Friday, March 22, 2013 8:38 AM
To: " user@cassandra.apache.org " < user@cassandra.apache.org >
Subject: RE: High disk I/O during reads

Are your keys spread across all SSTables? That will cause every SSTable to be read, which will increase the I/O.

What compaction are you using?

From: zodiak@fifth-aeon.net [mailto: zodiak@fifth-aeon.net ] On Behalf Of Jon Scarborough
Sent: 21 March 2013 23:00
To: user@cassandra.apache.org
Subject: High disk I/O during reads

Hello,

We've had a 5-node C* cluster (version 1.1.0) running for several months. Up until now we've mostly been writing data, but now we're starting to service more read traffic. We're seeing far more disk I/O to service these reads than I would have anticipated.

The CF being queried consists of chat messages. Each row represents a conversation between two people. Each column represents a message. The column key is composite, consisting of the message date and a few other bits of information. The CF is using compression.

The query looks for a maximum of 50 messages between two dates, in reverse order. Usually the two endpoints are 30 days ago and the current time. The query in Astyanax looks like this:

    ColumnList result = keyspace.prepareQuery(CF_CONVERSATION_TEXT_MESSAGE)
        .setConsistencyLevel(ConsistencyLevel.CL_QUORUM)
        .getKey(conversationKey)
        .withColumnRange(
            textMessageSerializer.makeEndpoint(endDate, Equality.LESS_THAN).toBytes(),
            textMessageSerializer.makeEndpoint(startDate, Equality.GREATER_THAN_EQUALS).toBytes(),
            true,
            maxMessages)
        .execute()
        .getResult();

We're currently servicing around 30 of these queries per second.

Here's what the cfstats for the CF look like:

    Column Family: conversation_text_message
    SSTable count: 15
    Space used (live): 211762982685
    Space used (total): 211762982685
    Number of Keys (estimate): 330118528
    Memtable Columns Count: 68063
    Memtable Data Size: 53093938
    Memtable Switch Count: 9743
    Read Count: 4313344
    Read Latency: 118.831 ms.
    Write Count: 817876950
    Write Latency: 0.023 ms.
    Pending Tasks: 0
    Bloom Filter False Positives: 6055
    Bloom Filter False Ratio: 0.00260
    Bloom Filter Space Used: 686266048
    Compacted row minimum size: 87
    Compacted row maximum size: 14530764
    Compacted row mean size: 1186

On the C* nodes, iostat output like this is typical, and can spike to be much worse:

    avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
               1.91   0.00     2.08    30.66    0.50  64.84

    Device:   tps     kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
    xvdap1    0.13    0.00       1.07       0        16
    xvdb      474.20  13524.53   25.33      202868   380
    xvdc      469.87  13455.73   30.40      201836   456
    md0       972.13  26980.27   55.73      404704   836

Any thoughts on what could be causing read I/O to the disk from these queries?

Much thanks!

-Jon
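A back-of-the-envelope calculation on the numbers in this thread, assuming the md0 read throughput is driven almost entirely by these queries, shows why the I/O looks high relative to Jon's "few hundred KB per query" expectation:

```java
// Rough read-amplification estimate from the iostat and query-rate figures
// quoted in this thread. Assumes md0's reads are all query-driven.
public class ReadAmplification {
    public static void main(String[] args) {
        double kbReadPerSec = 26980.27; // md0 kB_read/s from the iostat output
        double queriesPerSec = 30.0;    // query rate Jon reports
        double kbPerQuery = kbReadPerSec / queriesPerSec;
        // About 899 kB read from disk per query, several times the
        // "few hundred KB" expected, consistent with each read touching
        // many of the 15 SSTables.
        System.out.printf("%.0f kB read per query%n", kbPerQuery);
    }
}
```

Dividing that per-query figure by the 15 SSTables gives roughly 60 kB per table touched, which is on the order of one or two compressed chunks plus index reads per SSTable.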