Date: Tue, 16 Feb 2010 09:50:38 -0800
Subject: Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?
From: Weijun Li
To: cassandra-user@incubator.apache.org

I dumped 50 million records into my 2-node cluster overnight and, per Martin's suggestion, made sure there aren't many data files (only around 30). The size of the data directory is 63GB. Now when I read records from the cluster, the read latency is still ~44ms, with no writes happening during the reads. And iostat shows that the disk (RAID10, 4 x 250GB 15k SAS) is saturated:

Device:  rrqm/s  wrqm/s     r/s    w/s    rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda       47.67   67.67  190.33  17.00  23933.33  677.33    118.70      5.24  25.25   4.64  96.17
sda1       0.00    0.00    0.00   0.00      0.00    0.00      0.00      0.00   0.00   0.00   0.00
sda2      47.67   67.67  190.33  17.00  23933.33  677.33    118.70      5.24  25.25   4.64  96.17
sda3       0.00    0.00    0.00   0.00      0.00    0.00      0.00      0.00   0.00   0.00   0.00

CPU usage is low.
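
As a rough cross-check of the iostat numbers above, the sketch below recomputes a few derived figures from the sda row. It assumes the usual queueing identities (utilization = request rate x service time, queue length = request rate x wait time) and 512-byte sectors. The `IostatRow` class and `derived_metrics` function are hypothetical names made up for this illustration, not part of iostat or Cassandra. Note also that iostat's svctm is itself an estimate, so this is a consistency check rather than an independent measurement.

```python
from dataclasses import dataclass

SECTOR_BYTES = 512  # assumption: iostat counts rsec/s and wsec/s in 512-byte sectors


@dataclass
class IostatRow:
    """The subset of one iostat -x row used below (hypothetical helper type)."""
    r_per_s: float
    w_per_s: float
    rsec_per_s: float
    wsec_per_s: float
    await_ms: float
    svctm_ms: float


def derived_metrics(row: IostatRow) -> dict:
    """Recompute utilization, queue depth and request size from the raw rates."""
    iops = row.r_per_s + row.w_per_s
    sectors_per_request = (row.rsec_per_s + row.wsec_per_s) / iops
    return {
        "iops": iops,
        # Utilization law: fraction of time busy = request rate * service time.
        "util_pct": iops * row.svctm_ms / 1000.0 * 100.0,
        # Little's law: average queue length = request rate * time in system.
        "avg_queue": iops * row.await_ms / 1000.0,
        "avg_request_kib": sectors_per_request * SECTOR_BYTES / 1024.0,
        "read_mib_per_s": row.rsec_per_s * SECTOR_BYTES / (1024.0 * 1024.0),
        # Time a request spends waiting behind others before being serviced.
        "queue_wait_ms": row.await_ms - row.svctm_ms,
    }


# Values copied from the sda row in the iostat output above.
sda = IostatRow(
    r_per_s=190.33,
    w_per_s=17.00,
    rsec_per_s=23933.33,
    wsec_per_s=677.33,
    await_ms=25.25,
    svctm_ms=4.64,
)
metrics = derived_metrics(sda)

if __name__ == "__main__":
    for name, value in metrics.items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the recomputed utilization and queue length agree with the reported %util and avgqu-sz, and roughly 20.6 ms of the 25.25 ms await is time spent queued rather than being serviced. That is consistent with a disk saturated by a backlog of random reads, not with a CPU or network limit.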

Does this mean disk I/O is the bottleneck in my case? Will it help if I increase KCF to cache the entire sstable index?

Also, this is almost a read-only test. In reality our write/read ratio is close to 1:1, so I'm guessing read latency will go even higher in that case, because it will be difficult for Cassandra to find a good moment to compact data files that are busy being written.

Thanks,
-Weijun

On Tue, Feb 16, 2010 at 6:06 AM, Brandon Williams wrote:

> On Tue, Feb 16, 2010 at 2:32 AM, Dr. Martin Grabmüller <
> Martin.Grabmueller@eleven.de> wrote:
>
>> In my tests I have observed that good read latency depends on keeping
>> the number of data files low. In my current test setup, I have stored
>> 1.9 TB of data on a single node, which is in 21 data files, and read
>> latency is between 10 and 60ms (for small reads, larger read of course
>> take more time). In earlier stages of my test, I had up to 5000
>> data files, and read performance was quite bad: my configured 10-second
>> RPC timeout was regularly encountered.
>>
>
> I believe it is known that crossing sstables is O(NlogN) but I'm unable to
> find the ticket on this at the moment. Perhaps Stu Hood will jump in and
> enlighten me, but in any case I believe
> https://issues.apache.org/jira/browse/CASSANDRA-674 will eventually solve
> it.
>
> Keeping write volume low enough that compaction can keep up is one
> solution, and throwing hardware at the problem is another, if necessary.
> Also, the row caching in trunk (soon to be 0.6 we hope) helps greatly for
> repeat hits.
>
> -Brandon