Return-Path: Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: (qmail 13652 invoked from network); 21 Feb 2011 19:21:28 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 21 Feb 2011 19:21:28 -0000 Received: (qmail 30537 invoked by uid 500); 21 Feb 2011 19:21:25 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 30514 invoked by uid 500); 21 Feb 2011 19:21:22 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 30498 invoked by uid 99); 21 Feb 2011 19:21:21 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 21 Feb 2011 19:21:21 +0000 X-ASF-Spam-Status: No, hits=1.6 required=5.0 tests=FREEMAIL_FROM,FSL_RU_URL,RCVD_IN_DNSWL_LOW,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of jbellis@gmail.com designates 209.85.218.44 as permitted sender) Received: from [209.85.218.44] (HELO mail-yi0-f44.google.com) (209.85.218.44) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 21 Feb 2011 19:21:15 +0000 Received: by yie19 with SMTP id 19so2936604yie.31 for ; Mon, 21 Feb 2011 11:20:54 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc:content-type:content-transfer-encoding; bh=qqAu+5LSp9Ngf0kM9gWr6D8Bepi47M7AgCnWiweepmw=; b=s9cr7u4QzS7sh73vLINdqAZ1Gdgq3LGQEMV1iWzxZ13rCyeAmUqLimt+A8jLR/iRGY gVof/jBAoidjGlTG8gk69GGerpOndmT34zjj09IXLaNvPpeOqjvR08Wiy4spIzN4H4mk yDvwoKU52LRO00cFCAovo0WQ2zDnWf9ZMPpvY= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type:content-transfer-encoding; b=ttjxlJdmrr+8xBsna1RmKlix1dead4vBY3nozYOhn19runn/56sPpm+I/BwTQddWX5 DYVsX2/EwvDiJXCYcJh+fFU9xEqkK85whJIiu4YicbkZApql6ThnYU97o9kym2KOyG0y uYtbCO4SGOeKSLOg5yahOywVD467C27k06DEI= Received: by 10.236.185.202 with SMTP id u50mr3147884yhm.52.1298316054041; Mon, 21 Feb 2011 11:20:54 -0800 (PST) MIME-Version: 1.0 Received: by 10.236.105.237 with HTTP; Mon, 21 Feb 2011 11:20:04 -0800 (PST) In-Reply-To: <4D62AD6A.3030009@bk.ru> References: <4D5FD0CB.2040608@bk.ru> <4D5FF0B7.2090308@bk.ru> <4D6088D0.90403@bk.ru> <4D62556A.1020202@bk.ru> <4D62AD6A.3030009@bk.ru> From: Jonathan Ellis Date: Mon, 21 Feb 2011 13:20:04 -0600 Message-ID: Subject: Re: 0.7.2 slow memtables flushing To: dev@cassandra.apache.org Cc: Ivan Georgiev , user@cassandra.apache.org Content-Type: text/plain; charset=KOI8-R Content-Transfer-Encoding: quoted-printable X-Virus-Checked: Checked by ClamAV on apache.org BRAF.seek has not changed since 0.7.0. Here is the implementation: public void seek(long newPosition) throws IOException { current =3D newPosition; if (newPosition >=3D bufferEnd || newPosition < bufferOffset) { reBuffer(); // this will set bufferEnd for us } final int delta =3D (int) (newPosition - bufferOffset); buffer.position(delta); } In other words: if we are seeking outside of the part of the file that is currently buffered, we have to call rebuffer. Now, you might wonder why seek is being called at all in SSTW.append. Here is the relevant part (also unchanged since 0.7.0): // write placeholder for the row size, since we don't know it yet long sizePosition =3D dataFile.getFilePointer(); dataFile.writeLong(-1); // write out row data int columnCount =3D ColumnFamily.serializer().serializeWithIndexes(cf, dataFile); // seek back and write the row size (not including the size Long it= self) long endPosition =3D dataFile.getFilePointer(); dataFile.seek(sizePosition); dataFile.writeLong(endPosition - (sizePosition + 8)); So basically once your row size is larger than the BRAF buffer you can expect to do a lot of rebuffer calls during flush or compact. Sounds to me like the difference you are seeing is caused by your row size and not a regression in Cassandra. 2011/2/21 Ivan Georgiev : > Some more digging. This is the code path causing the excessive rebuffer() > calls. > > java.lang.Exception: Stack trace > =9A =9Aat java.lang.Thread.dumpStack(Unknown Source) > =9A =9Aat > org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRa= ndomAccessFile.java:204) > =9A =9Aat > org.apache.cassandra.io.util.BufferedRandomAccessFile.seek(BufferedRandom= AccessFile.java:394) > =9A =9Aat > org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:1= 48) > =9A =9Aat > org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:159) > =9A =9Aat org.apache.cassandra.db.Memtable.access$000(Memtable.java:49) > =9A =9Aat org.apache.cassandra.db.Memtable$1.runMayThrow(Memtable.java:17= 4) > =9A =9Aat > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30) > =9A =9Aat java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown = Source) > =9A =9Aat java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Sour= ce) > =9A =9Aat java.lang.Thread.run(Unknown Source) > > Basically the rebuffer() is being called through seek(). In 0.7.0 seek() > does not call rebuffer() while 0.7.1 and on does > and that is where the performance degradation is happening in my test cas= e. > Any ideas how to alleviate the problem ? > > Ivan > > On 21.2.2011 =C7. 14:07 =DE., Ivan Georgiev wrote: >> >> I did some very rough measurements in a desperate attempt to see if I ca= n >> find the issue (if there is an issue). >> Since I dont know the code base well enough i chose >> BufferedRandomAccessFile as my suspect, since it was rewritten from 0.7.= 0 to >> 0.7.1 >> >> I did rough measurements on how many times write() and rebuffer() were >> called, and how much time was spent in them. The measurements >> were printed out when close() on the file was called. Time measurements >> are rounded in seconds. I have included the measurements only for the >> large files that were flushed to disk. >> >> 0.7.0 >> Total time in reBuffer : 0 seconds. >> Total times called reBuffer : 2 >> Total time in write : 0 seconds. >> Total times called write : 579374 >> >> 0.7.2 >> Time spent in reBuffer : 67 seconds. >> Times called reBuffer : 30888 >> Time spent in write : 0 seconds. >> Times called write : 1884107 >> >> Seems like rebuffer is being called a lot more in 0.7.2 and thats where >> the performance degradation is coming from. >> >> rebuffer can end up being called from write itself, but since the time >> spent in write is insignificant, that means this excessive >> calling of rebuffer in 0.7.2 is not happening from the write path. >> >> Ivan >> >> >> On 20.2.2011 =C7. 05:21 =DE., Ivan Georgiev wrote: >>> >>> Is it possible that the changes to the BufferedRandomAccessFile.java >>> could be causing the issue ? >>> I think the most notable change there is using ByteBuffer instead of a >>> byte[] for the buffer. >>> >>> I will run more tests and see what comes out of it. >>> >>> Ivan >>> >>> On 20.2.2011 =C7. 05:03 =DE., Jonathan Ellis wrote: >>>> >>>> I guess you'll need to binary search through the 0.7.1 changes to find >>>> what made the difference. =9AI can't think of any obvious candidates. >>>> >>>> On Sat, Feb 19, 2011 at 10:32 AM, Ivan Georgiev =9Awrot= e: >>>>> >>>>> On 19.2.2011 =C7. 16:43 =DE., Jonathan Ellis wrote: >>>>>> >>>>>> Flush code didn't change between 0.7.0 and 0.7.2. There must be some >>>>>> other variable here. Memory pressure maybe? >>>>> >>>>> Cloud you please elaborate on that one ? >>>>> The conditions are exactly the same for the test with 0.7.0 and 0.7.2= . >>>>> By the way, 0.7.1 tests are similar to 0.7.2, while 0.7 early betas t= o >>>>> 0.7.0 >>>>> is fine. >>>>> >>>>> Ivan >>>>> >>>>> On 19.2.2011 =C7. 16:43 =DE., Jonathan Ellis wrote: >>>>>> >>>>>> Flush code didn't change between 0.7.0 and 0.7.2. There must be some >>>>>> other variable here. Memory pressure maybe? >>>>>> >>>>>> On Sat, Feb 19, 2011 at 8:16 AM, Ivan Georgiev >>>>>> =9Awrote: >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> I am testing 0.7.2 on a Windows 2003 x64 system(one node) and I am >>>>>>> having >>>>>>> the following problem. >>>>>>> My insertion speed is relatively slow, so the memtables do not get >>>>>>> full >>>>>>> and >>>>>>> the actual flushing is triggered by memtable_flush_after_mins, this >>>>>>> happens >>>>>>> on the hour mark. My problem with 0.7.2 is that when that happens, >>>>>>> the >>>>>>> cpu >>>>>>> spikes to 25% overall usage (x4 Xeon) and the operation takes >>>>>>> anywhere >>>>>>> from >>>>>>> 2 to 4 minutes, leaving the node not responding during that time. >>>>>>> This >>>>>>> has >>>>>>> forced to me to increase the rpc timeout option to beyond what i fe= el >>>>>>> comfortable with. >>>>>>> I have run multiple tests with 0.7.0 and 0.7.2 with the same datase= t >>>>>>> and >>>>>>> the >>>>>>> results are consistent. During the same operation 0.7.0 takes about >>>>>>> 10 >>>>>>> seconds to complete vs. 2 to 4 minutes for 0.7.2. I am attaching a >>>>>>> log >>>>>>> with >>>>>>> the timestamps from one such flushing of 0.7.2. Please let me know = if >>>>>>> there >>>>>>> is anything i can do to speed up and get results similar to 0.7.0. >>>>>>> >>>>>>> Regards: >>>>>>> Ivan >>>>>>> >>>>>>> This is the log of the operations which took most time during the >>>>>>> flush >>>>>>> operation. Using 0.7.0, with the same number of operations the >>>>>>> flushing >>>>>>> takes less than 10 seconds. >>>>>>> >>>>>>> INFO 01:36:44,906 Enqueuing flush of >>>>>>> Memtable-ArchiveFiles.6f776e65724944@1225856921(1619949 bytes, 3446= 7 >>>>>>> operations) >>>>>>> INFO 01:37:47,375 Completed flushing >>>>>>> C:\Storage\data\Storage\ArchiveFiles.68617368-f-3-Data.db (5549187 >>>>>>> bytes) >>>>>>> INFO 01:37:47,375 Writing >>>>>>> Memtable-ArchiveFiles.6e616d65@978152661(1619949 >>>>>>> bytes, 34467 operations) >>>>>>> INFO 01:37:47,375 Enqueuing flush of >>>>>>> Memtable-ArchiveFiles.706172656e74466f6c6465724944@2097700961(16199= 49 >>>>>>> bytes, >>>>>>> 34467 operations) >>>>>>> INFO 01:38:51,343 Completed flushing >>>>>>> C:\Storage\data\Storage\ArchiveFiles.6e616d65-f-3-Data.db (3820265 >>>>>>> bytes) >>>>>>> INFO 01:38:51,343 Writing >>>>>>> Memtable-ArchiveFiles.6f776e65724944@1225856921(1619949 bytes, 3446= 7 >>>>>>> operations) >>>>>>> >>>>>> >>>>> >>>> >>>> >>> >>> >> >> > > --=20 Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com