From: Jonathan Gray <jgray@fb.com>
To: dev@hbase.apache.org
Subject: RE: SILT - nice keyvalue store paper
Date: Mon, 24 Oct 2011 00:05:26 +0000

Oh, and when running these experiments, you should look at the impact of the order in which they are run, whether you run them multiple times per JVM instance, etc. Basically, you need to be cognizant of the HotSpot optimizations the JVM is doing at runtime.
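A bare-bones way to keep that out of your numbers (a sketch with made-up names, not taken from Akash's test file): run the workload several times in the same JVM and only compare the rounds after the JIT has settled.

    import java.util.concurrent.ConcurrentSkipListMap;

    // Toy harness: repeat the same workload in one JVM and discard the
    // first rounds, so the numbers don't mostly measure interpretation
    // and JIT compilation. The workload below is just a stand-in.
    public class WarmupAwareBench {

        // Placeholder workload: single-threaded puts into a CSLM.
        static void workload() {
            ConcurrentSkipListMap<Long, Long> map = new ConcurrentSkipListMap<Long, Long>();
            for (long i = 0; i < 1000000; i++) {
                map.put(i, i);
            }
        }

        public static void main(String[] args) {
            for (int round = 0; round < 5; round++) {
                long start = System.nanoTime();
                workload();
                long ms = (System.nanoTime() - start) / 1000000;
                // Only the rounds after warm-up are worth comparing.
                String tag = (round < 2) ? " (warm-up, ignore)" : "";
                System.out.println("round " + round + ": " + ms + "ms" + tag);
            }
        }
    }

Also alternate which map you exercise first between runs (or fork a fresh JVM per map) so neither benefits from being compiled and profiled first.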
> -----Original Message-----
> From: Jonathan Gray [mailto:jgray@fb.com]
> Sent: Sunday, October 23, 2011 4:20 PM
> To: dev@hbase.apache.org
> Subject: RE: SILT - nice keyvalue store paper
>
> Very nice experiment, Akash. Keep getting your hands dirty and digging! :)
>
> I think your results might change if you bump the test up to 1000 threads
> or so. 100 threads can still perform okay when there's a global lock, but
> the contention at 1000 threads will kill you, and that's when CSLM should
> do much better. (1000 handler threads is approx. what I run with on RS in
> prod.) Though I am a bit surprised that at 100 threads the TreeMap was
> significantly faster. Your inconsistent results are a bit odd; you might
> try an order of magnitude more operations per thread. You might also
> gather some statistics about tree size and per-operation latency.
>
> I've done some isolated CSLM benchmarks in the past and have never been
> able to reproduce any of the slowness people suggest. I recall trying some
> impractically large MemStores and everything still being quite fast.
>
> Over in Cassandra, I believe they have a two-level CSLM, with the first
> map key being the row and then the columns for each row in their own CSLM.
> I've been told this is somewhat of a pain point for them. And keep in mind
> they have one shard/region per node, while we generally have several
> smaller MemStores on each node (tens to thousands). Not sure we would want
> to try that. There could be some interesting optimizations if you had very
> specific issues; for example, if you had a ton of reads to MemStore and
> not many writes, you could keep some kind of mirrored hashmap.
>
> And for writes, the WAL is definitely the latency bottleneck. But if you
> are doing lots of small operations, so your WALEdits are not large, and
> with some of the HLog batching features going into trunk, you end up with
> hundreds of requests per HLog sync. And although the syncs are higher
> latency, with batching you end up getting high throughput. And the
> bottleneck shifts.
>
> Each sync will take approx. 1-5ms, so let's say 250 requests per HLog sync
> batch at 4ms per sync: that's 62.5k req/sec (62.5k req/sec * 100 bytes/req
> = ~6MB/sec, very reasonable). If you're mixing in reads as well (or if
> you're doing increments, which do a read and a write), then this adds to
> the CPU usage and contention without adding to HLog throughput.
>
> All of a sudden the bottleneck becomes CPU/contention and not HLog latency
> or throughput. Highly concurrent increments/counters with a largely
> in-memory dataset can easily be CPU bottlenecked.
>
> For one specific application Dhruba and I worked on, we made some good
> improvements in CPU efficiency by reducing the number of operations and
> increasing efficiency on the CSLM: doing things like always taking a
> tailMap and working from that instead of starting at the root node, using
> an iterator() and taking advantage of the available remove() semantics, or
> simply just mutating things that are normally immutable :) Unfortunately,
> many of these optimizations were semi-horrid hacks and introduced things
> like ModifiableKeyValues, so they haven't all made their way to Apache.
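>
> In code, the tailMap/iterator part looks roughly like this (a toy sketch
> with made-up keys, not the actual patch):
>
> import java.util.Iterator;
> import java.util.Map;
> import java.util.concurrent.ConcurrentSkipListMap;
>
> // Resume scans from the last key seen via tailMap instead of
> // re-walking the skip list from the head, and delete through the
> // iterator while scanning instead of doing a second lookup by key.
> public class TailMapScan {
>     public static void main(String[] args) {
>         ConcurrentSkipListMap<String, Long> map =
>             new ConcurrentSkipListMap<String, Long>();
>         for (int i = 0; i < 10; i++) {
>             map.put("row" + i, (long) i);
>         }
>
>         String resumeFrom = "row4";
>         // tailMap is a live view; iteration starts near resumeFrom.
>         Iterator<Map.Entry<String, Long>> it =
>             map.tailMap(resumeFrom).entrySet().iterator();
>         while (it.hasNext()) {
>             Map.Entry<String, Long> e = it.next();
>             if (e.getValue() % 2 == 0) {
>                 it.remove();  // removes from the backing map directly
>             }
>         }
>         System.out.println(map);
>     }
> }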
>
> In the end, after our optimizations, the real-world workload Dhruba and I
> were working with was not all in-memory, so the bottleneck in production
> became the random reads (so increasing the block cache hit ratio is the
> focus) rather than CPU contention or HLog throughput.
>
> JG
>
> From: Akash Ashok [mailto:thehellmaker@gmail.com]
> Sent: Sunday, October 23, 2011 2:57 AM
> To: dev@hbase.apache.org
> Subject: Re: SILT - nice keyvalue store paper
>
> I was running some similar tests and came across a surprising finding. I
> compared reads and writes on ConcurrentSkipListMap (which the MemStore
> uses) and a synchronized TreeMap (which was literally a TreeMap
> synchronized), and executed concurrent reads, writes and deletes on both
> of them. Surprisingly, the synchronized TreeMap performed better, though
> just slightly better, than the ConcurrentSkipListMap which
> KeyValueSkipListSet uses.
>
> Here are the outputs of a few runs.
>
> Sometimes the difference was considerable:
> Using HBaseMap it took 20438ms
> Using TreeMap it took 11613ms
> Time Difference: 8825ms
>
> And sometimes the difference was negligible:
> Using HBaseMap it took 13370ms
> Using TreeMap it took 9482ms
> Time Difference: 3888ms
>
> I'm attaching the test java file which I wrote to test it. This might be a
> very minor difference, but it is still surprising considering the fact
> that ConcurrentSkipListMap uses fancy 2-level indexes which they say
> improve deletion performance.
>
> And here are the details about the test run:
> 100 threads each fetching 1,000,000 records
> 100 threads each adding 1,000,000 records
> 100 threads each deleting 1,000,000 records
> (reads, writes and deletes running simultaneously)
>
> Cheers,
> Akash A
>
> On Sun, Oct 23, 2011 at 3:25 AM, Stack wrote:
> > On Sat, Oct 22, 2011 at 2:41 PM, N Keywal wrote:
> > > I would think that the bottleneck for insert is the wal part?
> > > It would be possible to do a kind of memory list preparation during
> > > the wal insertion, and if the wal insertion is confirmed, do the
> > > insertion in the memory list. But it's strange to have the insertion
> > > in memory important vs. the insertion on disk...
> >
> > Yes, WAL is the long pole writing. But MemStore has issues too; Dhruba
> > says cpu above. Reading and writing it is also 'slow'.
> > St.Ack
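For anyone who wants to poke at Akash's comparison locally, a stripped-down sketch of that kind of test looks roughly like the following (thread counts, op counts and key space here are made up, not taken from his attached file):

    import java.util.Collections;
    import java.util.Random;
    import java.util.SortedMap;
    import java.util.TreeMap;
    import java.util.concurrent.ConcurrentSkipListMap;
    import java.util.concurrent.CountDownLatch;

    // N threads doing a mix of put/get/remove against a ConcurrentSkipListMap
    // vs. a Collections.synchronizedSortedMap-wrapped TreeMap.
    public class MapContentionBench {
        static final int THREADS = 100;
        static final int OPS_PER_THREAD = 100000;

        static long run(final SortedMap<Long, Long> map) throws InterruptedException {
            final CountDownLatch done = new CountDownLatch(THREADS);
            long start = System.nanoTime();
            for (int t = 0; t < THREADS; t++) {
                new Thread(new Runnable() {
                    public void run() {
                        Random rnd = new Random();
                        for (int i = 0; i < OPS_PER_THREAD; i++) {
                            long k = rnd.nextInt(1000000);
                            switch (i % 3) {  // interleave writes, reads and deletes
                                case 0:  map.put(k, k); break;
                                case 1:  map.get(k);    break;
                                default: map.remove(k);
                            }
                        }
                        done.countDown();
                    }
                }).start();
            }
            done.await();
            return (System.nanoTime() - start) / 1000000;
        }

        public static void main(String[] args) throws InterruptedException {
            // Per the HotSpot caveat above: run this a few times and alternate
            // the order of the two maps before trusting the numbers.
            System.out.println("CSLM:     " + run(new ConcurrentSkipListMap<Long, Long>()) + "ms");
            System.out.println("TreeMap:  " + run(Collections.synchronizedSortedMap(new TreeMap<Long, Long>())) + "ms");
        }
    }

The interesting part is how the gap moves as THREADS goes from 100 toward 1000, which is closer to the handler count on a production regionserver.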