From: Jonathan Gray <jgray@fb.com>
To: dev@hbase.apache.org
Subject: RE: SILT - nice keyvalue store paper
Date: Mon, 24 Oct 2011 00:05:26 +0000

Oh, and when running these experiments, you should look at the impact of the order in which they are run, whether you run them multiple times per JVM instance, etc. Basically, you need to be cognizant of the HotSpot optimizations the JVM is doing at runtime.
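A bare-bones way to keep that out of your numbers (a sketch with made-up names, not taken from Akash's test file): run the workload several times in the same JVM and only compare the rounds after the JIT has settled.

    import java.util.concurrent.ConcurrentSkipListMap;

    // Toy harness: repeat the same workload in one JVM and discard the
    // first rounds, so the numbers don't mostly measure interpretation
    // and JIT compilation. The workload below is just a stand-in.
    public class WarmupAwareBench {

        // Placeholder workload: single-threaded puts into a CSLM.
        static void workload() {
            ConcurrentSkipListMap<Long, Long> map = new ConcurrentSkipListMap<Long, Long>();
            for (long i = 0; i < 1000000; i++) {
                map.put(i, i);
            }
        }

        public static void main(String[] args) {
            for (int round = 0; round < 5; round++) {
                long start = System.nanoTime();
                workload();
                long ms = (System.nanoTime() - start) / 1000000;
                // Only the rounds after warm-up are worth comparing.
                String tag = (round < 2) ? " (warm-up, ignore)" : "";
                System.out.println("round " + round + ": " + ms + "ms" + tag);
            }
        }
    }

Also alternate which map you exercise first between runs (or fork a fresh JVM per map) so neither benefits from being compiled and profiled first.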
> -----Original Message-----
> From: Jonathan Gray [mailto:jgray@fb.com]
> Sent: Sunday, October 23, 2011 4:20 PM
> To: dev@hbase.apache.org
> Subject: RE: SILT - nice keyvalue store paper
>
> Very nice experiment, Akash. Keep getting your hands dirty and digging! :)
>
> I think your results might change if you bump the test up to 1000 threads
> or so. 100 threads can still perform okay when there's a global lock, but
> the contention at 1000 threads will kill you, and that's when CSLM should
> do much better. (1000 handler threads is approx. what I run with on RS in
> prod.) Though I am a bit surprised that at 100 threads the TreeMap was
> significantly faster. Your inconsistent results are a bit odd; you might
> try an order of magnitude more operations per thread. You might also
> gather some statistics about tree size and per-operation latency.
>
> I've done some isolated CSLM benchmarks in the past and have never been
> able to reproduce any of the slowness people suggest. I recall trying some
> impractically large MemStores and everything still being quite fast.
>
> Over in Cassandra, I believe they have a two-level CSLM, with the first
> map key being the row and then the columns for each row in their own CSLM.
> I've been told this is somewhat of a pain point for them. And keep in mind
> they have one shard/region per node, while we generally have several
> smaller MemStores on each node (tens to thousands). Not sure we would want
> to try that. There could be some interesting optimizations if you had very
> specific issues; for example, if you had a ton of reads to MemStore and
> not many writes, you could keep some kind of mirrored hashmap.
>
> And for writes, the WAL is definitely the latency bottleneck. But if you
> are doing lots of small operations, so your WALEdits are not large, and
> with some of the HLog batching features going into trunk, you end up with
> hundreds of requests per HLog sync. And although the syncs are higher
> latency, with batching you end up getting high throughput. And the
> bottleneck shifts.
>
> Each sync will take approx. 1-5ms, so let's say 250 requests per HLog sync
> batch at 4ms per sync: that's 62.5k req/sec (62.5k req/sec * 100 bytes/req
> = ~6MB/sec, very reasonable). If you're mixing in reads as well (or if
> you're doing increments, which do a read and a write), then this adds to
> the CPU usage and contention without adding to HLog throughput.
>
> All of a sudden the bottleneck becomes CPU/contention and not HLog latency
> or throughput. Highly concurrent increments/counters with a largely
> in-memory dataset can easily be CPU bottlenecked.
>
> For one specific application Dhruba and I worked on, we made some good
> improvements in CPU efficiency by reducing the number of operations and
> increasing efficiency on the CSLM: doing things like always taking a
> tailMap and working from that instead of starting at the root node, using
> an iterator() and taking advantage of the available remove() semantics, or
> simply just mutating things that are normally immutable :) Unfortunately,
> many of these optimizations were semi-horrid hacks and introduced things
> like ModifiableKeyValues, so they haven't all made their way to Apache.
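>
> In code, the tailMap/iterator part looks roughly like this (a toy sketch
> with made-up keys, not the actual patch):
>
> import java.util.Iterator;
> import java.util.Map;
> import java.util.concurrent.ConcurrentSkipListMap;
>
> // Resume scans from the last key seen via tailMap instead of
> // re-walking the skip list from the head, and delete through the
> // iterator while scanning instead of doing a second lookup by key.
> public class TailMapScan {
>     public static void main(String[] args) {
>         ConcurrentSkipListMap<String, Long> map =
>             new ConcurrentSkipListMap<String, Long>();
>         for (int i = 0; i < 10; i++) {
>             map.put("row" + i, (long) i);
>         }
>
>         String resumeFrom = "row4";
>         // tailMap is a live view; iteration starts near resumeFrom.
>         Iterator<Map.Entry<String, Long>> it =
>             map.tailMap(resumeFrom).entrySet().iterator();
>         while (it.hasNext()) {
>             Map.Entry<String, Long> e = it.next();
>             if (e.getValue() % 2 == 0) {
>                 it.remove();  // removes from the backing map directly
>             }
>         }
>         System.out.println(map);
>     }
> }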
>
> In the end, after our optimizations, the real-world workload Dhruba and I
> were working with was not all in-memory, so the bottleneck in production
> became the random reads (so increasing the block cache hit ratio is the
> focus) rather than CPU contention or HLog throughput.
>
> JG
>
> From: Akash Ashok [mailto:thehellmaker@gmail.com]
> Sent: Sunday, October 23, 2011 2:57 AM
> To: dev@hbase.apache.org
> Subject: Re: SILT - nice keyvalue store paper
>
> I was running some similar tests and came across a surprising finding. I
> compared reads and writes on ConcurrentSkipListMap (which the MemStore
> uses) and a synchronized TreeMap (which was literally a TreeMap
> synchronized), and executed concurrent reads, writes and deletes on both
> of them. Surprisingly, the synchronized TreeMap performed better, though
> just slightly better, than the ConcurrentSkipListMap which
> KeyValueSkipListSet uses.
>
> Here are the outputs of a few runs.
>
> Sometimes the difference was considerable:
> Using HBaseMap it took 20438ms
> Using TreeMap it took 11613ms
> Time Difference: 8825ms
>
> And sometimes the difference was negligible:
> Using HBaseMap it took 13370ms
> Using TreeMap it took 9482ms
> Time Difference: 3888ms
>
> I'm attaching the test java file which I wrote to test it. This might be a
> very minor difference, but it is still surprising considering the fact
> that ConcurrentSkipListMap uses fancy 2-level indexes which they say
> improve deletion performance.
>
> And here are the details about the test run:
> 100 threads each fetching 1,000,000 records
> 100 threads each adding 1,000,000 records
> 100 threads each deleting 1,000,000 records
> (reads, writes and deletes running simultaneously)
>
> Cheers,
> Akash A
>
> On Sun, Oct 23, 2011 at 3:25 AM, Stack wrote:
> > On Sat, Oct 22, 2011 at 2:41 PM, N Keywal wrote:
> > > I would think that the bottleneck for insert is the wal part?
> > > It would be possible to do a kind of memory list preparation during
> > > the wal insertion, and if the wal insertion is confirmed, do the
> > > insertion in the memory list. But it's strange to have the insertion
> > > in memory important vs. the insertion on disk...
> >
> > Yes, WAL is the long pole writing. But MemStore has issues too; Dhruba
> > says cpu above. Reading and writing it is also 'slow'.
> > St.Ack
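For anyone who wants to poke at Akash's comparison locally, a stripped-down sketch of that kind of test looks roughly like the following (thread counts, op counts and key space here are made up, not taken from his attached file):

    import java.util.Collections;
    import java.util.Random;
    import java.util.SortedMap;
    import java.util.TreeMap;
    import java.util.concurrent.ConcurrentSkipListMap;
    import java.util.concurrent.CountDownLatch;

    // N threads doing a mix of put/get/remove against a ConcurrentSkipListMap
    // vs. a Collections.synchronizedSortedMap-wrapped TreeMap.
    public class MapContentionBench {
        static final int THREADS = 100;
        static final int OPS_PER_THREAD = 100000;

        static long run(final SortedMap<Long, Long> map) throws InterruptedException {
            final CountDownLatch done = new CountDownLatch(THREADS);
            long start = System.nanoTime();
            for (int t = 0; t < THREADS; t++) {
                new Thread(new Runnable() {
                    public void run() {
                        Random rnd = new Random();
                        for (int i = 0; i < OPS_PER_THREAD; i++) {
                            long k = rnd.nextInt(1000000);
                            switch (i % 3) {  // interleave writes, reads and deletes
                                case 0:  map.put(k, k); break;
                                case 1:  map.get(k);    break;
                                default: map.remove(k);
                            }
                        }
                        done.countDown();
                    }
                }).start();
            }
            done.await();
            return (System.nanoTime() - start) / 1000000;
        }

        public static void main(String[] args) throws InterruptedException {
            // Per the HotSpot caveat above: run this a few times and alternate
            // the order of the two maps before trusting the numbers.
            System.out.println("CSLM:     " + run(new ConcurrentSkipListMap<Long, Long>()) + "ms");
            System.out.println("TreeMap:  " + run(Collections.synchronizedSortedMap(new TreeMap<Long, Long>())) + "ms");
        }
    }

The interesting part is how the gap moves as THREADS goes from 100 toward 1000, which is closer to the handler count on a production regionserver.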