hbase-dev mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject CSLM performance Was: SILT - nice keyvalue store paper
Date Mon, 24 Oct 2011 17:19:23 GMT
Akash:
Take a look at the following two possible replacements for CSLM:
https://github.com/mspiegel/lockfreeskiptree
https://github.com/nbronson/snaptree

About your test, I got advice from another expert:

- The JVM warm-up penalty is being accrued by the CSLM run.
- Use Thread.yield() to end a request; otherwise the active thread may be
able to run through its time slice without incurring much lock contention.
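
A rough sketch of folding both suggestions into a harness might look like
this (class, method and size choices are illustrative, not taken from your
attached test):

import java.util.concurrent.ConcurrentSkipListMap;

public class WarmedUpCslmBench {
    public static void main(String[] args) throws InterruptedException {
        ConcurrentSkipListMap<Long, Long> map = new ConcurrentSkipListMap<>();

        // Warm-up pass so JIT compilation is not charged to the measured run.
        runWriters(map, 10, 100_000);
        map.clear();

        long start = System.nanoTime();
        runWriters(map, 100, 1_000_000);
        System.out.println("measured run took "
            + (System.nanoTime() - start) / 1_000_000 + "ms");
    }

    static void runWriters(ConcurrentSkipListMap<Long, Long> map,
                           int nThreads, int opsPerThread) throws InterruptedException {
        Thread[] threads = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            final long base = (long) t * opsPerThread;
            threads[t] = new Thread(() -> {
                for (int i = 0; i < opsPerThread; i++) {
                    map.put(base + i, base + i);
                    // Yield at the end of each "request" so a thread cannot burn
                    // its whole time slice without ever contending for the map.
                    Thread.yield();
                }
            });
            threads[t].start();
        }
        for (Thread th : threads) {
            th.join();
        }
    }
}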

You can publish your code and results under HBASE-3992.

Thanks

On Sun, Oct 23, 2011 at 5:05 PM, Jonathan Gray <jgray@fb.com> wrote:

> Oh, and when running these experiments, you should look at the impact of
> the order they are run in, whether you run them multiple times per JVM
> instance, etc.  Basically, you need to be cognizant of the HotSpot
> optimizations the JVM is doing at runtime.
>
> > -----Original Message-----
> > From: Jonathan Gray [mailto:jgray@fb.com]
> > Sent: Sunday, October 23, 2011 4:20 PM
> > To: dev@hbase.apache.org
> > Subject: RE: SILT - nice keyvalue store paper
> >
> > Very nice experiment, Akash.  Keep getting your hands dirty and digging! :)
> >
> > I think your results might change if you bump the test up to 1000 threads or
> > so.  100 threads can still perform okay when there's a global lock, but the
> > contention at 1000 threads will kill you, and that's when CSLM should do much
> > better.  (1000 handler threads is approx. what I run with on RS in prod.)
> > Though I am a bit surprised that at 100 threads the TreeMap was significantly
> > faster.  Your inconsistent results are a bit odd; you might try an order of
> > magnitude more operations per thread.  You might also gather some
> > statistics about tree size and per-operation latency.
> >
> > I've done some isolated CSLM benchmarks in the past and have never been
> > able to reproduce any of the slowness people suggest.  I recall trying some
> > impractically large MemStores and everything still being quite fast.
> >
> > Over in Cassandra, I believe they have a two-level CSLM with the first map
> > key being the row and then the columns for each row in their own CSLM.
> > I've been told this is somewhat of a pain point for them.  And keep in mind
> > they have one shard/region per node and we generally have several smaller
> > MemStores on each node (tens to thousands).  Not sure we would want to
> > try that.  There could be some interesting optimizations if you had very
> > specific issues, like if you had a ton of reads to MemStore and not many
> > writes you could keep some kind of mirrored hashmap.
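> >
> > A rough sketch of that two-level shape, just to make it concrete (the types
> > are simplified placeholders, not HBase's actual KeyValue or Cassandra's
> > real code):
> >
> > import java.util.concurrent.ConcurrentSkipListMap;
> >
> > // Outer map keyed by row; each row holds its own sorted map of columns.
> > class TwoLevelMemStoreSketch {
> >     private final ConcurrentSkipListMap<String, ConcurrentSkipListMap<String, byte[]>> rows =
> >         new ConcurrentSkipListMap<>();
> >
> >     void put(String row, String column, byte[] value) {
> >         rows.computeIfAbsent(row, r -> new ConcurrentSkipListMap<>()).put(column, value);
> >     }
> >
> >     byte[] get(String row, String column) {
> >         ConcurrentSkipListMap<String, byte[]> columns = rows.get(row);
> >         return columns == null ? null : columns.get(column);
> >     }
> > }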
> >
> > And for writes, the WAL is definitely the latency bottleneck.  But if you are
> > doing lots of small operations, so your WALEdits are not large, and with some
> > of the HLog batching features going into trunk, you end up with hundreds of
> > requests per HLog sync.  And although the syncs are higher latency, with
> > batching you end up getting high throughput.  And the bottleneck shifts.
> >
> > Each sync will take approx. 1-5ms, so let's say 250 requests per HLog sync
> > batch, 4ms per sync, so 62.5k req/sec.  (62.5k req/sec * 100 bytes/req is
> > roughly 6MB/sec to the WAL, very reasonable.)  If you're mixing in reads as
> > well (or if you're doing increments, which do a read and a write), then this
> > adds to the CPU usage and contention without adding to HLog throughput.
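> >
> > Spelled out, that back-of-the-envelope math is just (same illustrative
> > numbers as above):
> >
> > public class WalThroughputEstimate {
> >     public static void main(String[] args) {
> >         int requestsPerSyncBatch = 250;   // requests grouped into one HLog sync
> >         double syncLatencyMs = 4.0;       // assumed latency of a single sync
> >         int bytesPerRequest = 100;        // small WALEdits
> >
> >         double syncsPerSec = 1000.0 / syncLatencyMs;                // 250 syncs/sec
> >         double requestsPerSec = syncsPerSec * requestsPerSyncBatch; // 62,500 req/sec
> >         double mbPerSec = requestsPerSec * bytesPerRequest / (1024.0 * 1024.0);
> >
> >         System.out.printf("%.0f req/sec, about %.1f MB/sec to the WAL%n",
> >             requestsPerSec, mbPerSec);
> >     }
> > }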
> >
> > All of a sudden the bottleneck becomes CPU/contention and not HLog
> > latency or throughput.  Highly concurrent increments/counters with a largely
> > in-memory dataset can easily be CPU bottlenecked.
> >
> > For one specific application Dhruba and I worked on, we made some good
> > improvements in CPU efficiency by reducing the number of operations and
> > increasing efficiency on the CSLM.  Doing things like always taking a tailMap
> > and working from that instead of starting at the root node, using an iterator()
> > and taking advantage of the available remove() semantics, or simply just
> > mutating things that are normally immutable :)  Unfortunately many of these
> > optimizations were semi-horrid hacks and introduced things like
> > ModifiableKeyValues, so they haven't all made their way to Apache.
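> >
> > For the tailMap/iterator part, the general idea looks roughly like this
> > (a simplified sketch with made-up pruning logic, not the actual patch):
> >
> > import java.util.Iterator;
> > import java.util.Map;
> > import java.util.concurrent.ConcurrentNavigableMap;
> > import java.util.concurrent.ConcurrentSkipListMap;
> >
> > class TailMapScanSketch {
> >     // Walk keys >= start via a tailMap view and prune as we go, instead of
> >     // issuing repeated lookups that each start from the top of the skip list.
> >     static int scanAndPrune(ConcurrentSkipListMap<String, byte[]> memstore, String start) {
> >         ConcurrentNavigableMap<String, byte[]> tail = memstore.tailMap(start, true);
> >         int seen = 0;
> >         Iterator<Map.Entry<String, byte[]>> it = tail.entrySet().iterator();
> >         while (it.hasNext()) {
> >             Map.Entry<String, byte[]> entry = it.next();
> >             seen++;
> >             if (entry.getValue().length == 0) {
> >                 it.remove();  // iterator remove() mutates the backing CSLM through the view
> >             }
> >         }
> >         return seen;
> >     }
> > }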
> >
> > In the end, after our optimizations, the real-world workload Dhruba and I
> > were working with was not all in-memory, so the bottleneck in production
> > became the random reads (so increasing the block cache hit ratio is the
> > focus) rather than CPU contention or HLog throughput.
> >
> > JG
> >
> > From: Akash Ashok [mailto:thehellmaker@gmail.com]
> > Sent: Sunday, October 23, 2011 2:57 AM
> > To: dev@hbase.apache.org
> > Subject: Re: SILT - nice keyvalue store paper
> >
> > I was running some similar tests and came across a surprising finding. I
> > compared reads and writes on ConcurrentSkipListMap (which the MemStore
> > uses) and a synchronized TreeMap (which was literally a TreeMap wrapped in
> > synchronization), and executed concurrent reads, writes and deletes on both
> > of them.
> > Surprisingly, the synchronized TreeMap performed better, though just slightly
> > better, than the ConcurrentSkipListMap which KeyValueSkipListSet uses.
> >
> > Here is the output of a few runs.
> >
> > Sometimes the difference was considerable:
> > Using HBaseMap it took 20438ms
> > Using TreeMap it took 11613ms
> > Time Difference: 8825ms
> >
> > And sometimes the difference was negligible:
> > Using HBaseMap it took 13370ms
> > Using TreeMap it took 9482ms
> > Time Difference: 3888ms
> >
> > I'm attaching the test Java file which I wrote to test it.
> > This might be a very minor difference, but it is still surprising considering
> > the fact that ConcurrentSkipListMap uses fancy two-level indexes which they
> > say improve the deletion performance.
> >
> > And here are the details about the test run:
> > 100 threads each fetching 1,000,000 records
> > 100 threads each adding 1,000,000 records
> > 100 threads each deleting 1,000,000 records
> > (reads, writes and deletes running simultaneously)
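> >
> > For anyone who does not want to open the attachment, the shape of the
> > comparison is roughly this (a simplified sketch with smaller sizes, not the
> > attached file):
> >
> > import java.util.Collections;
> > import java.util.SortedMap;
> > import java.util.TreeMap;
> > import java.util.concurrent.ConcurrentSkipListMap;
> >
> > public class MapContentionSketch {
> >     public static void main(String[] args) throws InterruptedException {
> >         time("ConcurrentSkipListMap", new ConcurrentSkipListMap<Integer, Integer>());
> >         time("synchronized TreeMap",
> >              Collections.synchronizedSortedMap(new TreeMap<Integer, Integer>()));
> >     }
> >
> >     // 10 writer, 10 reader and 10 deleter threads hammer the same map.
> >     static void time(String name, SortedMap<Integer, Integer> map) throws InterruptedException {
> >         long start = System.currentTimeMillis();
> >         Thread[] threads = new Thread[30];
> >         for (int t = 0; t < threads.length; t++) {
> >             final int id = t;
> >             threads[t] = new Thread(() -> {
> >                 for (int i = 0; i < 100_000; i++) {
> >                     int key = (id % 10) * 100_000 + i;
> >                     if (id < 10) {
> >                         map.put(key, key);      // writer
> >                     } else if (id < 20) {
> >                         map.get(key);           // reader
> >                     } else {
> >                         map.remove(key);        // deleter
> >                     }
> >                 }
> >             });
> >             threads[t].start();
> >         }
> >         for (Thread th : threads) {
> >             th.join();
> >         }
> >         System.out.println(name + " took " + (System.currentTimeMillis() - start) + "ms");
> >     }
> > }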
> >
> > Cheers,
> > Akash A
> > On Sun, Oct 23, 2011 at 3:25 AM, Stack <stack@duboce.net> wrote:
> > On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nkeywal@gmail.com> wrote:
> > > I would think that the bottleneck for insert is the WAL part?
> > > It would be possible to do a kind of memory list preparation during
> > > the WAL insertion, and if the WAL insertion is confirmed, do the
> > > insertion in the memory list. But it's strange for the insertion in
> > > memory to matter so much compared with the insertion on disk...
> > >
> > Yes, the WAL is the long pole for writing.  But MemStore has issues too;
> > Dhruba says CPU, above.  Reading and writing it is also 'slow'.
> > St.Ack
>
>
