hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Vladimir Rodionov <vrodio...@carrieriq.com>
Subject RE: SILT - nice keyvalue store paper
Date Mon, 24 Oct 2011 18:43:55 GMT
Jonathan, 1000 threads is a bad application design. They will kill you even w/o contention.
In my opinion, over-usage of a *modern*  Java concurrent framework
stuff w/o real understanding of a benefits is a major contributor to a poor performance and
poor scalability of a Java application. Before making any decision
on using a fancy data structure which can potentially affect application performance it is
always a good idea to run some benchmarks. 

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

From: Jonathan Gray [jgray@fb.com]
Sent: Sunday, October 23, 2011 4:20 PM
To: dev@hbase.apache.org
Subject: RE: SILT - nice keyvalue store paper

Very nice experiment, Akash.  Keep getting your hands dirty and digging!  :)

I think your results might change if you bump the test up to 1000 threads or so.  100 threads
can still perform okay when there's a global lock but the contention at 1000 threads will
kill you and that's when CSLM should do much better.  (1000 handler threads is approx. what
I run with on RS in prod).  Though I am a bit surprised that at 100 threads the TreeMap was
significantly faster.  Your inconsistent results are a bit odd, you might try an order of
magnitude more operations per thread.  You might also gather some statistics about tree size
and per operation latency.

I've done some isolated CSLM benchmarks in the past and have never been able to reproduce
any of the slowness people suggest.  I recall trying some impractically large MemStores and
everything still being quite fast.

Over in Cassandra, I believe they have a two-level CSLM with the first map key being the row
and then the columns for each row in their own CSLM.  I've been told this is somewhat of a
pain point for them.  And keep in mind they have one shard/region per node and we generally
have several smaller MemStores on each node (tens to thousands).  Not sure we would want to
try that.  There could be some interesting optimizations if you had very specific issues,
like if you had a ton of reads to MemStore and not many writes you could keep some kind of
mirrored hashmap.

And for writes, the WAL is definitely the latency bottleneck.  But if you are doing lots of
small operations, so your WALEdits are not large, and with some of the HLog batching features
going in to trunk, you end up with hundreds of requests per HLog sync.  And although the syncs
are higher latency, with batching you end up getting high throughput.  And the bottleneck

Each sync will take approx. 1-5ms, so let's say 250 requests per HLog sync batch, 4ms per
sync, so 62.5k req/sec.  (62.5k * 100 bytes/req = 600K/sec, very reasonable).  If you're mixing
in reads as well (or if you're doing increments which do a read and write), then this adds
to the CPU usage and contention without adding to HLog throughput.

All of a sudden the bottleneck becomes CPU/contention and not HLog latency or throughput.
 Highly concurrent increments/counters with a largely in-memory dataset can easily be CPU

For one specific application Dhruba and I worked on, we made some good improvements in CPU
efficiency by reducing the number of operations and increasing efficiency on the CSLM.  Doing
things like always taking a tailMap and working from that instead of starting at the root
node, using an iterator() and taking advantage of the available remove() semantics, or simply
just mutating things that are normally immutable :)  Unfortunately many of these optimizations
were semi-horrid hacks and introduced things like ModifiableKeyValues, so they all haven't
made their way to apache.

In the end, after our optimizations, the real world workload Dhruba and I were working with
was not all in-memory so the bottleneck in production became the random reads (so increasing
the block cache hit ratio is the focus) rather than CPU contention or HLog throughput.


From: Akash Ashok [mailto:thehellmaker@gmail.com]
Sent: Sunday, October 23, 2011 2:57 AM
To: dev@hbase.apache.org
Subject: Re: SILT - nice keyvalue store paper

I was running some similar tests and came across a surprising finding. I compared reads and
write on ConcurrentSkipListMap ( which the memstore uses) and synchronized TreeMap ( Which
was literally treemap synchronized). Executed concurrent reads, writes and deletes on both
of them.
Surprisingly synchronized treeMap performed better, though just slightly better, than ConcurrentSkipListMap
which KeyValueSkipListSet uses.

Here are the output of a few runs

Sometimes the difference was considerable
Using HBaseMap it took 20438ms
Using TreeMap it took 11613ms
Time Difference:8825ms

And sometimes the difference was negligible
Using HBaseMap it took 13370ms
Using TreeMap it took 9482ms
Time Difference:3888ms

I've attaching the test  java file which I wrote to test it.
This might be a very minor differece but still surprising considering the fact that ConcurrentSkipListMap
uses fancy 2 level indexes which they say improves the deletion performance.

And here are the details about the test run.
100 Threads each fetching 1,000,000 records
100 threads each adding 1,000,000 records.
100 threads each deletin 1,000,000 records
( Reads, Writes and deletes simultaneously )

Akash A
On Sun, Oct 23, 2011 at 3:25 AM, Stack <stack@duboce.net<mailto:stack@duboce.net>>
On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nkeywal@gmail.com<mailto:nkeywal@gmail.com>>
> I would think that the bottleneck for insert is the wal part?
> It would be possible to do a kind of memory list preparation during the wal
> insertion, and if the wal insertion is confirmed, do the insertion in the
> memory list. But it's strange to have the insertion in memory important vs.
> the insertion on disk...
Yes, WAL is the long pole writing.  But MemStore has issues too;
Dhruba says cpu above.  Reading and writing it is also 'slow'.

View raw message