hbase-user mailing list archives

From Dave Latham <lat...@davelink.net>
Subject Re: scaling a low latency service with HBase
Date Mon, 22 Oct 2012 23:30:54 GMT
> Here are a few of my thoughts:
> If possible, you might want to localize your data to a few regions and then
> maybe have exclusive access to those regions. This way, external load will not
> impact you. I have heard that the write penalty of SSDs is quite high, but I
> think they will still be better than spinning disks. Also (I read a while back),
> with SSDs you get a quota of maximum possible writes, so if you are write
> heavy, it may be an issue.

If the data lives only on a few regions, then it is only on a few
servers, which means it won't fit in RAM, so it comes back to SSDs.  I
have also read that there is a limited number of lifetime writes on
SSDs, and I'm very interested in how that interacts with HBase's write
pipeline (which was designed for spinning disks).  I would imagine
that each HLog sync would cause an SSD write at each DataNode in the
pipeline.  In that case I would expect wear leveling to give
sufficient life.
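
As a rough sanity check, a back-of-envelope estimate of WAL write volume against
a drive's endurance rating looks like the following. All the inputs are
assumptions pulled from the numbers quoted below (node count and the 1 PB rating
especially), and it ignores compaction and memstore flush rewrites, which add
their own write amplification:

    // Back-of-envelope WAL write volume vs. SSD endurance. All inputs are
    // assumptions: 2k writes/sec of ~60-byte KVs, 3 replicas in the HLog
    // pipeline, 20 SSD-backed DataNodes, and a 1 PB lifetime-write rating.
    public class WalEnduranceEstimate {
        public static void main(String[] args) {
            double writesPerSec = 2_000;      // current write rate from the thread
            double bytesPerEdit = 60;         // ~20 byte key + ~40 byte value
            double replicas = 3;              // one WAL copy per DataNode in the pipeline
            double ssdNodes = 20;             // assumed number of SSD-backed nodes
            double ratedLifetimeBytes = 1e15; // assumed 1 PB endurance rating per drive

            double bytesPerDayPerNode =
                    writesPerSec * bytesPerEdit * replicas * 86_400 / ssdNodes;
            System.out.printf("~%.1f GB/day of WAL per node, ~%.0f years to hit the rating%n",
                    bytesPerDayPerNode / 1e9,
                    ratedLifetimeBytes / bytesPerDayPerNode / 365);
        }
    }

Even at 10x the write rate, the raw WAL volume stays small next to that kind of
rating, which is why I'd expect wear leveling to be enough; compactions are the
part I'd want to measure.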

> I would presume any solution like a cache built within HBase will suffer from
> the same issues you described. OTOH, external caching can help, but then you
> need to invest there and maintain cache-to-source consistency - might be
> another issue.
> If you are just doing KV lookups and no range scans, why not just use a KV
> store like Cassandra, or maybe explore other NoSQL solutions like Mongo, etc.?

Are there any that would have particular advantages when it came to SSDs?

> If your data lookups exhibit temporal locality, external, client-side cache
> pools may help.
> My 2c,
> Abhishek
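
On the external / client-side caching suggestions: a minimal read-through sketch
on top of the client API might look roughly like the following (the table,
family, and qualifier names are placeholders, and the Guava cache is just one
option). The hard part is exactly what you flag - every writer has to
invalidate or write through, or the cache drifts from the source.

    import java.io.IOException;

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ReadThroughCache {
        // Placeholder schema - substitute the real table layout.
        private static final byte[] FAMILY = Bytes.toBytes("d");
        private static final byte[] QUALIFIER = Bytes.toBytes("v");

        private final HTable table;
        private final Cache<String, byte[]> cache = CacheBuilder.newBuilder()
                .maximumSize(1_000_000)   // bound heap use; tune to the working set
                .build();

        public ReadThroughCache(Configuration conf, String tableName) throws IOException {
            this.table = new HTable(conf, tableName);
        }

        public byte[] get(byte[] key) throws IOException {
            String cacheKey = Bytes.toStringBinary(key);
            byte[] cached = cache.getIfPresent(cacheKey);
            if (cached != null) {
                return cached;
            }
            // Miss: fall back to a single Get against the region server.
            Result r = table.get(new Get(key));
            byte[] value = r.getValue(FAMILY, QUALIFIER);
            if (value != null) {
                cache.put(cacheKey, value);
            }
            return value;
        }

        public void invalidate(byte[] key) {
            // Every writer must call this (or write through), otherwise the
            // cache serves stale values - the consistency cost mentioned above.
            cache.invalidate(Bytes.toStringBinary(key));
        }
    }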
> -----Original Message-----
> From: ddlatham@gmail.com [mailto:ddlatham@gmail.com] On Behalf Of Dave Latham
> Sent: Friday, October 19, 2012 4:31 PM
> To: user@hbase.apache.org
> Subject: scaling a low latency service with HBase
> I need to scale an internal service / datastore that is currently hosted on an
> HBase cluster and wanted to ask for advice from anyone out there who may have
> some to share.  The service does simple key value lookups on 20 byte keys to
> 20-40 byte values.  It currently has about 5 billion entries (200GB), and
> processes about 40k random reads per second and about 2k random writes per
> second.  It currently delivers a median response at 2ms, 90% at 20ms, 99% at
> 200ms, 99.5% at 5000ms - but the mean is 58ms, which is no longer meeting our
> needs very well.  It is persistent and highly available.  I need to measure
> its working set more closely, but I believe that around 20-30% (randomly
> distributed) of the data is accessed each day.  I want a system that can scale
> to at least 10x current levels (50 billion entries - 2TB, 400k requests per
> second) and achieve a mean < 5ms (ideally 1-2ms) and 99.5% < 50ms response
> time for reads while maintaining persistence and reasonably high availability
> (99.9%).  Writes would ideally be in the same range, but we could probably
> tolerate a mean more in the 20-30ms range.
> Clearly for that latency, spinning disks won't cut it.  The current service is
> running out of an HBase cluster that is shared with many other things, and it
> degrades when those other things hit the disk and network hard.  The cluster
> has hundreds of nodes and this data fits in a small slice of block cache
> across most of them.  The concerns are that its performance is impacted by
> other loads and that as it continues to grow there may not be enough space in
> the current cluster's shared block cache.
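
A quick check on the spindle math behind that, with an assumed ~150 random
IOPS per 7.2k-15k RPM drive:

    // Why spinning disks don't work here: assumed ~150 random IOPS per spindle
    // against the 10x read target of 400k requests/sec.
    public class SpindleEstimate {
        public static void main(String[] args) {
            long targetReadsPerSec = 400_000;  // 10x read target from above
            long iopsPerSpindle = 150;         // assumed mid-range spinning drive
            System.out.println("spindles needed if every read misses cache: "
                    + targetReadsPerSec / iopsPerSpindle);   // ~2666
        }
    }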
> So I'm looking for something that will serve out of memory (backed by disk for
> persistence) or from SSDs.  A few questions that I would love to hear answers
> for:
>  - Does HBase sound like a good match as this grows?
>  - Does anyone have experience running HBase over SSDs?  What sort of latency
> and requests per second have you been able to achieve?
>  - Is anyone using a row cache on top of (or built into) HBase?  I think
> there's been a bit of discussion on occasion but it hasn't gone very far.
> There would be some overhead for each row.  It seems that if we were to
> continue to rely on memory + disks this could reduce the memory required.
>  - Does anyone have alternate suggestions for such a service?
> Dave
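
For the row cache question above, a very rough sizing sketch at the 10x target.
Everything here is an assumption - especially the per-entry overhead, which
would need measuring on a real heap:

    // Rough sizing of a row cache vs. a block cache at the 10x target. The
    // per-entry overhead is an assumed figure for an on-heap map entry; the
    // block cache line assumes random access touches nearly every block, so it
    // tends toward the full data set.
    public class CacheSizingEstimate {
        public static void main(String[] args) {
            long entries = 50_000_000_000L;   // 10x target: 50 billion rows (~2TB)
            double workingSet = 0.25;         // ~20-30% of rows touched per day
            long keyBytes = 20;
            long valueBytes = 20;             // low end of 20-40 bytes, matching the ~2TB figure
            long perEntryOverhead = 64;       // assumed map/object overhead per cached row

            double rowCacheBytes = entries * workingSet
                    * (keyBytes + valueBytes + perEntryOverhead);
            double blockCacheBytes = entries * (double) (keyBytes + valueBytes);

            System.out.printf("row cache, working set only: ~%.1f TB%n", rowCacheBytes / 1e12);
            System.out.printf("block cache, random access:  ~%.1f TB%n", blockCacheBytes / 1e12);
        }
    }

So the savings over keeping all the blocks hot are real but not dramatic unless
access within the day is more skewed than "randomly distributed" suggests.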
