hbase-issues mailing list archives

From "Jonathan Hsieh (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-11682) Explain hotspotting
Date Mon, 18 Aug 2014 21:13:19 GMT

    [ https://issues.apache.org/jira/browse/HBASE-11682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101308#comment-14101308 ]

Jonathan Hsieh commented on HBASE-11682:

+      <para>Salting in this sense has nothing to do with cryptography, but refers to
adding random
+        data to the start of a row key. In this case, salting refers to adding a prefix to
the row
+        key to cause it to sort differently than it otherwise would. Salting can be helpful
if you
+        have a few keys that come up over and over, along with other rows that don't fit
those keys.
+        In that case, the regions holding rows with the "hot" keys would be overloaded, compared to
+        the other regions. Salting completely removes ordering, so is often a poorer choice than
+        hashing. Using totally random row keys for data which is accessed sequentially would
+        the benefit of HBase's row-sorting algorithm and cause very poor performance, as
each get or
+        scan would need to query all regions.</para>

I don't think this salting example is correct about the ramifications.  Both Nick and I agree
that salting is putting some random value in front of the actual value.  This means that
instead of one sorted list of entries, we'd have n sorted lists of entries if the cardinality
of the salt is n.

Example:  naively we have rowkeys like this:


If we use a 4-way salt (a, b, c, d), we could end up with the data re-sorted like this:


Let's say we add some new values to row foo0003.  It could get salted with a new salt, let's
say 'c'.
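
As a rough illustration of the random salting described above, here is a minimal Python sketch. It is not HBase code: the row keys, the `salt_key` helper and the "salt-key" format are assumptions made up for the example, and a sorted Python list stands in for the table.

```python
import random

SALTS = ["a", "b", "c", "d"]  # hypothetical 4-way salt


def salt_key(row_key, rng):
    """Prefix the row key with a randomly chosen salt."""
    return f"{rng.choice(SALTS)}-{row_key}"


rng = random.Random(0)  # seeded only to make the sketch repeatable
row_keys = ["foo0001", "foo0002", "foo0003", "foo0004"]  # illustrative keys

# The store sorts by the full (salted) key, so the rows now form up to
# four independent sorted runs, one per salt value.
table = sorted(salt_key(k, rng) for k in row_keys)

# A later write to foo0003 draws a fresh salt, which may or may not match
# the salt used for the earlier write to the same logical row.
table = sorted(table + [salt_key("foo0003", rng)])

for key in table:
    print(key)
```

Because the salt is drawn at write time, the same logical row can end up under more than one prefix, which is what lets writes to a single hot row spread out.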


To read, we could still get things back in the original order, but we'd need a reader
starting from each salt in parallel to get the rows back in order (and we'd likely need to do
some coalescing of foo0003 to combine the a-foo0003 and c-foo0003 rows back into one).  The
effect in this situation is that we could be writing with 4x the throughput, since we would
be on 4 different machines (assuming that a, b, c, and d are balanced onto different machines).
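
The read path described above can be sketched as a k-way merge over one sorted stream per salt. This is a single-threaded Python illustration under assumed names (`strip_salt`, `scan_in_original_order`, and a dict of lists standing in for the per-salt regions); a real client would run the per-salt scans in parallel.

```python
import heapq
from itertools import groupby


def strip_salt(salted_key):
    """Drop the 'x-' salt prefix to recover the original row key."""
    return salted_key.split("-", 1)[1]


def scan_in_original_order(regions):
    """Merge per-salt sorted runs back into original row-key order.

    regions maps salt -> list of (salted_key, value), each list sorted by key.
    Rows that share an original key under different salts are coalesced.
    """
    streams = [
        ((strip_salt(k), v) for k, v in rows)
        for rows in regions.values()
    ]
    merged = heapq.merge(*streams, key=lambda kv: kv[0])
    for key, group in groupby(merged, key=lambda kv: kv[0]):
        yield key, [v for _, v in group]


# Illustrative data: foo0003 was written under two different salts.
regions = {
    "a": [("a-foo0003", "v1")],
    "b": [("b-foo0001", "v0")],
    "c": [("c-foo0003", "v2"), ("c-foo0004", "v3")],
    "d": [("d-foo0002", "v4")],
}
result = list(scan_in_original_order(regions))
```

The merge only works because each per-salt run is already sorted; stripping a shared prefix within one run does not change its order.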

Nick's point of view (please correct me if I am wrong) is that you could "salt" the original
row key with a one-way hash so that foo0003 would always get salted with 'a'.  This would
spread row keys that are lexicographically close (foo0001 and foo0002) to different machines,
which could help reduce contention and increase overall throughput, but it would never allow
a single row to have 4x the throughput like the other approach.
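
For comparison, a hash-derived prefix might look like the following sketch. The choice of SHA-256 and the `hashed_salt_key` helper are assumptions for illustration only; the point is that the prefix is a pure function of the row key.

```python
import hashlib

SALTS = ["a", "b", "c", "d"]  # hypothetical 4-way prefix set


def hashed_salt_key(row_key):
    """Derive the prefix from a one-way hash of the row key itself."""
    digest = hashlib.sha256(row_key.encode("utf-8")).digest()
    salt = SALTS[digest[0] % len(SALTS)]
    return f"{salt}-{row_key}"


# The same row key always maps to the same prefixed key, so a point read
# can recompute the full key instead of probing every prefix.
first_write = hashed_salt_key("foo0003")
second_write = hashed_salt_key("foo0003")
```

All writes to one row still land in one place, so a single hot row gets no extra write throughput; only neighbouring keys are spread apart.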

+      <para>Hashing refers to applying a random one-way function to the row key, such
that a
+        particular row always gets the same arbitrary value applied. This preserves the sort order
+        so that scans are effective, but spreads out load across a region. One example where
+        is the right strategy would be if for some reason, a large proportion of rows started with
+        the same letter. Normally, these would all be sorted into the same region. You can
apply a
+        hash to artificially differentiate them and spread them out.</para>

Hashing actually totally trashes the sort order.  In fact, the goal of hashing is to disperse
entries that are near each other lexicographically as evenly as possible.
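
A small Python sketch of this effect, assuming SHA-256 as the hash and made-up sequential row keys: sorting by the hashed key yields an order unrelated to the original lexicographic order.

```python
import hashlib


def hashed_key(row_key):
    """Replace the row key with a hex digest of itself (assumed scheme)."""
    return hashlib.sha256(row_key.encode("utf-8")).hexdigest()


# 100 sequential, lexicographically adjacent row keys (illustrative).
row_keys = [f"foo{i:04d}" for i in range(1, 101)]

# Order in which a store sorted by hashed key would hold these rows.
by_hash = sorted(row_keys, key=hashed_key)

# Position of each original key in the hashed ordering.
position = {key: idx for idx, key in enumerate(by_hash)}
print(position["foo0001"], position["foo0002"], position["foo0003"])
```

A range scan over foo0001..foo0100 in the original key space therefore has no contiguous counterpart in the hashed key space.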

> Explain hotspotting
> -------------------
>                 Key: HBASE-11682
>                 URL: https://issues.apache.org/jira/browse/HBASE-11682
>             Project: HBase
>          Issue Type: Task
>          Components: documentation
>            Reporter: Misty Stanley-Jones
>            Assignee: Misty Stanley-Jones
>         Attachments: HBASE-11682-1.patch, HBASE-11682.patch, HBASE-11682.patch

This message was sent by Atlassian JIRA
