hbase-issues mailing list archives

From "Jonathan Hsieh (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-11682) Explain hotspotting
Date Mon, 18 Aug 2014 21:13:19 GMT

    [ https://issues.apache.org/jira/browse/HBASE-11682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101308#comment-14101308 ]

Jonathan Hsieh commented on HBASE-11682:
----------------------------------------

{code}
+      <para>Salting in this sense has nothing to do with cryptography, but refers to adding random
+        data to the start of a row key. In this case, salting refers to adding a prefix to the row
+        key to cause it to sort differently than it otherwise would. Salting can be helpful if you
+        have a few keys that come up over and over, along with other rows that don't fit those keys.
+        In that case, the regions holding rows with the "hot" keys would be overloaded, compared to
+        the other regions. Salting completely removes ordering, so is often a poorer choice than
+        hashing. Using totally random row keys for data which is accessed sequentially would remove
+        the benefit of HBase's row-sorting algorithm and cause very poor performance, as each get or
+        scan would need to query all regions.</para>
{code}

I don't think this salting example is correct about the ramifications.  Both Nick and I agree
that salting means putting some random value in front of the actual value.  This means that
instead of one sorted list of entries, we'd have n sorted lists of entries if the cardinality
of the salt is n.

Example: naively, we have rowkeys like this:

foo0001
foo0002
foo0003
foo0004

If we use a 4-way salt (a,b,c,d), we could end up with data re-sorted like this:

a-foo0003
b-foo0001
c-foo0004
d-foo0002

Let's say we add some new values to row foo0003.  It could get salted with a different salt,
let's say 'c'.

a-foo0003
b-foo0001
*c-foo0003*
c-foo0004
d-foo0002

To read, we could still get things back in the original order, but we'd need a reader starting
from each salt in parallel to get the rows back in order (and we'd likely need to do some
coalescing of foo0003 to combine the a-foo0003 and c-foo0003 rows back into one).  The effect
in this situation is that we could now be writing with 4x the throughput, since we would be
on 4 different machines (assuming that a, b, c, and d are balanced onto different machines).
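
As a rough sketch of the write and read paths described above (Python, with a plain dict
standing in for an HBase table; salted_put and ordered_read are made-up names for
illustration, not HBase APIs):

```python
import heapq
import random

SALTS = ["a", "b", "c", "d"]  # the 4-way salt from the example above


def salted_put(table, row_key, value, rng):
    """Write one value under a randomly chosen salt prefix."""
    salt = rng.choice(SALTS)
    table.setdefault(f"{salt}-{row_key}", []).append(value)


def ordered_read(table):
    """Return (row_key, values) pairs in the original row-key order."""
    # One sorted "scanner" per salt prefix, with the prefix stripped off.
    scanners = []
    for salt in SALTS:
        prefix = salt + "-"
        rows = [(k[len(prefix):], v) for k, v in table.items() if k.startswith(prefix)]
        scanners.append(sorted(rows, key=lambda kv: kv[0]))

    # Merge the n sorted lists and coalesce rows that were written under
    # more than one salt (a-foo0003 and c-foo0003 become one foo0003).
    merged = {}
    for key, values in heapq.merge(*scanners, key=lambda kv: kv[0]):
        merged.setdefault(key, []).extend(values)
    return list(merged.items())
```

Every ordered read has to touch all n salt prefixes, which is the price paid for spreading
writes to a single row across n places.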

Nick's point of view (please correct me if I am wrong) is that you could "salt" the original
row key with a one-way hash so that foo0003 would always get salted with 'a'.  This would
spread rowkeys that are lexicographically close (foo0001 and foo0002) to different machines,
which could help reduce contention and increase overall throughput, but would never allow
a single row to have 4x the throughput like the other approach.
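
A minimal sketch of that hash-derived prefix, assuming SHA-256 as the one-way hash and the
same four prefixes (hashed_salt_key is a hypothetical helper, not something from the patch
or from HBase):

```python
import hashlib

SALTS = "abcd"  # the 4-way salt from the example above


def hashed_salt_key(row_key):
    """Prefix row_key with a salt derived from a hash of the key itself."""
    digest = hashlib.sha256(row_key.encode("utf-8")).digest()
    salt = SALTS[digest[0] % len(SALTS)]  # same key always maps to the same salt
    return f"{salt}-{row_key}"
```

Because the prefix is a function of the key, a reader can recompute it and go straight to
the one salted row, but all writes to that row also land in the same place.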

{code}
+      <para>Hashing refers to applying a random one-way function to the row key, such that a
+        particular row always gets the same arbitrary value applied. This preserves the sort order
+        so that scans are effective, but spreads out load across a region. One example where hashing
+        is the right strategy would be if for some reason, a large proportion of rows started with
+        the same letter. Normally, these would all be sorted into the same region. You can apply a
+        hash to artificially differentiate them and spread them out.</para>
{code}

Hashing actually totally trashes the sort order -- in fact, the goal of hashing is to disperse
entries that are near each other lexicographically as evenly as possible.
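
A quick way to see this, sketched in Python with SHA-256 standing in for whatever hash
would actually be used (hashed_key is a made-up name for illustration):

```python
import hashlib


def hashed_key(row_key):
    """Replace the row key with a hex digest of itself."""
    return hashlib.sha256(row_key.encode("utf-8")).hexdigest()


# 100 lexicographically adjacent row keys: foo0001 .. foo0100
keys = [f"foo{i:04d}" for i in range(1, 101)]

# The order the rows would be stored in if the hash were used as the row key.
by_hash = sorted(keys, key=hashed_key)
```

Comparing by_hash with keys shows neighbouring rows scattered across the keyspace, so a
range scan over foo0001..foo0100 no longer maps to a contiguous run of stored rows.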

> Explain hotspotting
> -------------------
>
>                 Key: HBASE-11682
>                 URL: https://issues.apache.org/jira/browse/HBASE-11682
>             Project: HBase
>          Issue Type: Task
>          Components: documentation
>            Reporter: Misty Stanley-Jones
>            Assignee: Misty Stanley-Jones
>         Attachments: HBASE-11682-1.patch, HBASE-11682.patch, HBASE-11682.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)
