hbase-user mailing list archives

From Ian Varley <ivar...@salesforce.com>
Subject Re: crafting your key - scan vs. get
Date Thu, 18 Oct 2012 20:59:09 GMT
Hi Neil,

Mike summed it up well, as usual. :) Your choices of where to describe this "dimension" of
your data (a one-to-many between users and events) are:

 - one row per event
 - one row per user, with events as columns
 - one row per user, with events as versions on a single cell

The first two are the best choices, since the third is sort of a perversion of the time dimension
(it isn't one thing that's changing, it's many things over time), and might make things counter-intuitive
when combined with deletes, compaction, etc. You can do it, but caveat emptor. :)

Since you have on the order of 100s or 1000s of events per user, it's reasonable to use the 2nd option (columns).
And with 1k cell sizes, even extreme cases (thousands of events) won't kill you.
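For concreteness, a toy sketch of the tall (row-per-event) and wide (column-per-event) layouts, modeled as plain Python dicts. The `e_<timestamp>` qualifier naming is an assumption for illustration; real HBase row keys, qualifiers, and values are byte arrays.

```python
# Tall layout: one row per event; the key embeds the user ID plus a
# unique suffix.
tall = {
    "AAAAAA9999": {"cf:mycf": "myval1"},
    "AAAAAA8888": {"cf:mycf": "myval2"},
}

# Wide layout: one row per user; each event is its own column, so a
# single-row read (and a single atomic Put) covers all of a user's events.
wide = {
    "AAAAAA": {"cf:e_1350345600": "myval1", "cf:e_1350259200": "myval2"},
}

assert len(wide["AAAAAA"]) == 2  # both events come back in one row read
```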

That said, the main plus you get out of using columns over rows is ACID properties; you could
get & set all the stuff for a single user atomically if it's columns in a single row,
but not if it's separate rows. That's nice, but I'm guessing you probably don't need to do
that, and instead would write out the events as they happen (i.e., you would rarely be doing
PUTs for multiple events for the same user at the same time, right?).

In theory, tall tables (the row-wise model) should have a slight performance advantage over
wide tables (the column-wise model), all other things being equal; the shape of the data is
nearly the same, but the row-wise version doesn't have to do any work preserving consistency.
Your informal tests of GET vs. SCAN performance seem a little suspect, since a GET is actually
implemented as a one-row SCAN; but the devil's in the details, so if you can reproduce the
difference with data that's otherwise identical, raise it on the dev list and people will
take a look.

The key thing is to try it for yourself and see. :)


PS - Sorry Mike was rude to you in his response. Your question was well-phrased and not at
all boring. Mike, you can explain all you want, but saying "Your question is boring" is straight
up rude; please don't do that.

From: Neil Yalowitz <neilyalowitz@gmail.com>
Date: Tue, Oct 16, 2012 at 2:53 PM
Subject: crafting your key - scan vs. get
To: user@hbase.apache.org

Hopefully this is a fun question.  :)

Assume you could architect an HBase table from scratch and you were
choosing between the following two key structures.


The first structure creates a unique row key for each PUT.  The rows are
events related to a user ID.  There may be up to several hundred events for
each user ID (probably not thousands, an average of perhaps ~100 events per
user).  Each key would be made unique with a reverse-order-timestamp or
perhaps just random characters (we don't particularly care about using ROT
for sorting newest here).

AAAAAA + some-unique-chars
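One common way to generate the reverse-order-timestamp suffix is to subtract the event timestamp from the maximum long value, so newer events sort first. A sketch (the 19-digit zero-padding, wide enough for any 63-bit value, is an assumption; any fixed-width encoding that preserves lexicographic order would do):

```python
# Sketch: row key = user ID + reverse-order timestamp, so a user's
# newest events sort first within the user's key range.
MAX_LONG = 2**63 - 1

def event_row_key(user_id: str, ts: int) -> str:
    # Zero-pad to a fixed width so lexicographic order matches numeric order.
    return user_id + str(MAX_LONG - ts).zfill(19)

newer = event_row_key("AAAAAA", 1350345600)
older = event_row_key("AAAAAA", 1350172800)
assert newer < older  # the newer event sorts first
```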

The table will look like this:

key              cf:mycf    ts
AAAAAA9999...    myval1     1350345600
AAAAAA8888...    myval2     1350259200
AAAAAA7777...    myval3     1350172800

Retrieving these values will use a Scan with startRow and stopRow.  In
hbase shell, it would look like:

hbase> scan 'mytable',{STARTROW=>'AAAAAA', STOPROW=>'AAAAAA_'}
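The stop row in a prefix scan is exclusive, so appending '_' works here because '_' sorts after the digits used in the key suffixes. A more general way, sketched below, is to increment the last byte of the prefix (assumes that byte is not 0xFF):

```python
# Sketch: derive an exclusive stop row for a prefix scan, so the scan
# covers exactly the keys that begin with the given prefix.
def stop_row_for_prefix(prefix: bytes) -> bytes:
    p = bytearray(prefix)
    p[-1] += 1  # bump the last byte; assumes it is not 0xFF
    return bytes(p)

assert stop_row_for_prefix(b"AAAAAA") == b"AAAAAB"
# Every key with the prefix falls in [startrow, stoprow):
assert b"AAAAAA" <= b"AAAAAA9999..." < b"AAAAAB"
```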


The second structure choice uses only the user ID as the key and relies on
row versions to store all the events.  For example:

key       cf:mycf    ts
AAAAAA    myval1     1350345600
AAAAAA    myval2     1350259200
AAAAAA    myval3     1350172800

Retrieving these values will use a Get with VERSIONS = somebignumber.  In
hbase shell, it would look like:

hbase> get 'mytable','AAAAAA',{COLUMN=>'cf:mycf', VERSIONS=>999}

...although this probably violates a comment in the HBase documentation:

"It is not recommended setting the number of max versions to an exceedingly
high level (e.g., hundreds or more) unless those old values are very dear
to you because this will greatly increase StoreFile size."

...found here: http://hbase.apache.org/book/schema.versions.html

So, are there any performance considerations between Scan vs. Get in this
use case?  Which choice would you go for?

Neil Yalowitz
