hbase-user mailing list archives

From Joe Pallas <joseph.pal...@oracle.com>
Subject Re: HBase column performance and schema design advice
Date Tue, 26 Apr 2011 22:40:27 GMT
I got into more detail on IRC with jdcryans after sending this message, and he clarified that
the behavior I'm seeing is not what's expected.  He came up with a theory about the cause
and a way to test it.  The theory was that this is a problem with MemStore and the in-memory
representation of recent updates.  The test was to do a flush of the table after doing all
the updates and before doing the retrievals.  My test confirmed his theory: after the flush,
the average retrieval times are constant for both rows and columns.

J-D directed me to HBASE-3484 and, as suggested by St^Ack, I added a comment to it about what
I saw.

So, the bottom line is that getting a single qualifier from a wide row should not, in general,
depend on the width of the row, and if it does, it's because of a performance bug.  Also,
this shouldn't be an issue in normal operation; the test program specifically created a case
with lots of updates followed immediately by retrievals.
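
To put rough numbers on the linear-versus-logarithmic expectation from the quoted message below, here is a minimal sketch. It is a toy model over a sorted Python list, not HBase's MemStore or StoreFile code; the helper names are hypothetical, and the only figures taken from the thread are the 250,000 qualifiers, the 4-byte qualifier size, and the "about a half second" estimate.

```python
import math
import struct

def make_qualifiers(n):
    # 4-byte big-endian qualifiers sort bytewise in numeric order.
    return [struct.pack(">I", i) for i in range(n)]

def linear_comparisons(qualifiers, target):
    # Count the equality checks made by a front-to-back scan.
    count = 0
    for q in qualifiers:
        count += 1
        if q == target:
            break
    return count

def binary_comparisons(qualifiers, target):
    # Count the loop iterations of a lower-bound binary search.
    lo, hi = 0, len(qualifiers)
    count = 0
    while lo < hi:
        mid = (lo + hi) // 2
        count += 1
        if qualifiers[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return count

N = 250_000                  # row width from the quoted message
quals = make_qualifiers(N)
last = quals[-1]             # worst case for a front-to-back scan

linear = linear_comparisons(quals, last)
binary = binary_comparisons(quals, last)

# Assumption: the ~0.5 s extrapolation is spread evenly over all N qualifiers.
per_qualifier_us = 0.5 / N * 1e6

print("linear scan comparisons:", linear)
print("binary search comparisons:", binary)
print("implied cost per qualifier (microseconds):", per_qualifier_us)
```

In this model the scan touches every qualifier in the row, while the binary search needs on the order of log2(250,000), roughly 18, comparisons. That gap is why retrieval time that grows with the qualifier's position looked like a bug rather than expected behavior.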


On Apr 26, 2011, at 11:29 AM, Joe Pallas wrote:

> I could use some additional feedback on what I should expect from HBase performance and
> how it affects schema design ("additional" because I had a brief exchange on IRC about this
>
> In short: we tried to load some data into our test system, and found we were spending
> a lot of time in HTable.exists.  The client was coded with some redundant checks that things
> did or did not exist.  (In fairness, those checks probably make sense outside of the bulk
> loading environment.)
>
> Our test system, running on a not-too-elderly 2.3GHz 8-core Opteron, used 4-byte row
> keys for the narrow test and 4-byte qualifiers for the wide test. The narrow test shows HTable.get
> and HTable.exists are basically constant time, and the wide test shows HTable.get and HTable.exists
> are basically linear in the ordinal position of the qualifier.
>
> The problem?  Extrapolating from that linear time performance for wide rows, if we have
> 250,000 qualifiers in a wide row, we can expect it to take about a half second to retrieve
> the last one.  I had thought that the access time would be logarithmic, rather than linear,
> since the qualifiers are all sorted (binary search).  That's not the case.
>
> Most of the schema design advice I've seen doesn't mention this when comparing rows vs
> columns.  But it looks like we can't get all three of random access, (row-level) atomicity,
> and scale-independent performance using wide rows.
>
> Is my analysis off-base?  I'd appreciate any advice on how to trade off those three desires.
>
> Thanks.
> joe
>
> PS I tried turning on ROWCOL Bloom filters, and I did not see a change in the performance
> of exists for non-existent qualifiers.  Shouldn't I have?  Could I have screwed up my test
