hbase-user mailing list archives

From Derek Wollenstein <de...@klout.com>
Subject Key Design Question for list data
Date Mon, 02 Apr 2012 20:10:47 GMT
We're looking at how to store a large amount of (per-user) list data in
HBase, and we're trying to figure out which access pattern makes the most
sense.  One option is to store most of the data in the row key, so we
would have something like
<FixedWidthUserName><FixedWidthValueId1>:"" (no value)
<FixedWidthUserName><FixedWidthValueId2>:"" (no value)
<FixedWidthUserName><FixedWidthValueId3>:"" (no value)
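
A minimal sketch of how such a fixed-width row key could be built, so that
byte-wise key order matches (user, value id) order.  The 16-byte user width,
the 8-byte big-endian value id, and the name `row_key` are assumptions made
for illustration; the post does not specify any widths.

```python
import struct

USER_WIDTH = 16  # assumed fixed width for the user name, in bytes


def row_key(user: str, value_id: int) -> bytes:
    """Build <FixedWidthUserName><FixedWidthValueId> as raw bytes."""
    name = user.encode("ascii")
    if len(name) > USER_WIDTH:
        raise ValueError("user name longer than the fixed width")
    # Pad the name with NUL bytes, then append the id as an unsigned
    # big-endian 64-bit integer so that byte order equals numeric order.
    return name.ljust(USER_WIDTH, b"\x00") + struct.pack(">Q", value_id)
```

Big-endian encoding matters here: a decimal string such as "10" would sort
before "2", whereas the packed integers sort numerically.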

The other option we had was to store the lists in pages, using
<FixedWidthUserName><FixedWidthPageNum0>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...
<FixedWidthUserName><FixedWidthPageNum1>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...

where each row would contain multiple values.
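
A sketch of how one page cell in this layout might be packed and unpacked.
The 4-byte length, the 4-byte next-page number, the 8-byte value ids, and the
`NO_NEXT_PAGE` sentinel are all assumptions for illustration; only the field
order comes from the layout above.

```python
import struct

HEADER = struct.Struct(">II")   # assumed: value count, next page number
NO_NEXT_PAGE = 0xFFFFFFFF       # hypothetical sentinel for "last page"


def pack_page(value_ids, next_page=NO_NEXT_PAGE) -> bytes:
    """Encode <Length><NextPageNum><ValueId1><ValueId2>... as one cell."""
    body = struct.pack(f">{len(value_ids)}Q", *value_ids)
    return HEADER.pack(len(value_ids), next_page) + body


def unpack_page(cell: bytes):
    """Decode a cell back into (list of value ids, next page number)."""
    count, next_page = HEADER.unpack_from(cell, 0)
    ids = struct.unpack_from(f">{count}Q", cell, HEADER.size)
    return list(ids), next_page
```

With these widths a full page of 30 values is 8 + 30 * 8 = 248 bytes in a
single cell.
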
So in one case reading the first thirty values would be
scan { STARTROW => 'FixedWidthUserName', LIMIT => 30}
And in the second case it would be
get 'FixedWidthUserName\x00\x00\x00\x00'
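
The two read paths can be emulated over a sorted list of keys to make the
difference concrete.  This is only an in-memory sketch, not HBase client
code; `scan_prefix`, `first_page_key`, and the 16-byte user width are
hypothetical.  Note that the emulated scan also stops when the user prefix
no longer matches: with only a start row and a limit, a user with fewer than
30 values would read into the next user's rows.

```python
import bisect
import struct

USER_WIDTH = 16  # assumed fixed width for the user name, in bytes


def pad_user(user: str) -> bytes:
    """NUL-pad the user name to the assumed fixed width."""
    return user.encode("ascii").ljust(USER_WIDTH, b"\x00")


def scan_prefix(sorted_keys, prefix: bytes, limit: int):
    """Emulate a bounded scan: seek to the prefix, read up to `limit` rows."""
    start = bisect.bisect_left(sorted_keys, prefix)
    rows = []
    for key in sorted_keys[start:]:
        if len(rows) == limit or not key.startswith(prefix):
            break  # limit reached, or we ran past this user's rows
        rows.append(key)
    return rows


def first_page_key(user: str) -> bytes:
    """Row key for page 0 in the paged layout: one point lookup."""
    return pad_user(user) + struct.pack(">I", 0)
```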

The general usage pattern would be to read only the first 30 values of
these lists, with infrequent access reading deeper into the lists.  Some
users would have <= 30 total values in these lists, and some users would
have millions (i.e. a power-law distribution).

The single-value format seems like it would take up more space in HBase,
since every value gets its own row key, but would offer more flexibility
for retrieval and pagination.  Would there be any significant performance
advantage to paginating via gets vs. paginating with scans?
My initial understanding was that a scan should be faster if our page size
is unknown (and caching is set appropriately), but that gets should be
faster if we always need the same page size.  Different people have told
me opposite things about performance.  I assume the page sizes would be
relatively consistent, so for most use cases we could guarantee that we
only want one page of data in the fixed-page-length case.  I would also
assume that updates would be infrequent, but we may have inserts into the
middle of these lists (meaning we'd need to update all subsequent rows).
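
To illustrate the cost of a mid-list insert in the paged layout, here is a
small in-memory sketch that reports which pages would have to be rewritten.
It assumes pages are kept densely packed at a fixed size; `insert_into_pages`
and the page size are hypothetical, and a real design could instead allow
partially filled pages to limit the ripple.

```python
PAGE_SIZE = 30  # assumed number of values per page row


def insert_into_pages(pages, position, value_id, page_size=PAGE_SIZE):
    """Insert value_id at a list position; return page numbers rewritten.

    `pages` is a list of lists of value ids and is updated in place.
    """
    flat = [v for page in pages for v in page]
    if not 0 <= position <= len(flat):
        raise IndexError("position outside the list")
    flat.insert(position, value_id)
    # Re-chunk into dense pages, then compare against the old pages.
    new_pages = [flat[i:i + page_size] for i in range(0, len(flat), page_size)]
    rewritten = [
        n for n, page in enumerate(new_pages)
        if n >= len(pages) or page != pages[n]
    ]
    pages[:] = new_pages
    return rewritten
```

In the one-row-per-value layout the same insert is a single put, provided
the new value id sorts into the desired position.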

Thanks for any help, suggestions, or follow-up questions.

--Derek
