hbase-user mailing list archives

From Eric <eric.x...@gmail.com>
Subject Re: Schema Design: Query by time; more on rows versus columns
Date Sat, 01 Jan 2011 10:46:00 GMT
I have little experience with HBase so far, but my feeling is that it should
not matter how many rows you store, and that it's better to save on CPU time
and bandwidth. HBase distributes the data evenly over your cluster and should
be very good at making rows accessible quickly by key, because it's cheap to
find out where a (key, value) pair lives in the cluster. So you should
exploit that.
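
(For example, a single-row read is one targeted operation in the Java client;
the table name and the key variables below are placeholders:)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "A");
    // The client looks up which region holds this exact key (cached after
    // the first call) and issues a single RPC to that region server.
    Get get = new Get(Bytes.add(userIdBytes, itemIdBytes)); // composite row key
    Result result = table.get(get);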


2010/12/31 Michael Russo <mjrusso@gmail.com>

> Hi all,
>
> I have a schema design question that I would like to run by the group.
>
> Table "A" has a composite row key containing a user identifier and an item
> identifier.
>
> I need to query this table by time, to determine the latest X new or updated
> rows (on a per-user basis).  (This is a "live" query and not part of a
> MapReduce workflow.)
>
> To support this query, I have designed an additional table (Table "B") with
> the following schema:
>
> - row key: composite key, containing a user identifier (fixed width) and a
> reverse-order timestamp (fixed width); both parts stored in their binary form
> - a single CF
> - each column qualifier is the actual item identifier; this is used to form
> part of the row key for table A
> - cell values are always empty
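>
> (To make the layout concrete, a rough sketch in Java of the write path; the
> 8-byte long widths and the column family name "f" are my assumptions, and
> Bytes, Put, and HConstants come from the standard HBase client packages:)
>
>     // Composite row key: fixed-width user id + fixed-width reverse
>     // timestamp, so a user's newest events sort first in their key range.
>     byte[] rowKey = Bytes.add(
>         Bytes.toBytes(userId),                        // user id as an 8-byte long
>         Bytes.toBytes(Long.MAX_VALUE - eventTimeMs)); // reverse-order timestamp
>
>     // One column per item, with an empty cell value.
>     Put put = new Put(rowKey);
>     put.add(Bytes.toBytes("f"), itemIdBytes, HConstants.EMPTY_BYTE_ARRAY);
>     tableB.put(put); // tableB is an HTable opened on table "B"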
>
> There are two options for the timestamp component of table B's row key:
>
> 1) use the exact timestamp that the event occurred at for the row key.
>
> or,
>
> 2) set a "max timespan" to control the maximum time delta that can be stored
> per row; quantize the timestamp of each row to start/end along these "max
> timespan" boundaries.
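>
> (The difference between the two, in Java; MAX_TIMESPAN_MS is a made-up
> constant:)
>
>     long ts = eventTimeMs;
>
>     // Option 1: exact event time, reversed so the newest row sorts first.
>     long reversed = Long.MAX_VALUE - ts;
>
>     // Option 2: quantize to the start of the enclosing "max timespan"
>     // bucket, then reverse; all events in a bucket land in the same row.
>     long MAX_TIMESPAN_MS = 24L * 60 * 60 * 1000; // e.g. one day
>     long bucketStart = ts - (ts % MAX_TIMESPAN_MS);
>     long reversedBucket = Long.MAX_VALUE - bucketStart;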
>
> The number of rows added or modified in table A will vary on a per-user
> basis, and will range anywhere from 50 to 5000 modifications per user per
> day.  The size of the user base is measured in millions, so there will be
> significant write traffic.
>
> With design 1), almost every addition/modification will result in a new row.
>
> With design 2), the number of rows is dependent upon the timespan; if set to
> one day, for example, there will be one row per user per day with ~50 to
> 5000 columns.
>
> To satisfy the requisite query, given design 1): perform a scan, specifying
> a start key that is simply the user's identifier, and limit the number of
> rows to X.
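>
> (Roughly, with the Java client; X is the desired count, and the
> Bytes.startsWith check stops the scan before it crosses into the next
> user's key range:)
>
>     Scan scan = new Scan(userIdBytes); // starts at this user's newest row
>     scan.setFilter(new PageFilter(X)); // org.apache.hadoop.hbase.filter.PageFilter
>     ResultScanner scanner = tableB.getScanner(scan);
>     int count = 0;
>     for (Result r : scanner) {
>         if (!Bytes.startsWith(r.getRow(), userIdBytes) || ++count > X) break;
>         // each column qualifier in r is an item id, i.e. part of a table A row key
>     }
>     scanner.close();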
>
> Given design 2), the query is less straightforward, because we do not know
> how many columns are contained per row before issuing the scan.  However, I
> can guess roughly how many rows I will need, add a buffer, and limit the
> scan to this number of rows.  Then, given the response, I can sort the
> results client-side by timestamp, and limit to X.
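>
> (A sketch of that client-side step, assuming the HBase cell timestamps
> reflect event times; G stands for my guessed row count plus buffer:)
>
>     Scan scan = new Scan(userIdBytes);
>     scan.setFilter(new PageFilter(G));
>     List<KeyValue> cells = new ArrayList<KeyValue>();
>     ResultScanner scanner = tableB.getScanner(scan);
>     for (Result r : scanner) {
>         if (!Bytes.startsWith(r.getRow(), userIdBytes)) break;
>         cells.addAll(r.list()); // each KeyValue carries its cell timestamp
>     }
>     scanner.close();
>     Collections.sort(cells, new Comparator<KeyValue>() {
>         public int compare(KeyValue a, KeyValue b) { // newest first
>             return a.getTimestamp() > b.getTimestamp() ? -1
>                  : a.getTimestamp() < b.getTimestamp() ? 1 : 0;
>         }
>     });
>     List<KeyValue> latest = cells.subList(0, Math.min(X, cells.size()));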
>
> I took a look at the OpenTSDB source code after hearing about it at
> HadoopWorld, and the implementation is similar to the second design
> (quantizing event timestamps and storing multiple events per row), although
> the use case/queries are different.
>
> I currently have option 2 mostly implemented, but I wanted to run this by
> the community to make sure that I was on the right track. Is design 2) the
> better choice?  The query is not as clean as that in design 1), and it
> requires extra bandwidth per query and client-side CPU for sorting, but it
> very significantly reduces the number of required rows (by several orders of
> magnitude).  Alternatively, is there an even better implementation path that
> I have not considered?
>
> Thanks all for any help and advice.
>
> Happy New Year,
> Michael
>
