hbase-user mailing list archives

From Ryan Rawson <ryano...@gmail.com>
Subject Re: Modeling column families
Date Sat, 24 Apr 2010 19:59:27 GMT
On Sat, Apr 24, 2010 at 12:22 AM, Andrey Stepachev <octo47@gmail.com> wrote:
> 2010/4/24 Andrew Nguyen <andrew-lists-hbase@ucsfcti.org>
>> Hello all,
>> Each row key is of the form "PatientName-PhysiologicParameter" and each
>> column name is the timestamp of the reading.
> With such a design in HBase (as opposed to Cassandra) you have to use row
> filters to get only part of the data (for example the last year), or use
> client-side filtering with a row scan.
> If the data series get big (>100) you will run into the intra-row
> scanning issue https://issues.apache.org/jira/browse/HBASE-1537,
> as I did. Another issue, as mentioned before, is scaling: HBase splits data
> by rows.
> You have to figure out how much data will be in a row; if it runs to
> hundreds, use a compound key (patient-code-date).
> If the rows are small, it may be easier to use (patient-code), because
> you can use Get operations with locks (if you need them), and with a
> dated key you can't (because scans don't yet honor locks).

This statement is happily obsolete - the 0.20.4 RC has new code that
ensures Gets and Scans never return partially updated rows. I
dislike the term 'honor locks' because it implies an implementation
strategy, and in this case Gets (which are now 1-row scans) and Scans
do not acquire locks to accomplish their tasks.  This is important
because if you acquired a row lock (which is exclusive) you could only
have 1 read or write operation at a time, whereas we really want 1
write operation and as many concurrent read operations as possible.

I really like compound keys because they are a well-understood data
modeling technique. People sometimes freak out when they think about
endlessly wide rows, and having this data modeling abstraction really
helps ease the transition from a relational DB to a non-relational
datastore.  I think you can do it either way, but I prefer compound
keys and tall tables when the number of operations per user is
expected to be very big.
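To make the compound-key idea concrete, here is a minimal sketch in plain
Python (not the HBase client API). HBase orders rows by raw byte comparison,
so the sketch assumes fixed-width, lexicographically sortable fields (ISO
dates); the key format and sample values are my own illustration, not from
the thread.

```python
def make_row_key(patient, code, date):
    """Build a compound patient-code-date key; '-'-joined for readability."""
    # ISO yyyy-mm-dd dates sort correctly as strings/bytes.
    return f"{patient}-{code}-{date}"

keys = [
    make_row_key("bob", "bp", "2010-04-01"),
    make_row_key("bob", "bp", "2010-03-15"),
    make_row_key("bob", "icp", "2010-04-01"),
    make_row_key("alice", "bp", "2010-04-01"),
]

# Byte-wise sorting (what HBase does with row keys) groups all of one
# patient's readings for one parameter together, in date order --
# exactly the layout a range scan needs.
for k in sorted(keys):
    print(k)
```

Because the keys already cluster by patient and parameter, "all blood
pressures for bob" becomes a contiguous slice of the table rather than a
scatter of wide-row cells.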

For example, if you are storing timeseries data for a monitoring
system, you might want one reading per row, since the number of points
for a single system might be arbitrarily large (think: 2+ years of
data). If the expected data set size per row is larger than what a
single machine could conceivably store, Cassandra would not work for
you (since each row must be stored entirely on a single (er, N)
node(s)).

>> Give me all blood pressures for Bob between two dates
>> Give me all blood pressures, and intracranial pressures for Bob from <date>
>> until present
> Looks like patient-code-date is the preferred way. In your case the model
> can be: patient-code-date -> series:value.
>> In other words, the queries will be very patient-centric, or
>> patient-physiologic parameter-centric.
>> Thanks,
>> Andrew
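Those two queries map directly onto start/stop row keys for a scan. Below is
a hedged sketch (plain Python, standing in for the Java client's
Scan(startRow, stopRow)) of how the boundaries would be computed over
patient-code-date keys; the key layout and sample data are assumptions for
illustration.

```python
def scan_range(patient, code, start_date, end_date_exclusive):
    """Start (inclusive) and stop (exclusive) keys for one parameter
    over a date range, mirroring HBase scan semantics."""
    return (f"{patient}-{code}-{start_date}",
            f"{patient}-{code}-{end_date_exclusive}")

# "Give me all blood pressures for Bob between two dates":
start, stop = scan_range("bob", "bp", "2010-01-01", "2010-04-01")

# A real scan would hand start/stop to the HBase client; here we just
# filter a sorted key list to show which rows fall in the range.
keys = sorted([
    "bob-bp-2009-12-31",
    "bob-bp-2010-02-10",
    "bob-bp-2010-03-31",
    "bob-bp-2010-04-01",
    "bob-icp-2010-02-10",
])
hits = [k for k in keys if start <= k < stop]
```

The second query ("from <date> until present") is the same idea with the
stop key set just past the patient-code prefix, so the scan runs to the
latest reading.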
