Date: Sat, 24 Apr 2010 12:59:27 -0700
Message-ID: <h2x78568af11004241259s50c7da83nb2e89e42ccfb2bcd@mail.gmail.com>
Subject: Re: Modeling column families
From: Ryan Rawson <ryanobjc@gmail.com>
To: hbase-user@hadoop.apache.org

On Sat, Apr 24, 2010 at 12:22 AM, Andrey Stepachev <octo47@gmail.com> wrote:
> 2010/4/24 Andrew Nguyen <andrew-lists-hbase@ucsfcti.org>
>
>> Hello all,
>>
>> Each row key is of the form "PatientName-PhysiologicParameter" and each
>> column name is the timestamp of the reading.
>>
>
> With such a design in HBase (as opposed to Cassandra) you should use row
> filters to get only part of the data (for example, the last year) or use
> client-side filtering with a row scan.
> If the data series is big (>100) you will run into the issue of intra-row
> scanning, https://issues.apache.org/jira/browse/HBASE-1537,
> as I did. Another issue, as mentioned before, is scaling. HBase splits data
> by rows.
>
> You have to figure out how much data will be in a row, and if it runs into
> the hundreds, use a compound key (patient-code-date).
> If rows are small, it may be easier to use (patient-code), because
> you can use Get operations with locks (if you need them), and in the case
> of a dated key, you can't (because scan doesn't yet honor locks).

This statement is happily obsolete: the 0.20.4 RC has new code that
ensures Gets and Scans never return partially updated rows. I
dislike the term 'honor locks' because it implies an implementation
strategy, and in this case Gets (which are now 1-row scans) and Scans
do not acquire locks to accomplish their tasks.  This is important
because if reads acquired the row lock (which is exclusive) you would
only be able to have 1 read or write operation on a row at a time,
whereas we really want 1 write operation and as many concurrent read
operations as needed.

I really like compound keys because they are a well-understood data
modeling technique. People sometimes freak out when they think about
endlessly wide rows, and having this data modeling abstraction really
helps buffer the transition from a relational DB to a non-relational
datastore.  I think you can do it either way, but I prefer compound
keys and tall tables when the number of operations per user is
expected to be very big.
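
To make the compound-key idea concrete, here is a minimal sketch in
plain Python (not the HBase client API). The key layout
"patient-parameter-timestamp", the zero-padded epoch seconds, and the
helper names make_row_key and scan_bounds are all assumptions made up
for illustration. The point is only that fixed-width timestamps make
byte order match time order, so a date-range query becomes one
contiguous row range.

```python
from datetime import datetime, timezone

# Assumed layout: zero-padded epoch seconds, so byte order == time order.
TS_WIDTH = 10


def make_row_key(patient: str, parameter: str, when: datetime) -> bytes:
    """Build a hypothetical 'patient-parameter-timestamp' row key."""
    ts = int(when.timestamp())
    return f"{patient}-{parameter}-{ts:0{TS_WIDTH}d}".encode("utf-8")


def scan_bounds(patient: str, parameter: str,
                start: datetime, stop: datetime) -> tuple:
    """Return (start_row, stop_row) covering readings in [start, stop)."""
    return (make_row_key(patient, parameter, start),
            make_row_key(patient, parameter, stop))


if __name__ == "__main__":
    utc = timezone.utc
    start_row, stop_row = scan_bounds(
        "bob", "bp",
        datetime(2010, 1, 1, tzinfo=utc),
        datetime(2010, 5, 1, tzinfo=utc),
    )
    print(start_row, stop_row)
```

Note that this naive layout breaks if a patient or parameter name
itself contains "-"; a real schema would need fixed-width or otherwise
unambiguous fields.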

For example, if you are storing timeseries data for a monitoring
system, you might want to store it by row (one row per reading), since
the number of points for a single system might be arbitrarily large
(think: 2 years+ of data). If the expected data set size per row is
larger than what a single machine could conceivably store, Cassandra
would not work for you, since each row must be stored on a single
node (er, N nodes).
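
A toy illustration of why the tall layout spreads out and the wide one
does not. This is a simulation in plain Python of sorted split points
mapping row keys to regions, not HBase's actual region-location code.
The function region_for, the host name, and the split keys are
invented for the example.

```python
from bisect import bisect_right


def region_for(row_key: bytes, split_keys: list) -> int:
    """Index of the region holding row_key, given sorted split points.

    Region 0 covers keys below split_keys[0]; region i covers
    [split_keys[i-1], split_keys[i]); the last region is open-ended.
    """
    return bisect_right(split_keys, row_key)


# Hypothetical split points inside one host's timeseries.
split_keys = [b"host1-cpu-1250000000", b"host1-cpu-1270000000"]

# Tall table: one row per reading, timestamp is part of the row key.
tall_rows = [b"host1-cpu-%010d" % ts
             for ts in (1240000000, 1260000000, 1280000000)]
tall_regions = {region_for(key, split_keys) for key in tall_rows}

# Wide table: all readings are columns of a single row.
wide_regions = {region_for(b"host1-cpu", split_keys)}

if __name__ == "__main__":
    print("tall layout touches regions:", sorted(tall_regions))
    print("wide layout touches regions:", sorted(wide_regions))
```

With compound keys the split points can fall between one system's
readings, so its data can live on several machines; a single wide row
always lands in exactly one region.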


>
>
>> Give me all blood pressures for Bob between two dates
>> Give me all blood pressures, and intracranial pressures for Bob from <date>
>> until present
>>
>
> Looks like patient-code-date is the preferred way. In your case the model
> can be: patient-code-date -> series:value.
>
>
>> In other words, the queries will be very patient-centric, or
>> patient-physiologic parameter-centric.
>>
>> Thanks,
>> Andrew
>