Subject: Re: Modeling column families
From: Ryan Rawson
To: hbase-user@hadoop.apache.org
Date: Sat, 24 Apr 2010 12:59:27 -0700

On Sat, Apr 24, 2010 at 12:22 AM, Andrey Stepachev wrote:
> 2010/4/24 Andrew Nguyen
>
>> Hello all,
>>
>> Each row key is of the form "PatientName-PhysiologicParameter" and each
>> column name is the timestamp of the reading.
>>
>
> With such a design in HBase (as opposed to Cassandra) you should use row
> filters to get only part of the data (for example, the last year) or use
> client-side filtering with a row scan.
> If a data series is big (>100) you will run into the intra-row
> scanning issue, https://issues.apache.org/jira/browse/HBASE-1537,
> as I did. Another issue, as mentioned before, is scaling: HBase splits
> data by rows.
>
> You have to figure out how much data will be in a row, and if it runs
> into the hundreds, use a compound key (patient-code-date).
> If rows are small, (patient-code) may be easier to use, because
> you can use Get operations with locks (if you need them), whereas with a
> dated key you can't (because scan doesn't yet honor locks).

Happily, this statement is obsolete: the 0.20.4 RC has new code that
ensures Gets and Scans never return partially updated rows.
I dislike the term 'honor locks' because it implies an implementation
strategy, and in this case Gets (which are now 1-row scans) and Scans do
not acquire locks to accomplish their tasks. This is important because
a row lock is exclusive: if reads acquired one, you could have only one
read or write operation at a time, whereas we really want one write
operation and as many concurrent read operations as needed.

I really like compound keys because they are a well-understood data
modeling problem. People sometimes freak out when they think about
endlessly wide rows, and having this data modeling abstraction really
helps buffer the transition from a relational DB to a non-relational
datastore.

I think you can do it either way, but I prefer compound keys and tall
tables when the number of operations per user is expected to be very
big. For example, if you are storing timeseries data for a monitoring
system, you might want to store it by row, since the number of points
for a single system might be arbitrarily large (think: 2+ years of
data). If the expected data set size per row is larger than what a
single machine could conceivably store, Cassandra would not work for
you, since each row must be stored on a single (er, N) node(s).

>
>
>> Give me all blood pressures for Bob between two dates
>> Give me all blood pressures, and intracranial pressures for Bob from
>> until present
>>
>
> Looks like patient-code-date is the preferred way. In your case the
> model can be:
> patient-code-date -> series:value.
>
>
>> In other words, the queries will be very patient-centric, or
>> patient-physiologic parameter-centric.
>>
>> Thanks,
>> Andrew
>
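
To make the patient-code-date idea from this thread concrete, here is a
rough Python sketch of how a compound key turns "all blood pressures for
Bob between two dates" into one contiguous range scan. It is only an
illustration under stated assumptions: TallTable, row_key, the "-"
separator, the ISO date format, and the sample readings are all
hypothetical, and the sorted in-memory list merely stands in for a table
ordered by row key. It is not the HBase client API.

```python
import bisect

SEP = "-"  # hypothetical separator; assumes patient and code never contain it


def row_key(patient, code, date):
    """Build a compound key. ISO dates (YYYY-MM-DD) sort in time order as strings."""
    return SEP.join((patient, code, date))


class TallTable:
    """Toy stand-in for a table kept sorted by row key (not the HBase API)."""

    def __init__(self):
        self._keys = []    # row keys in sorted order
        self._values = {}  # row key -> stored value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start_key, stop_key):
        """Return (key, value) pairs with start_key <= key < stop_key."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


table = TallTable()
table.put(row_key("bob", "bp", "2010-01-15"), 120)
table.put(row_key("bob", "bp", "2010-03-02"), 118)
table.put(row_key("bob", "bp", "2010-04-20"), 125)
table.put(row_key("bob", "icp", "2010-03-02"), 11)
table.put(row_key("alice", "bp", "2010-03-02"), 110)

# One contiguous range covers Bob's blood pressures between the two dates.
bob_bp = table.scan(row_key("bob", "bp", "2010-02-01"),
                    row_key("bob", "bp", "2010-05-01"))
```

Because each reading is its own row, a patient's series can grow without
any single row becoming large. The second query in the thread (two
parameters from some date until present) would be two such scans, one
per parameter code.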