From: Jonathan Gray <jgray@facebook.com>
To: user@hbase.apache.org
Subject: RE: Hbase schema design question for time based data
Date: Thu, 17 Jun 2010 19:10:36 +0000
I'm not terribly familiar with the shell API, and I don't think it fully covers the Java API.

Let's say I want the 3 latest versions of rowX, columnY that occur before timeT. With the Java API you can do something like:

new Get(somerow).setTimeRange(0, timeT).setMaxVersions(3)

That means: I want versions in the range from 0 to timeT (before timeT), and of those I only want the 3 latest.

JG

> -----Original Message-----
> From: Sharma, Avani [mailto:agsharma@ebay.com]
> Sent: Wednesday, June 16, 2010 4:22 PM
> To: user@hbase.apache.org
> Subject: RE: Hbase schema design question for time based data
>
> >> Not sure exactly what you mean here but doesn't seem you would
> >> really need a secondary index to do what you want. When using
> >> versioning you can always ask for "give me the 10 latest versions" or
> >> "give me the 100 latest versions that occur after date X".
>
> How can I do this on the hbase shell as well as the API? Say I want the
> latest version before a certain date?
>
> -Avani
>
> -----Original Message-----
> From: Jonathan Gray [mailto:jgray@facebook.com]
> Sent: Wednesday, June 16, 2010 11:40 AM
> To: user@hbase.apache.org
> Subject: RE: Hbase schema design question for time based data
>
> > Hi,
> >
> > I am trying to design a schema for some data to be moved from HDFS
> > into HBase for real-time access.
> > Questions -
> >
> > 1. Is the use of the new API for bulk upload recommended over the old
> > API? If yes, is the new API stable, and is there sample executable
> > code around?
>
> Not sure if there is much sample code in branch, but Todd Lipcon has
> done some great work in trunk that includes some example code, I
> believe.
>
> There's going to be a short presentation on HFileOutputFormat and bulk
> loading at the HUG on June 30th if you're interested in attending
> (http://meetup.com/hbaseusergroup).
>
> In general it can make lots of sense for particular use cases, so
> sometimes it is recommended and sometimes not. Depends on the
> requirements.
>
>
> > 2. The data is over time. I need to be able to retrieve the latest
> > records before a particular date. Note that I do not know what
> > timestamp that would be.
> > I could need a user's profile data from a month or a year earlier.
> > How can this be achieved using HBase in terms of schema?
> >
> > a. If the column values are small in size, can I use
> > versioning for up to 100 values?
>
> Versioning can be used for thousands or possibly millions of versions
> of a single column. There are some performance TODOs related to making
> TimeRange queries more efficient that I am working on; those are in the
> pipeline for the next couple of months.
>
> If you're generally reading the more recent versions, then performance
> should be acceptable. Reading back into some of the older ones will
> work but is currently not nearly as efficient as it could be.
>
>
> > b. Should I maintain a secondary index for each date,
> > and the latest date/timestamp when profile data is generated/applicable
> > to that date? Use this information
> > to come up with a user-and-timestamp key in the main table, which would
> > have user_ts as row_key and data in the columns?
>
> Not sure exactly what you mean here, but it doesn't seem you would really
> need a secondary index to do what you want. When using versioning you
> can always ask for "give me the 10 latest versions" or "give me the 100
> latest versions that occur after date X".
>
> > c. For the columns, how do I decide between using
> > multiple columns within a column family or multiple column families?
>
> This depends on the read/write patterns. Do the different families
> have different access patterns? Do you often read from just one family
> and not the others, or write to just one family and not the others?
> That would be a good reason to split up into families. If the data all
> has a similar access pattern, then you should probably put it in a
> single family. Each family is basically like a table; each is stored
> separately on disk.
>
> I think an in-person discussion would help a lot, since you are local
> (I am guessing). See if you can come by the Hackathon or HUG in two
> weeks and we can talk more about it. We can then post back to the list
> once we figure out a decent solution for your use case.
>
> JG
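[Editor's note appended to the archived thread: to make the versioning semantics discussed above concrete, here is a small, self-contained Java sketch that models what a Get with setTimeRange(0, timeT).setMaxVersions(3) returns. The VersionedColumn class, its method names, and the sample timestamps are illustrative stand-ins, not the HBase client API; the sketch assumes HBase's behavior of a half-open time range [minTs, maxTs) and newest-first version ordering.]

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical stand-in for one versioned HBase cell: a single column's
// values keyed by timestamp. This is NOT the HBase client API; it only
// models what Get.setTimeRange(min, max).setMaxVersions(n) returns.
public class VersionedColumn {
    // Descending key order so the newest timestamp comes first,
    // matching HBase's newest-first version ordering.
    private final NavigableMap<Long, String> versions =
            new TreeMap<>(Collections.reverseOrder());

    public void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    // "Give me the maxVersions latest values with timestamp in [minTs, maxTs)."
    public List<String> get(long minTs, long maxTs, int maxVersions) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Long, String> e : versions.entrySet()) {
            long ts = e.getKey();
            if (ts >= maxTs) continue; // at or after upper bound: excluded
            if (ts < minTs) break;     // older than range: done (descending)
            out.add(e.getValue());
            if (out.size() == maxVersions) break;
        }
        return out;
    }

    public static void main(String[] args) {
        VersionedColumn col = new VersionedColumn();
        col.put(10L, "v10");
        col.put(20L, "v20");
        col.put(30L, "v30");
        col.put(40L, "v40");
        // Analogous to new Get(row).setTimeRange(0, 35).setMaxVersions(3):
        // the 3 newest versions strictly before timestamp 35.
        System.out.println(col.get(0L, 35L, 3)); // prints [v30, v20, v10]
    }
}
```

Note how "v40" is excluded: the upper bound of the time range is exclusive, which is what makes "latest versions before timeT" expressible as setTimeRange(0, timeT) in the thread above.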