From: "Sharma, Avani"
To: "user@hbase.apache.org"
Date: Wed, 16 Jun 2010 16:22:04 -0700
Subject: RE: Hbase schema design question for time based data

>> Not sure exactly what you mean here but doesn't seem you would really need a secondary index to do what you want.
>> When using versioning you can always ask for "give me 10 latest versions" or "give me the 100 latest versions that occur after date X".

How can I do this in the HBase shell as well as the API? Say I want the latest version before a certain date?

-Avani

-----Original Message-----
From: Jonathan Gray [mailto:jgray@facebook.com]
Sent: Wednesday, June 16, 2010 11:40 AM
To: user@hbase.apache.org
Subject: RE: Hbase schema design question for time based data

> Hi,
>
> I am trying to design schema for some data to be moved from HDFS into
> HBase for real-time access.
> Questions -
>
> 1. Is the use of new API for bulk upload recommended over old API? If
> yes, is the new API stable and is there sample executable code around ?

Not sure if there is much sample code in branch but Todd Lipcon has done some great work in trunk that includes some example code I believe.

There's going to be a short presentation on HFileOutputFormat and bulk loading at the HUG on June 30th if you're interested in attending (http://meetup.com/hbaseusergroup).

In general it can make lots of sense for particular use cases, so sometimes it is recommended and sometimes not. Depends on the requirements.

> 2. The data is over time. I need to be able to retrieve the latest
> records before a particular date. Note that I do not know what
> timestamp that would be.
> I could need a user's profile data from a month or year earlier. How
> can this be achieved using Hbase in terms of schema?
>
> a. If the column values are small in size, can I use
> versioning for up to 100 values ?

Versioning can be used for thousands or possibly millions of versions of a single column. There are some performance TODOs related to making TimeRange queries more efficient that I am working on that are in the pipeline for the next couple months.

If you're generally reading the more recent versions then performance should be acceptable.
Reading back into some of the older ones will work but is currently not nearly as efficient as it can be.

> b. Should I maintain a secondary index for each date
> and the latest date/timestamp when profile data is generated/applicable
> to that date? Use this information
> to come up with user and timestamp key in the main table which would
> have user_ts as row_key and data in the columns ?

Not sure exactly what you mean here but doesn't seem you would really need a secondary index to do what you want. When using versioning you can always ask for "give me 10 latest versions" or "give me the 100 latest versions that occur after date X".

>
> c. for the columns, how do I decide between using
> multiple columns within a column family or multiple column families?

This depends on the read/write patterns. Do the different families have different access patterns? Do you often read from just one family and not the others, or write to just one family and not the others? This would be a good reason to split up into families. If the data all has a similar access pattern then should probably put them in a single family. Each family is basically like a table, each is stored separately on disk.

I think an in-person discussion would help a lot, since you are local (I am guessing), see if you can come by the Hackathon or HUG in two weeks and we can talk more on it. Can then post back to the list once we figure a decent solution to your use case.

JG
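
The versioned reads discussed in this thread ("give me 10 latest versions", "the latest version before a certain date") can be sketched with a small in-memory model. This is not the HBase client API: the class VersionedCellSketch and its methods put, get, and latestBefore are hypothetical names made up for illustration, and the half-open time range [minStamp, maxStamp) is an assumption about how such a range would be interpreted. In the HBase Java client the comparable knobs are, as far as I know, Get.setTimeRange(minStamp, maxStamp) and Get.setMaxVersions(n), but check the Javadoc for the release in use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory model of a single versioned cell; NOT the HBase API.
final class VersionedCellSketch {
    // Every write is kept, keyed by its timestamp in ascending order.
    private final NavigableMap<Long, String> versions = new TreeMap<>();

    void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    // Up to maxVersions values with minStamp <= timestamp < maxStamp, newest first.
    List<String> get(long minStamp, long maxStamp, int maxVersions) {
        List<String> out = new ArrayList<>();
        if (minStamp >= maxStamp) {
            return out; // empty range
        }
        // Restrict to the time range, then walk it from newest to oldest.
        for (String value : versions.subMap(minStamp, true, maxStamp, false)
                                    .descendingMap().values()) {
            if (out.size() >= maxVersions) {
                break;
            }
            out.add(value);
        }
        return out;
    }

    // Latest value written strictly before cutoff, or null if there is none.
    // Assumes non-negative timestamps, as with epoch milliseconds.
    String latestBefore(long cutoff) {
        List<String> hit = get(0L, cutoff, 1);
        return hit.isEmpty() ? null : hit.get(0);
    }

    public static void main(String[] args) {
        VersionedCellSketch profile = new VersionedCellSketch();
        profile.put(100L, "profile-v1");
        profile.put(200L, "profile-v2");
        profile.put(300L, "profile-v3");

        // "Latest version before a certain date": cutoff 250 falls between v2 and v3.
        System.out.println(profile.latestBefore(250L));
        // "The 10 latest versions that occur after date X", here X = 150.
        System.out.println(profile.get(150L, Long.MAX_VALUE, 10));
    }
}
```

The point of the model is that "latest before date X" is just a time-range read capped at one version, so the caller does not need to know the exact timestamp of the record and no secondary index is involved.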