Subject: Re: Dynamic Data Sets
From: Michel Segel
Reply-To: common-user@hadoop.apache.org
To: common-user@hadoop.apache.org
Date: Thu, 14 Apr 2011 14:19:22 -0500
In-Reply-To: <6B870DC5-B5EA-4F9A-BF77-B824C1361958@tynt.com>

Sorry, it appears to be a flock of us... OK, bad pun.

I didn't see Ted's response, but it looks like we're thinking along the same lines.

I was going to ask about that, but it's really a moot point. The size of the immutable data set doesn't really matter; the solution would be the same. Consider it some blob which is >= the size of a SHA-1 hash value. In fact, that hash could be your unique key.

So you get your blob, a timestamp, and then a state value. You hash the blob, store the blob in one table using the hash as the key, and then store the state in a second table, using the hash as the row key and the timestamp as the column name. Two separate tables, because if you stored them as two column families in one table you might see performance issues due to the size difference between the families.

This would be a pretty straightforward solution in HBase. (A rough sketch of the layout is at the bottom of this message.)

Sent from a remote device. Please excuse any typos...

Mike Segel

On Apr 14, 2011, at 12:18 PM, James Seigel Tynt wrote:

> If all the Seigel/Seigal/Segel gang didn't chime in, it'd be weird.
>
> What size of data are we talking?
>
> James
>
> On 2011-04-14, at 11:06 AM, Michael Segel wrote:
>
>> James,
>>
>> If I understand correctly, you get a set of immutable attributes, plus a state which can change.
>>
>> If you wanted to use HBase...
>> I'd say create a unique identifier for your immutable attributes, then store the unique id, timestamp, and state, assuming
>> that you're really interested in looking at the state change over time.
>>
>> So what you end up with is one table of immutable attributes with a unique key, and then another table where you use the same unique key and create columns whose names are timestamps, with the state as the value.
>>
>> HTH
>>
>> -Mike
>>
>> ----------------------------------------
>>> Date: Wed, 13 Apr 2011 18:12:58 -0700
>>> Subject: Dynamic Data Sets
>>> From: selekt86@yahoo.com
>>> To: common-user@hadoop.apache.org
>>>
>>> I have a requirement where I have large sets of incoming data into a
>>> system I own.
>>>
>>> A single unit of data in this set has a set of immutable attributes plus
>>> state attached to it. The state is dynamic and can change at any time.
>>> What is the best way to run analytical queries on data of such a nature?
>>>
>>> One way is to maintain this data in a separate store, take a point-in-time
>>> snapshot, and then import it into HDFS for analysis with Hadoop MapReduce.
>>> I do not see this approach scaling, since moving data is obviously expensive.
>>> If I were to maintain this data directly as SequenceFiles in HDFS, how
>>> would updates work?
>>>
>>> I am new to Hadoop/HDFS, so any suggestions/critique are welcome. I
>>> know that HBase works around this problem through multi-version
>>> concurrency control techniques. Is that the only option? Are there
>>> any alternatives?
>>>
>>> Also note that all the aggregation and analysis I want to do is time-based,
>>> i.e. sum of x on pivot y over a day, 2 days, a week, a month, etc. For such
>>> use cases, is it advisable to use HDFS directly, or to use systems built
>>> on top of Hadoop like Hive or HBase?
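
A rough sketch of the two-table layout described above, written against the 0.90-era HBase Java client. The table names ("immutable_blobs", "state_history"), the column families ("d", "s"), and the sample record are made up for illustration, and the code assumes both tables already exist with those families. Treat it as a sketch of the idea, not a definitive implementation.

import java.security.MessageDigest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DynamicDataSetSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Hypothetical tables: one holds the immutable blobs, the other
        // holds the state history. Both are keyed by the SHA-1 hash of
        // the blob, so a given blob is stored exactly once.
        HTable blobs = new HTable(conf, "immutable_blobs");
        HTable states = new HTable(conf, "state_history");

        byte[] blob = Bytes.toBytes("...serialized immutable attributes...");
        long timestamp = System.currentTimeMillis();
        byte[] state = Bytes.toBytes("ACTIVE");

        // Row key = SHA-1 hash of the immutable blob.
        byte[] rowKey = MessageDigest.getInstance("SHA-1").digest(blob);

        // Table 1: store the blob once, under its hash.
        Put storeBlob = new Put(rowKey);
        storeBlob.add(Bytes.toBytes("d"), Bytes.toBytes("blob"), blob);
        blobs.put(storeBlob);

        // Table 2: one column per state change; the column qualifier is
        // the timestamp, the cell value is the state at that time.
        Put storeState = new Put(rowKey);
        storeState.add(Bytes.toBytes("s"), Bytes.toBytes(Long.toString(timestamp)), state);
        states.put(storeState);

        blobs.close();
        states.close();
    }
}

Reading a single row of "state_history" then gives the full state timeline for one blob, which is what the time-based rollups in the original question would run over.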