Subject: Re: Dynamic Data Sets
From: Sam Seigal
To: common-user@hadoop.apache.org
Cc: Michel Segel
Date: Thu, 14 Apr 2011 17:55:02 -0700

How does HBase compare to Hive when it comes to dynamic data sets? Does Hive support multi-version concurrency control? I am new to Hadoop, so I am trying to get an idea of how to evaluate these different technologies and provide concrete justifications for choosing one over the other.

Also, I am not interested in how a state changes over time.
I am only interested in what the current state of a data unit is, and in then aggregating it with other data in the same state over a time range (e.g. 5000 records exist in state A on April 14th, 2000 records exist in state B on April 13th, etc.). The analysis will vary depending on how the state changes over time.

On Thu, Apr 14, 2011 at 12:19 PM, Michel Segel wrote:
> Sorry,
> It appears to be a flock of us...
>
> Ok, bad pun...
>
> I didn't see Ted's response, but it looks like we're thinking along the same lines of thought.
> I was going to ask about that... but it's really a moot point. The size of the immutable data set doesn't really matter. The solution would be the same. Consider it some blob which is >= the size of a SHA-1 hash value. In fact, that could be your unique key.
>
> So you get your blob, timestamp and then state value. You hash the blob, store the blob in one table using the hash as the key value, and then store the state in a column with the timestamp as the column name and the hash value as the row key. Two separate tables, because if you stored them as separate column families you may have some performance issues due to a size difference between the column families.
>
> This would be a pretty straightforward solution in HBase.
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
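For concreteness, here is a rough sketch of what that two-table layout might look like with the HBase Java client. The table names ("entities", "states"), the column family, and the qualifiers are invented for illustration, and it uses the newer Connection/Table client API rather than the HTable API that was current when this thread was written; treat it as a sketch of the idea, not a drop-in implementation.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of the two-table layout described above:
//   "entities" -- row key = SHA-1 of the immutable attributes, one column holding the raw blob
//   "states"   -- row key = the same SHA-1, one column per state change, qualifier = timestamp
// Table names, the column family "d", and the qualifier "blob" are invented for illustration.
public class DynamicStateWriter {

    private static final byte[] CF = Bytes.toBytes("d");

    public static void recordState(Connection conn, byte[] immutableBlob,
                                   long timestampMillis, String state)
            throws IOException, NoSuchAlgorithmException {
        // Hash the immutable attributes; the digest becomes the row key in both tables.
        byte[] rowKey = MessageDigest.getInstance("SHA-1").digest(immutableBlob);

        // Table 1: the immutable blob, keyed by its hash (re-writing it is idempotent).
        try (Table entities = conn.getTable(TableName.valueOf("entities"))) {
            Put p = new Put(rowKey);
            p.addColumn(CF, Bytes.toBytes("blob"), immutableBlob);
            entities.put(p);
        }

        // Table 2: one column per state change, with the timestamp as the column qualifier.
        try (Table states = conn.getTable(TableName.valueOf("states"))) {
            Put p = new Put(rowKey);
            p.addColumn(CF, Bytes.toBytes(timestampMillis), Bytes.toBytes(state));
            states.put(p);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            byte[] blob = "some immutable attributes".getBytes(StandardCharsets.UTF_8);
            recordState(conn, blob, System.currentTimeMillis(), "A");
        }
    }
}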
> On Apr 14, 2011, at 12:18 PM, James Seigel Tynt wrote:
>
>> If all the Seigel/Seigal/Segel gang didn't chime in, it'd be weird.
>>
>> What size of data are we talking?
>>
>> James
>>
>> On 2011-04-14, at 11:06 AM, Michael Segel wrote:
>>
>>> James,
>>>
>>> If I understand you, you get a set of immutable attributes, then a state which can change.
>>>
>>> If you wanted to use HBase...
>>> I'd say create a unique identifier for your immutable attributes, then store the unique id, timestamp, and state, assuming that you're really interested in looking at the state change over time.
>>>
>>> So what you end up with is one table of immutable attributes, with a unique key, and then another table where you can use the same unique key and create columns with column names of timestamps, with the state as the value.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>> ----------------------------------------
>>>> Date: Wed, 13 Apr 2011 18:12:58 -0700
>>>> Subject: Dynamic Data Sets
>>>> From: selekt86@yahoo.com
>>>> To: common-user@hadoop.apache.org
>>>>
>>>> I have a requirement where I have large sets of incoming data into a
>>>> system I own.
>>>>
>>>> A single unit of data in this set has a set of immutable attributes +
>>>> state attached to it. The state is dynamic and can change at any time.
>>>> What is the best way to run analytical queries on data of such nature?
>>>>
>>>> One way is to maintain this data in a separate store, take a snapshot
>>>> at a point in time, and then import it into HDFS for analysis using
>>>> Hadoop MapReduce. I do not see this approach scaling, since moving
>>>> data is obviously expensive.
>>>> If I were to directly maintain this data as Sequence Files in HDFS, how
>>>> would updates work?
>>>>
>>>> I am new to Hadoop/HDFS, so any suggestions/critique are welcome. I
>>>> know that HBase works around this problem through multi-version
>>>> concurrency control techniques. Is that the only option? Are there
>>>> any alternatives?
>>>>
>>>> Also note that all the aggregation and analysis I want to do is time
>>>> based, i.e. sum of x on pivot y over a day, 2 days, week, month, etc.
>>>> For such use cases, is it advisable to use HDFS directly or to use
>>>> systems built on top of Hadoop like Hive or HBase?
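As a follow-up to the "how many records are in each state as of date Y" aggregation described above: against the hypothetical "states" table from the earlier sketch, the current state of an entity is simply its newest timestamp-qualified column, so a per-state count as of a cutoff time can be computed by scanning that table. The client-side scan below is only a sketch of the logic (and assumes the same invented table and column-family names); on a table of any real size this would more realistically run as a MapReduce job over the table, but the per-row logic would be the same.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Counts entities per current state as of a cutoff time, by scanning the hypothetical
// "states" table from the writer sketch (column family "d", qualifier = timestamp of
// the state change, value = the state). Client-side scan, for illustration only.
public class StateCounter {

    private static final byte[] CF = Bytes.toBytes("d");

    // Returns state -> number of entities whose latest state change at or before
    // cutoffMillis left them in that state.
    public static Map<String, Long> countByStateAsOf(Connection conn, long cutoffMillis)
            throws IOException {
        Map<String, Long> counts = new HashMap<>();
        try (Table states = conn.getTable(TableName.valueOf("states"));
             ResultScanner scanner = states.getScanner(new Scan().addFamily(CF))) {
            for (Result row : scanner) {
                // Column qualifiers are timestamps; pick the newest one not after the cutoff.
                NavigableMap<byte[], byte[]> columns = row.getFamilyMap(CF);
                String current = null;
                long newest = Long.MIN_VALUE;
                for (Map.Entry<byte[], byte[]> col : columns.entrySet()) {
                    long ts = Bytes.toLong(col.getKey());
                    if (ts <= cutoffMillis && ts > newest) {
                        newest = ts;
                        current = Bytes.toString(col.getValue());
                    }
                }
                if (current != null) {
                    counts.merge(current, 1L, Long::sum);
                }
            }
        }
        return counts;
    }
}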