Subject: Re: Dynamic Data Sets
From: Sam Seigal
To: common-user@hadoop.apache.org
Cc: Michel Segel
Date: Thu, 14 Apr 2011 17:55:02 -0700

How does HBase compare to Hive when it comes to dynamic data sets? Does Hive support multi-version concurrency control? I am new to Hadoop, so I am trying to get an idea of how to evaluate these different technologies and provide concrete justifications for choosing one over the other.

Also, I am not interested in how a state changes over time.
I am only interested in what the current state of a data unit is, and in then aggregating it with other data in the same state over a time range (e.g. 5000 records exist in state A on April 14th, 2000 records exist in state B on April 13th, etc.). The analysis will vary depending on how the state changes over time.

On Thu, Apr 14, 2011 at 12:19 PM, Michel Segel wrote:
> Sorry,
> It appears to be a flock of us...
>
> Ok, bad pun...
>
> I didn't see Ted's response, but it looks like we're thinking along the same lines of thought.
> I was going to ask about that... but it's really a moot point. The size of the immutable data set doesn't really matter. The solution would be the same. Consider it some blob which is >= the size of a SHA-1 hash value. In fact, that could be your unique key.
>
> So you get your blob, timestamp and then state value. You hash the blob, store the blob in one table using the hash as the key value, and then store the state in a column with the timestamp as the column name and the hash value as the row key. Two separate tables, because if you stored them as separate column families you may have some performance issues due to a size difference between the column families.
>
> This would be a pretty straightforward solution in HBase.
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
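For concreteness, here is a rough sketch of what that two-table layout might look like with the HBase Java client. The table names ("entities", "states"), the column family, and the qualifiers are invented for illustration, and it uses the newer Connection/Table client API rather than the HTable API that was current when this thread was written; treat it as a sketch of the idea, not a drop-in implementation.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of the two-table layout described above:
//   "entities" -- row key = SHA-1 of the immutable attributes, one column holding the raw blob
//   "states"   -- row key = the same SHA-1, one column per state change, qualifier = timestamp
// Table names, the column family "d", and the qualifier "blob" are invented for illustration.
public class DynamicStateWriter {

    private static final byte[] CF = Bytes.toBytes("d");

    public static void recordState(Connection conn, byte[] immutableBlob,
                                   long timestampMillis, String state)
            throws IOException, NoSuchAlgorithmException {
        // Hash the immutable attributes; the digest becomes the row key in both tables.
        byte[] rowKey = MessageDigest.getInstance("SHA-1").digest(immutableBlob);

        // Table 1: the immutable blob, keyed by its hash (re-writing it is idempotent).
        try (Table entities = conn.getTable(TableName.valueOf("entities"))) {
            Put p = new Put(rowKey);
            p.addColumn(CF, Bytes.toBytes("blob"), immutableBlob);
            entities.put(p);
        }

        // Table 2: one column per state change, with the timestamp as the column qualifier.
        try (Table states = conn.getTable(TableName.valueOf("states"))) {
            Put p = new Put(rowKey);
            p.addColumn(CF, Bytes.toBytes(timestampMillis), Bytes.toBytes(state));
            states.put(p);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            byte[] blob = "some immutable attributes".getBytes(StandardCharsets.UTF_8);
            recordState(conn, blob, System.currentTimeMillis(), "A");
        }
    }
}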
> On Apr 14, 2011, at 12:18 PM, James Seigel Tynt wrote:
>
>> If all the Seigel/Seigal/Segel gang didn't chime in, it'd be weird.
>>
>> What size of data are we talking?
>>
>> James
>>
>> On 2011-04-14, at 11:06 AM, Michael Segel wrote:
>>
>>> James,
>>>
>>> If I understand you, you get a set of immutable attributes, then a state which can change.
>>>
>>> If you wanted to use HBase...
>>> I'd say create a unique identifier for your immutable attributes, then store the unique id, timestamp, and state, assuming that you're really interested in looking at the state change over time.
>>>
>>> So what you end up with is one table of immutable attributes, with a unique key, and then another table where you can use the same unique key and create columns with column names of timestamps, with the state as the value.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>> ----------------------------------------
>>>> Date: Wed, 13 Apr 2011 18:12:58 -0700
>>>> Subject: Dynamic Data Sets
>>>> From: selekt86@yahoo.com
>>>> To: common-user@hadoop.apache.org
>>>>
>>>> I have a requirement where I have large sets of incoming data into a
>>>> system I own.
>>>>
>>>> A single unit of data in this set has a set of immutable attributes +
>>>> state attached to it. The state is dynamic and can change at any time.
>>>> What is the best way to run analytical queries on data of such nature?
>>>>
>>>> One way is to maintain this data in a separate store, take a snapshot
>>>> at a point in time, and then import it into HDFS for analysis using
>>>> Hadoop MapReduce. I do not see this approach scaling, since moving
>>>> data is obviously expensive.
>>>> If I were to directly maintain this data as Sequence Files in HDFS, how
>>>> would updates work?
>>>>
>>>> I am new to Hadoop/HDFS, so any suggestions/critique are welcome. I
>>>> know that HBase works around this problem through multi-version
>>>> concurrency control techniques. Is that the only option? Are there
>>>> any alternatives?
>>>>
>>>> Also note that all the aggregation and analysis I want to do is time
>>>> based, i.e. sum of x on pivot y over a day, 2 days, week, month, etc.
>>>> For such use cases, is it advisable to use HDFS directly or to use
>>>> systems built on top of Hadoop like Hive or HBase?
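As a follow-up to the "how many records are in each state as of date Y" aggregation described above: against the hypothetical "states" table from the earlier sketch, the current state of an entity is simply its newest timestamp-qualified column, so a per-state count as of a cutoff time can be computed by scanning that table. The client-side scan below is only a sketch of the logic (and assumes the same invented table and column-family names); on a table of any real size this would more realistically run as a MapReduce job over the table, but the per-row logic would be the same.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Counts entities per current state as of a cutoff time, by scanning the hypothetical
// "states" table from the writer sketch (column family "d", qualifier = timestamp of
// the state change, value = the state). Client-side scan, for illustration only.
public class StateCounter {

    private static final byte[] CF = Bytes.toBytes("d");

    // Returns state -> number of entities whose latest state change at or before
    // cutoffMillis left them in that state.
    public static Map<String, Long> countByStateAsOf(Connection conn, long cutoffMillis)
            throws IOException {
        Map<String, Long> counts = new HashMap<>();
        try (Table states = conn.getTable(TableName.valueOf("states"));
             ResultScanner scanner = states.getScanner(new Scan().addFamily(CF))) {
            for (Result row : scanner) {
                // Column qualifiers are timestamps; pick the newest one not after the cutoff.
                NavigableMap<byte[], byte[]> columns = row.getFamilyMap(CF);
                String current = null;
                long newest = Long.MIN_VALUE;
                for (Map.Entry<byte[], byte[]> col : columns.entrySet()) {
                    long ts = Bytes.toLong(col.getKey());
                    if (ts <= cutoffMillis && ts > newest) {
                        newest = ts;
                        current = Bytes.toString(col.getValue());
                    }
                }
                if (current != null) {
                    counts.merge(current, 1L, Long::sum);
                }
            }
        }
        return counts;
    }
}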