Subject: Re: Dynamic Data Sets
From: Michel Segel
Reply-To: common-user@hadoop.apache.org
To: common-user@hadoop.apache.org
Date: Thu, 14 Apr 2011 14:19:22 -0500
In-Reply-To: <6B870DC5-B5EA-4F9A-BF77-B824C1361958@tynt.com>

Sorry, it appears to be a flock of us... OK, bad pun.

I didn't see Ted's response, but it looks like we're thinking along the same lines.

I was going to ask about that, but it's really a moot point. The size of the immutable data set doesn't really matter; the solution would be the same. Consider it some blob which is >= the size of a SHA-1 hash value. In fact, that hash could be your unique key.

So you get your blob, a timestamp, and then a state value. You hash the blob, store the blob in one table using the hash as the key, and then store the state in a second table, using the hash as the row key and the timestamp as the column name. Two separate tables, because if you stored them as two column families in one table you might see performance issues due to the size difference between the families.

This would be a pretty straightforward solution in HBase. (A rough sketch of the layout is at the bottom of this message.)

Sent from a remote device. Please excuse any typos...

Mike Segel

On Apr 14, 2011, at 12:18 PM, James Seigel Tynt wrote:

> If all the Seigel/Seigal/Segel gang didn't chime in, it'd be weird.
>
> What size of data are we talking?
>
> James
>
> On 2011-04-14, at 11:06 AM, Michael Segel wrote:
>
>> James,
>>
>> If I understand correctly, you get a set of immutable attributes, plus a state which can change.
>>
>> If you wanted to use HBase...
>> I'd say create a unique identifier for your immutable attributes, then store the unique id, timestamp, and state, assuming
>> that you're really interested in looking at the state change over time.
>>
>> So what you end up with is one table of immutable attributes with a unique key, and then another table where you use the same unique key and create columns whose names are timestamps, with the state as the value.
>>
>> HTH
>>
>> -Mike
>>
>> ----------------------------------------
>>> Date: Wed, 13 Apr 2011 18:12:58 -0700
>>> Subject: Dynamic Data Sets
>>> From: selekt86@yahoo.com
>>> To: common-user@hadoop.apache.org
>>>
>>> I have a requirement where I have large sets of incoming data into a
>>> system I own.
>>>
>>> A single unit of data in this set has a set of immutable attributes plus
>>> state attached to it. The state is dynamic and can change at any time.
>>> What is the best way to run analytical queries on data of such a nature?
>>>
>>> One way is to maintain this data in a separate store, take a point-in-time
>>> snapshot, and then import it into HDFS for analysis with Hadoop MapReduce.
>>> I do not see this approach scaling, since moving data is obviously expensive.
>>> If I were to maintain this data directly as SequenceFiles in HDFS, how
>>> would updates work?
>>>
>>> I am new to Hadoop/HDFS, so any suggestions/critique are welcome. I
>>> know that HBase works around this problem through multi-version
>>> concurrency control techniques. Is that the only option? Are there
>>> any alternatives?
>>>
>>> Also note that all the aggregation and analysis I want to do is time-based,
>>> i.e. sum of x on pivot y over a day, 2 days, a week, a month, etc. For such
>>> use cases, is it advisable to use HDFS directly, or to use systems built
>>> on top of Hadoop like Hive or HBase?
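
A rough sketch of the two-table layout described above, written against the 0.90-era HBase Java client. The table names ("immutable_blobs", "state_history"), the column families ("d", "s"), and the sample record are made up for illustration, and the code assumes both tables already exist with those families. Treat it as a sketch of the idea, not a definitive implementation.

import java.security.MessageDigest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DynamicDataSetSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Hypothetical tables: one holds the immutable blobs, the other
        // holds the state history. Both are keyed by the SHA-1 hash of
        // the blob, so a given blob is stored exactly once.
        HTable blobs = new HTable(conf, "immutable_blobs");
        HTable states = new HTable(conf, "state_history");

        byte[] blob = Bytes.toBytes("...serialized immutable attributes...");
        long timestamp = System.currentTimeMillis();
        byte[] state = Bytes.toBytes("ACTIVE");

        // Row key = SHA-1 hash of the immutable blob.
        byte[] rowKey = MessageDigest.getInstance("SHA-1").digest(blob);

        // Table 1: store the blob once, under its hash.
        Put storeBlob = new Put(rowKey);
        storeBlob.add(Bytes.toBytes("d"), Bytes.toBytes("blob"), blob);
        blobs.put(storeBlob);

        // Table 2: one column per state change; the column qualifier is
        // the timestamp, the cell value is the state at that time.
        Put storeState = new Put(rowKey);
        storeState.add(Bytes.toBytes("s"), Bytes.toBytes(Long.toString(timestamp)), state);
        states.put(storeState);

        blobs.close();
        states.close();
    }
}

Reading a single row of "state_history" then gives the full state timeline for one blob, which is what the time-based rollups in the original question would run over.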