hadoop-common-user mailing list archives

From Sam Seigal <selek...@yahoo.com>
Subject Re: Dynamic Data Sets
Date Fri, 15 Apr 2011 00:55:02 GMT
How does HBase compare to Hive when it comes to dynamic data sets?
Does Hive support multi-version concurrency control? I am new to
Hadoop, hence trying to get an idea of how to evaluate these different
technologies and provide concrete justifications for why to choose one
over the other.

Also, I am not interested in how a state changes over time. I am only
interested in what the current state of a data unit is, and then
aggregating it with other data in the same state over a time range
(e.g. 5000 records exist in state A on April 14th, 2000 records exist
in state B on April 13th, etc.). The analysis will vary depending on
how the state changes over time.
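
The snapshot-style counts described above (records per state as of a given date) can be sketched in plain Python. This is only an illustration of the logic; the field names and in-memory event list are hypothetical, and in practice this would be a Hive query or an HBase scan:

```python
from collections import defaultdict

# Each event records a unit's state change: (unit_id, timestamp, state).
# The "current" state of a unit as of some date is its latest event at
# or before that date. Names and sample data are illustrative only.
events = [
    ("u1", "2011-04-12", "A"),
    ("u1", "2011-04-14", "B"),
    ("u2", "2011-04-13", "A"),
    ("u3", "2011-04-14", "A"),
]

def counts_by_state(events, as_of):
    latest = {}  # unit_id -> (timestamp, state), latest event per unit
    for unit, ts, state in events:
        if ts <= as_of and (unit not in latest or ts > latest[unit][0]):
            latest[unit] = (ts, state)
    counts = defaultdict(int)
    for _ts, state in latest.values():
        counts[state] += 1
    return dict(counts)

print(counts_by_state(events, "2011-04-14"))  # {'B': 1, 'A': 2}
```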


On Thu, Apr 14, 2011 at 12:19 PM, Michel Segel
<michael_segel@hotmail.com> wrote:
> Sorry,
> It appears to be a flock of us...
>
> Ok bad pun...
>
> I didn't see Ted's response, but it looks like we're thinking along the same lines.
> I was going to ask about that... But it's really a moot point. The size of the
> immutable data set doesn't really matter. The solution would be the same. Consider
> it some blob which is >= the size of a SHA-1 hash value. In fact, that could be
> your unique key.
>
> So you get your blob, timestamp, and then state value. You hash the blob, store
> the blob in one table using the hash as the key value, and then store the state
> in a second table with the hash value as the row key and the timestamp as the
> column name. Two separate tables, because if you stored them as separate column
> families you may have some performance issues due to a size difference between
> the column families.
>
> This would be a pretty straightforward solution in HBase.
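
The two-table layout described above could be modeled roughly as follows. This is a sketch only: plain dicts stand in for the two HBase tables, and the function names are made up for illustration.

```python
import hashlib

# Sketch of the two-table design: the immutable blob is stored once,
# keyed by its SHA-1 hash; states live in a second table where the row
# key is the same hash and each column name is a timestamp.
blob_table = {}    # hash -> blob (immutable attributes)
state_table = {}   # hash -> {timestamp: state}

def record(blob, timestamp, state):
    key = hashlib.sha1(blob).hexdigest()
    blob_table[key] = blob                            # idempotent: same blob, same key
    state_table.setdefault(key, {})[timestamp] = state
    return key

def current_state(key):
    # The latest timestamp column wins; ISO timestamps sort lexically.
    history = state_table[key]
    return history[max(history)]

key = record(b"immutable attrs of unit 1", "2011-04-13T10:00", "A")
record(b"immutable attrs of unit 1", "2011-04-14T09:00", "B")
print(current_state(key))  # B
```

Storing the blob once under its hash means repeated state updates never duplicate the immutable attributes, which is the point of splitting the two tables.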
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Apr 14, 2011, at 12:18 PM, James Seigel Tynt <james@tynt.com> wrote:
>
>> If all the Seigel/Seigal/Segel gang didn't chime in, it'd be weird.
>>
>> What size of data are we talking?
>>
>> James
>>
>> On 2011-04-14, at 11:06 AM, Michael Segel <michael_segel@hotmail.com> wrote:
>>
>>>
>>> James,
>>>
>>>
>>> If I understand correctly, you get a set of immutable attributes, plus a state
>>> which can change.
>>>
>>> If you wanted to use HBase...
>>> I'd say create a unique identifier for your immutable attributes, then store
>>> the unique id, timestamp, and state, assuming that you're really interested in
>>> looking at the state change over time.
>>>
>>> So what you end up with is one table of immutable attributes with a unique key,
>>> and then another table where you can use the same unique key and create columns
>>> whose column names are timestamps, with the state as the value.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>>
>>> ----------------------------------------
>>>> Date: Wed, 13 Apr 2011 18:12:58 -0700
>>>> Subject: Dynamic Data Sets
>>>> From: selekt86@yahoo.com
>>>> To: common-user@hadoop.apache.org
>>>>
>>>> I have a requirement where I have large sets of incoming data into a
>>>> system I own.
>>>>
>>>> A single unit of data in this set has a set of immutable attributes +
>>>> state attached to it. The state is dynamic and can change at any time.
>>>> What is the best way to run analytical queries on data of such nature?
>>>>
>>>> One way is to maintain this data in a separate store, take a snapshot
>>>> in point of time, and then import into the HDFS filesystem for
>>>> analysis using Hadoop Map-Reduce. I do not see this approach scaling,
>>>> since moving data is obviously expensive.
>>>> If I were to directly maintain this data as SequenceFiles in HDFS,
>>>> how would updates work?
>>>>
>>>> I am new to Hadoop/HDFS, so any suggestions/critique are welcome. I
>>>> know that HBase works around this problem through multi-version
>>>> concurrency control techniques. Is that the only option? Are there
>>>> any alternatives?
>>>>
>>>> Also note that all the aggregation and analysis I want to do is
>>>> time-based, i.e. sum of x on pivot y over a day, 2 days, a week, a
>>>> month, etc. For such use cases, is it advisable to use HDFS directly,
>>>> or to use systems built on top of Hadoop like Hive or HBase?
>>>
>>
>
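The time-based rollups in the original question (sum of x on pivot y over a day, week, month) amount to bucketing records by a time window before aggregating. A minimal sketch, assuming records of (date, pivot, x) with illustrative names and sample data:

```python
from collections import defaultdict
from datetime import date

# Sum of x per pivot y, bucketed by an arbitrary time window.
# The record layout (date, pivot, x) is hypothetical.
records = [
    (date(2011, 4, 13), "y1", 5),
    (date(2011, 4, 13), "y2", 2),
    (date(2011, 4, 14), "y1", 3),
    (date(2011, 4, 20), "y1", 7),
]

def rollup(records, bucket):
    totals = defaultdict(int)  # (bucket, pivot) -> sum of x
    for d, pivot, x in records:
        totals[(bucket(d), pivot)] += x
    return dict(totals)

by_day = rollup(records, lambda d: d.isoformat())
by_week = rollup(records, lambda d: tuple(d.isocalendar()[:2]))  # (ISO year, week)

print(by_day[("2011-04-13", "y1")])  # 5
```

The same pattern maps directly onto a MapReduce job or a Hive GROUP BY, with the bucket function becoming the grouping key.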
