Date: Wed, 28 Nov 2012 09:55:55 -0800
Subject: Re: Best practice for storage of data that changes
From: jeff l
To: user@hadoop.apache.org

Hi,

I have quite a bit of experience with RDBMSs (Oracle, Postgres, MySQL) and MongoDB, but none of them feel quite right for this problem. The amount of data being stored and the access requirements just don't match up well.

I was hoping to keep the stack as simple as possible and just use HDFS, but everything I read kept pointing to the need for some other datastore. I'll check out both HBase and Cassandra.

Thanks for the feedback.

On Sun, Nov 25, 2012 at 1:11 PM, anil gupta wrote:
> Hi Jeff,
>
> My two cents below:
>
> 1st use case: append-only data, e.g. weblogs or user logins.
> As others have already mentioned, Hadoop is well suited to storing
> append-only data. If you want to analyze weblogs or user logins,
> Hadoop is a good fit.
>
> 2nd use case: account/user data.
> First of all, I would suggest you look closely at your use case and
> analyze whether it really needs a NoSQL solution. You mention keeping
> user data in NoSQL: why NoSQL instead of an RDBMS? What is the size of
> the data? Which NoSQL features are the selling points for you?
>
> For real-time reads and writes you can look at Cassandra or HBase.
> But I would suggest you take a very close look at both of them,
> because each has its own advantages.
> So the choice will depend on your use case.
>
> One added advantage of HBase is that it has deeper integration with
> the Hadoop ecosystem, so you can do a lot with HBase data using Hadoop
> tools. HBase also integrates with Hive for querying, but AFAIK that
> has some limitations.
>
> HTH,
> Anil Gupta
>
> On Sun, Nov 25, 2012 at 4:52 AM, Mahesh Balija wrote:
>
>> Hi Jeff,
>>
>> Since the HDFS paradigm is "write once, read many", you cannot
>> update files on HDFS in place. But for your problem, what you can do
>> is keep the logs/user data in HDFS with different timestamps, then
>> run MapReduce jobs at certain intervals to extract the required data
>> from those logs and put it into HBase/Cassandra/MongoDB.
>>
>> MongoDB read performance is quite fast, and it supports ad-hoc
>> querying. You can also use the Hadoop-MongoDB connector to
>> read/write data to MongoDB through Hadoop MapReduce.
>>
>> If you specifically need to update HDFS files in place, then you
>> would have to use a commercial Hadoop distribution like MapR, which
>> supports updating files.
>>
>> Best,
>> Mahesh Balija,
>> Calsoft Labs.
>>
>> On Sun, Nov 25, 2012 at 9:40 AM, bharath vissapragada <
>> bharathvissapragada1990@gmail.com> wrote:
>>
>>> Hi Jeff,
>>>
>>> Please look at [1]. You can store your data in HBase tables and
>>> query them normally just by mapping them to Hive tables. Regarding
>>> Cassandra support, please follow JIRA [2]; it's not yet in trunk,
>>> I believe.
>>>
>>> [1] https://cwiki.apache.org/Hive/hbaseintegration.html
>>> [2] https://issues.apache.org/jira/browse/HIVE-1434
>>>
>>> Thanks,
>>>
>>> On Sun, Nov 25, 2012 at 2:26 AM, jeff l wrote:
>>>
>>>> Hi All,
>>>>
>>>> I'm coming from the RDBMS world and am looking at HDFS for
>>>> long-term data storage and analysis.
>>>>
>>>> I've done some research and set up some smallish HDFS clusters
>>>> with Hive for testing, but I'm having a little trouble
>>>> understanding how everything fits together and was hoping someone
>>>> could point me in the right direction.
>>>>
>>>> I'm looking at storing two types of data:
>>>>
>>>> 1. Append-only data - e.g. weblogs or user logins
>>>> 2. Account/user data
>>>>
>>>> HDFS seems to be perfect for append-only data like #1, but I'm
>>>> having trouble figuring out what to do with data that may change
>>>> frequently.
>>>>
>>>> A simple example would be user data where various bits of
>>>> information (email, etc.) may change from day to day. Would HBase
>>>> or Cassandra be the better way to go for this type of data, and
>>>> can I overlay Hive over all of them (HDFS, HBase, Cassandra) so
>>>> that I can query the data through a single interface?
>>>>
>>>> Thanks in advance for any help.
>>>
>>> --
>>> Regards,
>>> Bharath .V
>>> w: http://researchweb.iiit.ac.in/~bharath.v
>
> --
> Thanks & Regards,
> Anil Gupta
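For reference, the Hive-over-HBase mapping discussed in [1] can be sketched roughly as below. This is a minimal illustration only: the table name, column families, and columns are made-up assumptions, not anything from the thread, and the exact properties should be checked against the Hive HBase integration wiki for your version.

```sql
-- Hive table backed by an existing HBase table, so frequently-changing
-- user data kept in HBase can still be queried through Hive.
-- All names here (users, info:email, stats:login_count) are illustrative.
CREATE EXTERNAL TABLE users (
  user_id     STRING,
  email       STRING,
  login_count BIGINT
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  -- map the HBase row key and two columns to the Hive columns above
  "hbase.columns.mapping" = ":key,info:email,stats:login_count"
)
TBLPROPERTIES ("hbase.table.name" = "users");

-- Then query it like any other Hive table:
-- SELECT user_id, email FROM users WHERE login_count > 100;
```

With a mapping like this, append-only data can stay in plain HDFS-backed Hive tables while mutable account data lives in HBase, and both are queryable through the single Hive interface the original question asked about.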
