Mailing list: user@hadoop.apache.org
From: anil gupta
Date: Fri, 30 Nov 2012 12:35:05 -0800
Subject: Re: Best practice for storage of data that changes
To: user@hbase.apache.org
Cc: goksron@gmail.com, jeff.pubmail@gmail.com

Hi Guys,

I posted our study on my blog:
http://bigdatanoob.blogspot.com/2012/11/hbase-vs-cassandra.html

We ended up choosing HBase because:

1. HBase provides range-based scans and ordered partitioning.
2. HBase is closely integrated with the Hadoop ecosystem.
3. HBase is strongly consistent, whereas Cassandra is eventually consistent.

As I said earlier in my email, the selection of a NoSQL solution depends on the use case. There are subtle differences between NoSQL solutions, and each of them has its own "sweet spot". So, pick yours after careful evaluation.

PS: Added the HBase mailing list as well, since this is more about HBase.

Hope this helps,
Anil Gupta

On Thu, Nov 29, 2012 at 8:51 PM, Lance Norskog wrote:

> Please! There are lots of blogs etc. about the two, but very few
> head-to-head comparisons for a real use case.
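[Editor's note: a minimal toy sketch in plain Python, not the HBase API, of why point 1 above matters: because row keys are stored in sorted order, a range scan only touches the contiguous slice between a start and stop key. The row keys and values here are made up for illustration.]

```python
import bisect

# Toy model of an HBase-style table: row keys kept in sorted order,
# so a range scan touches only the contiguous slice [start_row, stop_row).
rows = sorted({
    "login#2012-11-30": {"user": "100"},
    "user#100": {"email": "a@example.com"},
    "user#205": {"email": "b@example.com"},
    "user#307": {"email": "c@example.com"},
}.items())
keys = [k for k, _ in rows]

def range_scan(start_row, stop_row):
    """Return all (key, value) pairs with start_row <= key < stop_row."""
    lo = bisect.bisect_left(keys, start_row)
    hi = bisect.bisect_left(keys, stop_row)
    return rows[lo:hi]

# Scan only the "user#" keyspace; sorted keys make this O(log n + k)
# instead of a full-table scan.
print([k for k, _ in range_scan("user#", "user$")])
```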
>
> ------------------------------
>
> *From: *"anil gupta"
> *To: *"common-user@hadoop.apache.org"
> *Sent: *Wednesday, November 28, 2012 11:01:55 AM
> *Subject: *Re: Best practice for storage of data that changes
>
>
> Hi Jeff,
>
> At my workplace, Intuit, we did a detailed study to evaluate HBase and
> Cassandra for our use case. I will see if I can post the comparative study
> on my public blog or on this mailing list.
>
> BTW, what is your use case? What bottlenecks are you hitting with your
> current solution? If you can share some details, the HBase community will
> try to help you out.
>
> Thanks,
> Anil Gupta
>
>
> On Wed, Nov 28, 2012 at 9:55 AM, jeff l wrote:
>
>> Hi,
>>
>> I have quite a bit of experience with RDBMSs (Oracle, Postgres, MySQL)
>> and MongoDB, but don't feel any are quite right for this problem. The
>> amount of data being stored and the access requirements just don't match
>> up well.
>>
>> I was hoping to keep the stack as simple as possible and just use HDFS,
>> but everything I was seeing kept pointing to the need for some other
>> datastore. I'll check out both HBase and Cassandra.
>>
>> Thanks for the feedback.
>>
>>
>> On Sun, Nov 25, 2012 at 1:11 PM, anil gupta wrote:
>>
>>> Hi Jeff,
>>>
>>> My two cents below:
>>>
>>> 1st use case: Append-only data - e.g. weblogs or user logins
>>> As others have already mentioned, Hadoop is well suited to storing
>>> append-only data. If you want to do analysis of weblogs or user logins,
>>> then Hadoop is a suitable solution.
>>>
>>>
>>> 2nd use case: Account/User data
>>> First of all, I would suggest you take a close look at your use case and
>>> analyze whether it really needs a NoSQL solution or not.
>>> You were talking about maintaining user data in NoSQL. Why NoSQL instead
>>> of an RDBMS? What is the size of the data? Which NoSQL features are the
>>> selling points for you?
>>>
>>> For real-time reads and writes, you can have a look at Cassandra or HBase.
>>> But I would suggest you take a very close look at both of them, because
>>> each has its own advantages. So the choice will depend on your use case.
>>>
>>> One added advantage of HBase is that it has deeper integration with the
>>> Hadoop ecosystem, so you can do a lot of work on HBase data using Hadoop
>>> tools. HBase has integration with Hive for querying, but AFAIK it has
>>> some limitations.
>>>
>>> HTH,
>>> Anil Gupta
>>>
>>>
>>> On Sun, Nov 25, 2012 at 4:52 AM, Mahesh Balija <
>>> balijamahesh.mca@gmail.com> wrote:
>>>
>>>> Hi Jeff,
>>>>
>>>> As the HDFS paradigm is "write once, read many", you cannot
>>>> update files in place on HDFS.
>>>> But for your problem, what you can do is keep the
>>>> logs/user data in HDFS with different timestamps.
>>>> Then run MapReduce jobs at certain intervals to extract the
>>>> required data from those logs and put it into HBase/Cassandra/MongoDB.
>>>>
>>>> MongoDB read performance is quite fast, and it also supports
>>>> ad-hoc querying. You can also use the Hadoop-MongoDB connector to
>>>> read/write MongoDB data through Hadoop MapReduce.
>>>>
>>>> If you specifically need to update HDFS files directly, then
>>>> you would have to use a commercial Hadoop distribution such as MapR,
>>>> which supports updating HDFS files.
>>>>
>>>> Best,
>>>> Mahesh Balija,
>>>> Calsoft Labs.
>>>>
>>>>
>>>>
>>>> On Sun, Nov 25, 2012 at 9:40 AM, bharath vissapragada <
>>>> bharathvissapragada1990@gmail.com> wrote:
>>>>
>>>>> Hi Jeff,
>>>>>
>>>>> Please look at [1]. You can store your data in HBase tables and query
>>>>> them normally just by mapping them to Hive tables. Regarding Cassandra
>>>>> support, please follow JIRA [2]; it's not yet in trunk, I suppose!
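[Editor's note: a toy sketch of the periodic extraction step Mahesh describes above - scan timestamped log lines (as they might sit in HDFS) and reduce them to per-user records ready to be written into a mutable store such as HBase. This is plain Python standing in for a MapReduce job, and the log format is invented for illustration.]

```python
from collections import defaultdict

# Hypothetical timestamped login logs, append-only, as stored in HDFS.
log_lines = [
    "2012-11-25T09:00:00 login user=jeff",
    "2012-11-25T09:05:00 login user=anil",
    "2012-11-26T10:00:00 login user=jeff",
]

def map_phase(lines):
    # Emit (user, timestamp) pairs, like a mapper would.
    for line in lines:
        ts, _event, kv = line.split()
        yield kv.split("=")[1], ts

def reduce_phase(pairs):
    # Keep the latest login per user, like a reducer would; the result is
    # what you would load into HBase/Cassandra/MongoDB.
    latest = defaultdict(str)
    for user, ts in pairs:
        latest[user] = max(latest[user], ts)
    return dict(latest)

print(reduce_phase(map_phase(log_lines)))
```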
>>>>>
>>>>> [1] https://cwiki.apache.org/Hive/hbaseintegration.html
>>>>> [2] https://issues.apache.org/jira/browse/HIVE-1434
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>> On Sun, Nov 25, 2012 at 2:26 AM, jeff l wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I'm coming from the RDBMS world and am looking at HDFS for long-term
>>>>>> data storage and analysis.
>>>>>>
>>>>>> I've done some research and set up some smallish HDFS clusters with
>>>>>> Hive for testing, but I'm having a little trouble understanding how
>>>>>> everything fits together and was hoping someone could point me in the
>>>>>> right direction.
>>>>>>
>>>>>> I'm looking at storing two types of data:
>>>>>>
>>>>>> 1. Append-only data - e.g. weblogs or user logins
>>>>>> 2. Account/User data
>>>>>>
>>>>>> HDFS seems to be perfect for append-only data like #1, but I'm having
>>>>>> trouble figuring out what to do with data that may change frequently.
>>>>>>
>>>>>> A simple example would be user data, where various bits of
>>>>>> information (email, etc.) may change from day to day. Would HBase or
>>>>>> Cassandra be the better way to go for this type of data, and can I
>>>>>> overlay Hive over all of them (HDFS, HBase, Cassandra) so that I can
>>>>>> query the data through a single interface?
>>>>>>
>>>>>> Thanks in advance for any help.
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Bharath .V
>>>>> w: http://researchweb.iiit.ac.in/~bharath.v
>>>
>>> --
>>> Thanks & Regards,
>>> Anil Gupta
>
> --
> Thanks & Regards,
> Anil Gupta

--
Thanks & Regards,
Anil Gupta
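[Editor's note: a minimal sketch, in plain Python rather than the HBase client API, of the storage model that makes HBase/Cassandra a fit for frequently changing data like Jeff's user records: each (row, column) cell holds timestamped versions, a put appends a new version, and a read returns the latest one, so nothing needs to be rewritten in place. All names here are illustrative.]

```python
import time
from collections import defaultdict

class VersionedTable:
    """Toy model of an HBase-style table with versioned cells."""

    def __init__(self):
        # (row, column) -> list of (timestamp, value) versions
        self._cells = defaultdict(list)

    def put(self, row, col, value, ts=None):
        # A "put" only appends a new version; old data is never rewritten.
        self._cells[(row, col)].append(
            (ts if ts is not None else time.time(), value)
        )

    def get(self, row, col):
        # A read returns the value with the highest timestamp, if any.
        versions = self._cells[(row, col)]
        return max(versions)[1] if versions else None

table = VersionedTable()
table.put("user#jeff", "info:email", "old@example.com", ts=1)
table.put("user#jeff", "info:email", "new@example.com", ts=2)
print(table.get("user#jeff", "info:email"))  # latest version wins
```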
