Mailing list: user@hadoop.apache.org
From: anil gupta
Date: Fri, 30 Nov 2012 12:35:05 -0800
Subject: Re: Best practice for storage of data that changes
To: user@hbase.apache.org
Cc: goksron@gmail.com, jeff.pubmail@gmail.com

Hi Guys,

I posted our study on my blog:
http://bigdatanoob.blogspot.com/2012/11/hbase-vs-cassandra.html

We ended up choosing HBase because:

1. HBase provides range-based scans and ordered partitioning.
2. HBase is closely integrated with the Hadoop ecosystem.
3. HBase is strongly consistent, whereas Cassandra is eventually consistent.

As I said earlier in my email, the selection of a NoSQL solution depends on the use case. There are subtle differences between NoSQL solutions, and each of them has its own "sweet spot". So, pick yours after careful evaluation.

PS: Added the HBase mailing list as well, since this is more about HBase.

Hope this helps,
Anil Gupta

On Thu, Nov 29, 2012 at 8:51 PM, Lance Norskog wrote:

> Please! There are lots of blogs etc. about the two, but very few
> head-to-head comparisons for a real use case.
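[Editor's note: a minimal toy sketch in plain Python, not the HBase API, of why point 1 above matters: because row keys are stored in sorted order, a range scan only touches the contiguous slice between a start and stop key. The row keys and values here are made up for illustration.]

```python
import bisect

# Toy model of an HBase-style table: row keys kept in sorted order,
# so a range scan touches only the contiguous slice [start_row, stop_row).
rows = sorted({
    "login#2012-11-30": {"user": "100"},
    "user#100": {"email": "a@example.com"},
    "user#205": {"email": "b@example.com"},
    "user#307": {"email": "c@example.com"},
}.items())
keys = [k for k, _ in rows]

def range_scan(start_row, stop_row):
    """Return all (key, value) pairs with start_row <= key < stop_row."""
    lo = bisect.bisect_left(keys, start_row)
    hi = bisect.bisect_left(keys, stop_row)
    return rows[lo:hi]

# Scan only the "user#" keyspace; sorted keys make this O(log n + k)
# instead of a full-table scan.
print([k for k, _ in range_scan("user#", "user$")])
```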
>
> ------------------------------
>
> *From: *"anil gupta"
> *To: *"common-user@hadoop.apache.org"
> *Sent: *Wednesday, November 28, 2012 11:01:55 AM
> *Subject: *Re: Best practice for storage of data that changes
>
>
> Hi Jeff,
>
> At my workplace, Intuit, we did a detailed study to evaluate HBase and
> Cassandra for our use case. I will see if I can post the comparative study
> on my public blog or on this mailing list.
>
> BTW, what is your use case? What bottlenecks are you hitting with your
> current solution? If you can share some details, the HBase community will
> try to help you out.
>
> Thanks,
> Anil Gupta
>
>
> On Wed, Nov 28, 2012 at 9:55 AM, jeff l wrote:
>
>> Hi,
>>
>> I have quite a bit of experience with RDBMSs (Oracle, Postgres, MySQL)
>> and MongoDB, but don't feel any are quite right for this problem. The
>> amount of data being stored and the access requirements just don't match
>> up well.
>>
>> I was hoping to keep the stack as simple as possible and just use HDFS,
>> but everything I was seeing kept pointing to the need for some other
>> datastore. I'll check out both HBase and Cassandra.
>>
>> Thanks for the feedback.
>>
>>
>> On Sun, Nov 25, 2012 at 1:11 PM, anil gupta wrote:
>>
>>> Hi Jeff,
>>>
>>> My two cents below:
>>>
>>> 1st use case: Append-only data - e.g. weblogs or user logins
>>> As others have already mentioned, Hadoop is well suited to storing
>>> append-only data. If you want to do analysis of weblogs or user logins,
>>> then Hadoop is a suitable solution.
>>>
>>>
>>> 2nd use case: Account/User data
>>> First of all, I would suggest you take a close look at your use case and
>>> analyze whether it really needs a NoSQL solution or not.
>>> You were talking about maintaining user data in NoSQL. Why NoSQL instead
>>> of an RDBMS? What is the size of the data? Which NoSQL features are the
>>> selling points for you?
>>>
>>> For real-time reads and writes, you can have a look at Cassandra or HBase.
>>> But I would suggest you take a very close look at both of them, because
>>> each has its own advantages. So the choice will depend on your use case.
>>>
>>> One added advantage of HBase is that it has deeper integration with the
>>> Hadoop ecosystem, so you can do a lot of work on HBase data using Hadoop
>>> tools. HBase has integration with Hive for querying, but AFAIK it has
>>> some limitations.
>>>
>>> HTH,
>>> Anil Gupta
>>>
>>>
>>> On Sun, Nov 25, 2012 at 4:52 AM, Mahesh Balija <
>>> balijamahesh.mca@gmail.com> wrote:
>>>
>>>> Hi Jeff,
>>>>
>>>> As the HDFS paradigm is "write once, read many", you cannot
>>>> update files in place on HDFS.
>>>> But for your problem, what you can do is keep the
>>>> logs/user data in HDFS with different timestamps.
>>>> Then run MapReduce jobs at certain intervals to extract the
>>>> required data from those logs and put it into HBase/Cassandra/MongoDB.
>>>>
>>>> MongoDB read performance is quite fast, and it also supports
>>>> ad-hoc querying. You can also use the Hadoop-MongoDB connector to
>>>> read/write MongoDB data through Hadoop MapReduce.
>>>>
>>>> If you specifically need to update HDFS files directly, then
>>>> you would have to use a commercial Hadoop distribution such as MapR,
>>>> which supports updating HDFS files.
>>>>
>>>> Best,
>>>> Mahesh Balija,
>>>> Calsoft Labs.
>>>>
>>>>
>>>>
>>>> On Sun, Nov 25, 2012 at 9:40 AM, bharath vissapragada <
>>>> bharathvissapragada1990@gmail.com> wrote:
>>>>
>>>>> Hi Jeff,
>>>>>
>>>>> Please look at [1]. You can store your data in HBase tables and query
>>>>> them normally just by mapping them to Hive tables. Regarding Cassandra
>>>>> support, please follow JIRA [2]; it's not yet in trunk, I suppose!
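[Editor's note: a toy sketch of the periodic extraction step Mahesh describes above - scan timestamped log lines (as they might sit in HDFS) and reduce them to per-user records ready to be written into a mutable store such as HBase. This is plain Python standing in for a MapReduce job, and the log format is invented for illustration.]

```python
from collections import defaultdict

# Hypothetical timestamped login logs, append-only, as stored in HDFS.
log_lines = [
    "2012-11-25T09:00:00 login user=jeff",
    "2012-11-25T09:05:00 login user=anil",
    "2012-11-26T10:00:00 login user=jeff",
]

def map_phase(lines):
    # Emit (user, timestamp) pairs, like a mapper would.
    for line in lines:
        ts, _event, kv = line.split()
        yield kv.split("=")[1], ts

def reduce_phase(pairs):
    # Keep the latest login per user, like a reducer would; the result is
    # what you would load into HBase/Cassandra/MongoDB.
    latest = defaultdict(str)
    for user, ts in pairs:
        latest[user] = max(latest[user], ts)
    return dict(latest)

print(reduce_phase(map_phase(log_lines)))
```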
>>>>>
>>>>> [1] https://cwiki.apache.org/Hive/hbaseintegration.html
>>>>> [2] https://issues.apache.org/jira/browse/HIVE-1434
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>> On Sun, Nov 25, 2012 at 2:26 AM, jeff l wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I'm coming from the RDBMS world and am looking at HDFS for long-term
>>>>>> data storage and analysis.
>>>>>>
>>>>>> I've done some research and set up some smallish HDFS clusters with
>>>>>> Hive for testing, but I'm having a little trouble understanding how
>>>>>> everything fits together and was hoping someone could point me in the
>>>>>> right direction.
>>>>>>
>>>>>> I'm looking at storing two types of data:
>>>>>>
>>>>>> 1. Append-only data - e.g. weblogs or user logins
>>>>>> 2. Account/User data
>>>>>>
>>>>>> HDFS seems to be perfect for append-only data like #1, but I'm having
>>>>>> trouble figuring out what to do with data that may change frequently.
>>>>>>
>>>>>> A simple example would be user data, where various bits of
>>>>>> information (email, etc.) may change from day to day. Would HBase or
>>>>>> Cassandra be the better way to go for this type of data, and can I
>>>>>> overlay Hive over all of them (HDFS, HBase, Cassandra) so that I can
>>>>>> query the data through a single interface?
>>>>>>
>>>>>> Thanks in advance for any help.
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Bharath .V
>>>>> w: http://researchweb.iiit.ac.in/~bharath.v
>>>
>>> --
>>> Thanks & Regards,
>>> Anil Gupta
>
> --
> Thanks & Regards,
> Anil Gupta

--
Thanks & Regards,
Anil Gupta
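[Editor's note: a minimal sketch, in plain Python rather than the HBase client API, of the storage model that makes HBase/Cassandra a fit for frequently changing data like Jeff's user records: each (row, column) cell holds timestamped versions, a put appends a new version, and a read returns the latest one, so nothing needs to be rewritten in place. All names here are illustrative.]

```python
import time
from collections import defaultdict

class VersionedTable:
    """Toy model of an HBase-style table with versioned cells."""

    def __init__(self):
        # (row, column) -> list of (timestamp, value) versions
        self._cells = defaultdict(list)

    def put(self, row, col, value, ts=None):
        # A "put" only appends a new version; old data is never rewritten.
        self._cells[(row, col)].append(
            (ts if ts is not None else time.time(), value)
        )

    def get(self, row, col):
        # A read returns the value with the highest timestamp, if any.
        versions = self._cells[(row, col)]
        return max(versions)[1] if versions else None

table = VersionedTable()
table.put("user#jeff", "info:email", "old@example.com", ts=1)
table.put("user#jeff", "info:email", "new@example.com", ts=2)
print(table.get("user#jeff", "info:email"))  # latest version wins
```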
