hadoop-common-user mailing list archives

From anil gupta <anilgupt...@gmail.com>
Subject Re: Best practice for storage of data that changes
Date Fri, 30 Nov 2012 20:35:05 GMT
Hi Guys,

I posted our study on my blog:
http://bigdatanoob.blogspot.com/2012/11/hbase-vs-cassandra.html

We ended up choosing HBase because:
1. HBase provides range-based scans and ordered partitioning.
2. HBase is closely integrated with the Hadoop ecosystem.
3. HBase is strongly consistent, whereas Cassandra is eventually
consistent by default (its consistency level is tunable per request).
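
To illustrate point 1, here is a minimal Python sketch of why keeping row
keys in sorted order makes range scans cheap. It is a toy in-memory model,
not HBase client code: the SortedKeyStore class, its put/scan methods, and
the "user#date" key layout are all assumptions made up for this example.

```python
import bisect


class SortedKeyStore:
    """Toy stand-in for an ordered key-value store; not the HBase API."""

    def __init__(self):
        self._keys = []   # row keys, always kept in sorted order
        self._rows = {}   # row key -> value

    def put(self, key, value):
        # Insert the key at its sorted position the first time it is seen.
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def scan(self, start, stop):
        """Return (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


store = SortedKeyStore()
store.put("user42#2012-11-29", "login")
store.put("user42#2012-11-30", "email changed")
store.put("user07#2012-11-30", "login")
store.put("user43#2012-11-28", "login")

# Keys sharing the "user42#" prefix are adjacent in sort order, so one
# contiguous scan returns them all. "$" is the character right after "#".
user42_rows = store.scan("user42#", "user42$")
```

With hash-based partitioning, the same prefix query would generally have to
touch every partition, because neighbouring keys are scattered.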

As I said earlier in this thread, the choice of a NoSQL solution depends on
the use case. There are subtle differences between NoSQL solutions, and each
of them has its own "sweet spot". So pick yours after careful
evaluation.

PS: I have also added the HBase mailing list, since this is more about HBase.

Hope This Helps,
Anil Gupta


On Thu, Nov 29, 2012 at 8:51 PM, Lance Norskog <goksron@gmail.com> wrote:

> Please! There are lots of blogs etc. about the two, but very few
> head-to-head for a real use case.
>
> ------------------------------
>
> *From: *"anil gupta" <anilgupta84@gmail.com>
> *To: *"common-user@hadoop.apache.org" <user@hadoop.apache.org>
> *Sent: *Wednesday, November 28, 2012 11:01:55 AM
> *Subject: *Re: Best practice for storage of data that changes
>
>
> Hi Jeff,
>
> At my workplace, Intuit, we did a detailed study to evaluate HBase and
> Cassandra for our use case. I will see if I can post the comparative study
> on my public blog or on this mailing list.
>
> BTW, what is your use case? What bottlenecks are you hitting with your
> current solutions? If you can share some details, the HBase community will
> try to help you out.
>
> Thanks,
> Anil Gupta
>
>
> On Wed, Nov 28, 2012 at 9:55 AM, jeff l <jeff.pubmail@gmail.com> wrote:
>
>> Hi,
>>
>> I have quite a bit of experience with RDBMSs ( Oracle, Postgres, MySQL )
>> and MongoDB but don't feel any are quite right for this problem.  The
>> amount of data being stored and access requirements just don't match up
>> well.
>>
>> I was hoping to keep the stack as simple as possible and just use hdfs
>> but everything I was seeing kept pointing to the need for some other
>> datastore.  I'll check out both HBase and Cassandra.
>>
>> Thanks for the feedback.
>>
>>
>> On Sun, Nov 25, 2012 at 1:11 PM, anil gupta <anilgupta84@gmail.com> wrote:
>>
>>> Hi Jeff,
>>>
>>> My two cents below:
>>>
>>> 1st use case: Append-only data - e.g. weblogs or user logins
>>> As others have already mentioned, Hadoop is well suited to storing
>>> append-only data. If you want to analyze weblogs or user logins, then
>>> Hadoop is a suitable solution for it.
>>>
>>>
>>> 2nd use case: Account/User data
>>> First of all, I would suggest you look at your use case and
>>> analyze whether it really needs a NoSQL solution.
>>> You were talking about maintaining user data in NoSQL. Why NoSQL
>>> instead of an RDBMS? What is the size of the data? Which NoSQL features
>>> are the selling points for you?
>>>
>>> For real-time reads and writes you can have a look at Cassandra or HBase.
>>> But I would suggest you look very closely at both of them, because
>>> each has its own advantages. So the choice will depend on
>>> your use case.
>>>
>>> One added advantage of HBase is its deeper integration with the
>>> Hadoop ecosystem, so you can do a lot with HBase data using Hadoop
>>> tools. HBase integrates with Hive for querying, but AFAIK it has some
>>> limitations.
>>>
>>> HTH,
>>> Anil Gupta
>>>
>>>
>>> On Sun, Nov 25, 2012 at 4:52 AM, Mahesh Balija <
>>> balijamahesh.mca@gmail.com> wrote:
>>>
>>>> Hi Jeff,
>>>>
>>>>         As the HDFS paradigm is "write once, read many", you cannot
>>>> update files on HDFS in place.
>>>>         For your problem, what you can do is keep the
>>>> logs/user data in HDFS with different timestamps.
>>>>         Then run MapReduce jobs at regular intervals to extract the
>>>> required data from those logs and load it into HBase/Cassandra/MongoDB.
>>>>
>>>>         MongoDB read performance is quite fast, and it supports
>>>> ad-hoc querying. You can also use the Hadoop-MongoDB connector to
>>>> read/write data in MongoDB through Hadoop MapReduce.
>>>>
>>>>         If you specifically need to update the HDFS files directly,
>>>> then you have to use a commercial Hadoop package like MapR, which
>>>> supports updating the files in place.
>>>>
>>>> Best,
>>>> Mahesh Balija,
>>>> Calsoft Labs.
>>>>
>>>>
>>>>
>>>> On Sun, Nov 25, 2012 at 9:40 AM, bharath vissapragada <
>>>> bharathvissapragada1990@gmail.com> wrote:
>>>>
>>>>> Hi Jeff,
>>>>>
>>>>> Please look at [1]. You can store your data in HBase tables and query
>>>>> them normally just by mapping them to Hive tables. Regarding Cassandra
>>>>> support, please follow JIRA [2]; it's not yet in trunk, I suppose!
>>>>>
>>>>> [1] https://cwiki.apache.org/Hive/hbaseintegration.html
>>>>> [2] https://issues.apache.org/jira/browse/HIVE-1434
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>> On Sun, Nov 25, 2012 at 2:26 AM, jeff l <jeff.pubmail@gmail.com> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I'm coming from the RDBMS world and am looking at hdfs for long term
>>>>>> data storage and analysis.
>>>>>>
>>>>>> I've done some research and set up some smallish hdfs clusters with
>>>>>> hive for testing but I'm having a little trouble understanding how
>>>>>> everything fits together and was hoping someone could point me in the
>>>>>> right direction.
>>>>>>
>>>>>> I'm looking at storing two types of data:
>>>>>>
>>>>>> 1. Append-only data - e.g. weblogs or user logins
>>>>>> 2. Account/User data
>>>>>>
>>>>>> HDFS seems to be perfect for append-only data like #1, but I'm having
>>>>>> trouble figuring out what to do with data that may change frequently.
>>>>>>
>>>>>> A simple example would be user data where various bits of
>>>>>> information: email, etc may change from day to day.  Would hbase or
>>>>>> cassandra be the better way to go for this type of data, and can I
>>>>>> overlay hive over all ( hdfs, hbase, cassandra ) so that I can query
>>>>>> the data through a single interface?
>>>>>>
>>>>>> Thanks in advance for any help.
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Bharath .V
>>>>> w:http://researchweb.iiit.ac.in/~bharath.v
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks & Regards,
>>> Anil Gupta
>>>
>>
>>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>
>
>


-- 
Thanks & Regards,
Anil Gupta
