hadoop-common-user mailing list archives

From jeff l <jeff.pubm...@gmail.com>
Subject Re: Best practice for storage of data that changes
Date Wed, 28 Nov 2012 17:55:55 GMT
Hi,

I have quite a bit of experience with RDBMSs (Oracle, Postgres, MySQL)
and MongoDB, but I don't feel any of them is quite right for this problem. The
amount of data being stored and the access requirements just don't match up
well.

I was hoping to keep the stack as simple as possible and just use HDFS, but
everything I was seeing kept pointing to the need for some other datastore.
I'll check out both HBase and Cassandra.

Thanks for the feedback.


On Sun, Nov 25, 2012 at 1:11 PM, anil gupta <anilgupta84@gmail.com> wrote:

> Hi Jeff,
>
> My two cents below:
>
> 1st use case: Append-only data - e.g. weblogs or user logins
> As others have already mentioned, Hadoop is well suited to storing
> append-only data. If you want to analyze weblogs or user logins, then
> Hadoop is a suitable solution.
>
>
> 2nd use case: Account/User data
> First of all, I would suggest taking a look at your use case and
> analyzing whether it really needs a NoSQL solution or not.
> You were talking about maintaining user data in NoSQL. Why NoSQL
> instead of an RDBMS? What is the size of the data? Which NoSQL features are
> the selling points for you?
>
> For real-time reads and writes you can have a look at Cassandra or HBase.
> But I would suggest looking very closely at both of them, because each
> has its own advantages. So the choice will depend on your
> use case.
>
> One added advantage of HBase is that it has deeper integration with the
> Hadoop ecosystem, so you can do a lot with HBase data using Hadoop
> tools. HBase has integration with Hive querying, but AFAIK it has some
> limitations.
>
> HTH,
> Anil Gupta
>
>
> On Sun, Nov 25, 2012 at 4:52 AM, Mahesh Balija <balijamahesh.mca@gmail.com> wrote:
>
>> Hi Jeff,
>>
>>         As the HDFS paradigm is "write once, read many", you cannot
>> update files on HDFS in place.
>>         But for your problem, what you can do is keep the
>> logs/user data in HDFS with different timestamps.
>>         Run MapReduce jobs at certain intervals to extract the required
>> data from those logs and put it into HBase/Cassandra/MongoDB.
>>
>>         MongoDB read performance is quite fast, and it also supports ad-hoc
>> querying. You can also use the Hadoop-MongoDB connector to read/write data
>> to MongoDB through Hadoop MapReduce.
>>
>>         If you specifically need to update the HDFS files directly,
>> then you have to use a commercial Hadoop package like MapR, which
>> supports updating the files.
>>
>> Best,
>> Mahesh Balija,
>> Calsoft Labs.
>>
>>
>>
>> On Sun, Nov 25, 2012 at 9:40 AM, bharath vissapragada <bharathvissapragada1990@gmail.com> wrote:
>>
>>> Hi Jeff,
>>>
>>> Please look at [1]. You can store your data in HBase tables and query
>>> them normally just by mapping them to Hive tables. Regarding Cassandra
>>> support, please follow JIRA [2]; it's not yet in the trunk, I suppose!
>>>
>>> [1] https://cwiki.apache.org/Hive/hbaseintegration.html
>>> [2] https://issues.apache.org/jira/browse/HIVE-1434
>>>
>>> Thanks,
>>>
>>>
>>> On Sun, Nov 25, 2012 at 2:26 AM, jeff l <jeff.pubmail@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I'm coming from the RDBMS world and am looking at HDFS for long-term
>>>> data storage and analysis.
>>>>
>>>> I've done some research and set up some smallish HDFS clusters with
>>>> Hive for testing, but I'm having a little trouble understanding how
>>>> everything fits together and was hoping someone could point me in the right
>>>> direction.
>>>>
>>>> I'm looking at storing two types of data:
>>>>
>>>> 1. Append-only data - e.g. weblogs or user logins
>>>> 2. Account/User data
>>>>
>>>> HDFS seems to be perfect for append-only data like #1, but I'm having
>>>> trouble figuring out what to do with data that may change frequently.
>>>>
>>>> A simple example would be user data, where various bits of information
>>>> (email, etc.) may change from day to day. Would HBase or Cassandra be the
>>>> better way to go for this type of data, and can I overlay Hive over all of
>>>> them (HDFS, HBase, Cassandra) so that I can query the data through a single
>>>> interface?
>>>>
>>>> Thanks in advance for any help.
>>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Bharath .V
>>> w:http://researchweb.iiit.ac.in/~bharath.v
>>>
>>
>>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>

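As an illustration of the approach Mahesh describes above (timestamped
records kept in HDFS, with a periodic job extracting the current state),
here is a minimal Python sketch of the "latest version per user" step. It
is not from the thread: the UserRecord fields, the sample values, and the
function name latest_per_user are assumptions made for illustration, and a
real job would run as MapReduce over files in HDFS rather than over an
in-memory list.

```python
from typing import Dict, Iterable, NamedTuple


class UserRecord(NamedTuple):
    """One appended version of a user's data (hypothetical schema)."""
    user_id: str
    email: str
    updated_at: int  # epoch seconds when this version was written


def latest_per_user(records: Iterable[UserRecord]) -> Dict[str, UserRecord]:
    """Keep only the newest version of each user.

    Records are never updated in place; a changed email is simply a new
    record with a later timestamp, and this pass picks the winner.
    """
    latest: Dict[str, UserRecord] = {}
    for rec in records:
        current = latest.get(rec.user_id)
        if current is None or rec.updated_at > current.updated_at:
            latest[rec.user_id] = rec
    return latest


# Sample append-only history: user u1 changed email one day later.
snapshots = [
    UserRecord("u1", "old@example.com", 1353800000),
    UserRecord("u2", "b@example.com", 1353800500),
    UserRecord("u1", "new@example.com", 1353886400),
]

current_state = latest_per_user(snapshots)
for user_id, rec in sorted(current_state.items()):
    print(user_id, rec.email, rec.updated_at)
```

The resulting per-user records are what would be loaded into HBase,
Cassandra, or MongoDB for real-time reads, while the full history stays in
HDFS for analysis.
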