hadoop-mapreduce-user mailing list archives

From anil gupta <anilgupt...@gmail.com>
Subject Re: Best practice for storage of data that changes
Date Sun, 25 Nov 2012 21:11:29 GMT
Hi Jeff,

My two cents below:

1st use case: Append-only data - e.g. weblogs or user logins
As others have already mentioned, Hadoop is well suited to storing
append-only data. If you want to analyze weblogs or user logins, then
Hadoop is a suitable solution for it.
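To make the append-only case concrete, here is a minimal sketch in plain
Python of the kind of per-user aggregation a MapReduce job over login logs
would compute. The log format and function names are invented for
illustration only; a real job would use the Hadoop MapReduce API instead:

```python
from collections import defaultdict

# Hypothetical append-only login log: one "timestamp user action" per line.
LOG_LINES = [
    "2012-11-25T09:40:00 jeff login",
    "2012-11-25T09:41:10 anil login",
    "2012-11-25T10:02:33 jeff login",
]

def map_phase(lines):
    # Map step: emit a (user, 1) pair for every login event.
    for line in lines:
        _ts, user, action = line.split()
        if action == "login":
            yield user, 1

def reduce_phase(pairs):
    # Reduce step: sum the counts for each user key.
    counts = defaultdict(int)
    for user, n in pairs:
        counts[user] += n
    return dict(counts)

print(reduce_phase(map_phase(LOG_LINES)))  # {'jeff': 2, 'anil': 1}
```

Because the data is append-only, new log files can simply be dropped into
the same HDFS directory and picked up by the next run of the job.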

2nd use case: Account/User data
First of all, I would suggest you take a close look at your use case and
analyze whether it really needs a NoSQL solution or not.
You were talking about maintaining user data in NoSQL. Why NoSQL instead
of an RDBMS? What is the size of the data? Which NoSQL features are the
selling points for you?

For real-time reads and writes you can look at Cassandra or HBase. But I
would suggest examining both of them very closely, because each has its
own advantages. So the choice will depend on your use case.

One added advantage of HBase is that it has deeper integration with the
Hadoop ecosystem, so you can do a lot with HBase data using Hadoop
tools. HBase also integrates with Hive for querying, but AFAIK that
integration has some limitations.

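For reference, the Hive-HBase integration mentioned above works by
declaring an external Hive table backed by the HBase storage handler. A
minimal sketch (the table and column names here are invented for
illustration):

```sql
-- Hive table backed by an existing HBase table (names are hypothetical).
CREATE EXTERNAL TABLE users_hive (key STRING, email STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:email")
TBLPROPERTIES ("hbase.table.name" = "users");

-- After that, normal HiveQL runs against the live HBase data:
SELECT email FROM users_hive WHERE key = 'user123';
```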
Anil Gupta

On Sun, Nov 25, 2012 at 4:52 AM, Mahesh Balija wrote:

> Hi Jeff,
>         As HDFS follows a "write once, read many" paradigm, you cannot
> update files on HDFS in place.
>         But for your problem, what you can do is keep the logs/user data
> in HDFS with different timestamps.
>         Run MapReduce jobs at certain intervals to extract the required
> data from those logs and put it into HBase/Cassandra/MongoDB.
>         MongoDB read performance is quite fast, and it supports ad-hoc
> querying. You can also use the Hadoop-MongoDB connector to read/write data
> to MongoDB through Hadoop MapReduce.
>         If you specifically need to update HDFS files directly, then
> you have to use a commercial Hadoop package like MapR, which supports
> updating files.
> Best,
> Mahesh Balija,
> Calsoft Labs.
> On Sun, Nov 25, 2012 at 9:40 AM, bharath vissapragada <
> bharathvissapragada1990@gmail.com> wrote:
>> Hi Jeff,
>> Please look at [1]. You can store your data in HBase tables and query
>> them normally just by mapping them to Hive tables. Regarding Cassandra
>> support, please follow JIRA [2]; it's not yet in trunk, I suppose!
>> [1] https://cwiki.apache.org/Hive/hbaseintegration.html
>> [2] https://issues.apache.org/jira/browse/HIVE-1434
>> Thanks,
>> On Sun, Nov 25, 2012 at 2:26 AM, jeff l <jeff.pubmail@gmail.com> wrote:
>>> Hi All,
>>> I'm coming from the RDBMS world and am looking at HDFS for long-term
>>> data storage and analysis.
>>> I've done some research and set up some smallish HDFS clusters with Hive
>>> for testing, but I'm having a little trouble understanding how everything
>>> fits together, and was hoping someone could point me in the right direction.
>>> I'm looking at storing two types of data:
>>> 1. Append-only data - e.g. weblogs or user logins
>>> 2. Account/User data
>>> HDFS seems to be perfect for append-only data like #1, but I'm having
>>> trouble figuring out what to do with data that may change frequently.
>>> A simple example would be user data, where various bits of information
>>> (email, etc.) may change from day to day. Would HBase or Cassandra be the
>>> better way to go for this type of data, and can I overlay Hive over all
>>> of them (HDFS, HBase, Cassandra) so that I can query the data through a
>>> single interface?
>>> Thanks in advance for any help.
>> --
>> Regards,
>> Bharath .V
>> w:http://researchweb.iiit.ac.in/~bharath.v

Thanks & Regards,
Anil Gupta
