hadoop-user mailing list archives

From Mahesh Balija <balijamahesh....@gmail.com>
Subject Re: Best practice for storage of data that changes
Date Sun, 25 Nov 2012 12:52:09 GMT
Hi Jeff,

        HDFS follows a "write once, read many" model, so you cannot update
files in place on HDFS.
        For your problem, you can instead keep the logs/user data in HDFS as
files with different timestamps.
        Then run MapReduce jobs at regular intervals to extract the required
data from those logs and load it into HBase/Cassandra/MongoDB.
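
        A minimal sketch of that extraction step, in plain Python rather than
a real MapReduce job. It assumes tab-separated lines of the form
user_id, timestamp, email (a hypothetical layout, not taken from Jeff's data)
and keeps only the newest record per user. The function names are
illustrative only.

```python
from itertools import groupby
from operator import itemgetter


def map_record(line):
    """Map step: parse 'user_id<TAB>timestamp<TAB>email' into (key, value)."""
    # Assumed record layout; adjust the parsing to the real log format.
    user_id, timestamp, email = line.rstrip("\n").split("\t")
    return user_id, (int(timestamp), email)


def reduce_latest(user_id, values):
    """Reduce step: keep the value with the highest timestamp for one user."""
    timestamp, email = max(values, key=itemgetter(0))
    return user_id, timestamp, email


def latest_per_user(lines):
    """Simulate shuffle/sort locally: group mapped pairs by user_id, then reduce."""
    pairs = sorted(map(map_record, lines), key=itemgetter(0))
    return [
        reduce_latest(user_id, [value for _, value in group])
        for user_id, group in groupby(pairs, key=itemgetter(0))
    ]
```

        In an actual job the same map and reduce logic would run on Hadoop
over the timestamped files, and each output row would be written to the
target store instead of being returned as a list.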

        MongoDB read performance is quite fast, and it supports ad hoc
querying. You can also use the Hadoop-MongoDB connector to read and write
data to MongoDB through Hadoop MapReduce.

        If you specifically need to update files in HDFS directly, you will
have to use a commercial Hadoop distribution such as MapR, which supports
updating files through its HDFS-compatible file system.

Mahesh Balija,
Calsoft Labs.

On Sun, Nov 25, 2012 at 9:40 AM, bharath vissapragada <
bharathvissapragada1990@gmail.com> wrote:

> Hi Jeff,
> Please look at [1]. You can store your data in HBase tables and query
> them normally just by mapping them to Hive tables. Regarding Cassandra
> support, please follow JIRA [2]; it's not yet in trunk, I suppose!
> [1] https://cwiki.apache.org/Hive/hbaseintegration.html
> [2] https://issues.apache.org/jira/browse/HIVE-1434
> Thanks,
> On Sun, Nov 25, 2012 at 2:26 AM, jeff l <jeff.pubmail@gmail.com> wrote:
>> Hi All,
>> I'm coming from the RDBMS world and am looking at hdfs for long term data
>> storage and analysis.
>> I've done some research and set up some smallish hdfs clusters with hive
>> for testing but I'm having a little trouble understanding how everything
>> fits together and was hoping someone could point me in the right direction.
>> I'm looking at storing two types of data:
>> 1. Append-only data - e.g. weblogs or user logins
>> 2. Account/User data
>> HDFS seems to be perfect for append-only data like #1, but I'm having
>> trouble figuring out what to do with data that may change frequently.
>> A simple example would be user data where various bits of information
>> (email, etc.) may change from day to day.  Would hbase or cassandra be the
>> better way to go for this type of data, and can I overlay hive over all (
>> hdfs, hbase, cassandra ) so that I can query the data through a single
>> interface?
>> Thanks in advance for any help.
> --
> Regards,
> Bharath .V
> w:http://researchweb.iiit.ac.in/~bharath.v
