Date: Wed, 28 Nov 2012 09:55:55 -0800
Subject: Re: Best practice for storage of data that changes
From: jeff l
To: user@hadoop.apache.org

Hi,

I have quite a bit of experience with RDBMSs (Oracle, Postgres, MySQL) and MongoDB, but none of them feel quite right for this problem. The amount of data being stored and the access requirements just don't match up well.

I was hoping to keep the stack as simple as possible and just use HDFS, but everything I read kept pointing to the need for some other datastore. I'll check out both HBase and Cassandra.

Thanks for the feedback.

On Sun, Nov 25, 2012 at 1:11 PM, anil gupta wrote:
> Hi Jeff,
>
> My two cents below:
>
> 1st use case: append-only data, e.g. weblogs or user logins.
> As others have already mentioned, Hadoop is well suited to storing
> append-only data. If you want to analyze weblogs or user logins,
> Hadoop is a good fit.
>
> 2nd use case: account/user data.
> First of all, I would suggest you look closely at your use case and
> analyze whether it really needs a NoSQL solution. You mention keeping
> user data in NoSQL: why NoSQL instead of an RDBMS? What is the size of
> the data? Which NoSQL features are the selling points for you?
>
> For real-time reads and writes you can look at Cassandra or HBase.
> But I would suggest you take a very close look at both of them,
> because each has its own advantages.
> So the choice will depend on your use case.
>
> One added advantage of HBase is that it has deeper integration with
> the Hadoop ecosystem, so you can do a lot with HBase data using Hadoop
> tools. HBase also integrates with Hive for querying, but AFAIK that
> has some limitations.
>
> HTH,
> Anil Gupta
>
> On Sun, Nov 25, 2012 at 4:52 AM, Mahesh Balija wrote:
>
>> Hi Jeff,
>>
>> Since the HDFS paradigm is "write once, read many", you cannot
>> update files on HDFS in place. But for your problem, what you can do
>> is keep the logs/user data in HDFS with different timestamps, then
>> run MapReduce jobs at certain intervals to extract the required data
>> from those logs and put it into HBase/Cassandra/MongoDB.
>>
>> MongoDB read performance is quite fast, and it supports ad-hoc
>> querying. You can also use the Hadoop-MongoDB connector to
>> read/write data to MongoDB through Hadoop MapReduce.
>>
>> If you specifically need to update HDFS files in place, then you
>> would have to use a commercial Hadoop distribution like MapR, which
>> supports updating files.
>>
>> Best,
>> Mahesh Balija,
>> Calsoft Labs.
>>
>> On Sun, Nov 25, 2012 at 9:40 AM, bharath vissapragada <
>> bharathvissapragada1990@gmail.com> wrote:
>>
>>> Hi Jeff,
>>>
>>> Please look at [1]. You can store your data in HBase tables and
>>> query them normally just by mapping them to Hive tables. Regarding
>>> Cassandra support, please follow JIRA [2]; it's not yet in trunk,
>>> I believe.
>>>
>>> [1] https://cwiki.apache.org/Hive/hbaseintegration.html
>>> [2] https://issues.apache.org/jira/browse/HIVE-1434
>>>
>>> Thanks,
>>>
>>> On Sun, Nov 25, 2012 at 2:26 AM, jeff l wrote:
>>>
>>>> Hi All,
>>>>
>>>> I'm coming from the RDBMS world and am looking at HDFS for
>>>> long-term data storage and analysis.
>>>>
>>>> I've done some research and set up some smallish HDFS clusters
>>>> with Hive for testing, but I'm having a little trouble
>>>> understanding how everything fits together and was hoping someone
>>>> could point me in the right direction.
>>>>
>>>> I'm looking at storing two types of data:
>>>>
>>>> 1. Append-only data - e.g. weblogs or user logins
>>>> 2. Account/user data
>>>>
>>>> HDFS seems to be perfect for append-only data like #1, but I'm
>>>> having trouble figuring out what to do with data that may change
>>>> frequently.
>>>>
>>>> A simple example would be user data where various bits of
>>>> information (email, etc.) may change from day to day. Would HBase
>>>> or Cassandra be the better way to go for this type of data, and
>>>> can I overlay Hive over all of them (HDFS, HBase, Cassandra) so
>>>> that I can query the data through a single interface?
>>>>
>>>> Thanks in advance for any help.
>>>
>>> --
>>> Regards,
>>> Bharath .V
>>> w: http://researchweb.iiit.ac.in/~bharath.v
>
> --
> Thanks & Regards,
> Anil Gupta
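For reference, the Hive-over-HBase mapping discussed in [1] can be sketched roughly as below. This is a minimal illustration only: the table name, column families, and columns are made-up assumptions, not anything from the thread, and the exact properties should be checked against the Hive HBase integration wiki for your version.

```sql
-- Hive table backed by an existing HBase table, so frequently-changing
-- user data kept in HBase can still be queried through Hive.
-- All names here (users, info:email, stats:login_count) are illustrative.
CREATE EXTERNAL TABLE users (
  user_id     STRING,
  email       STRING,
  login_count BIGINT
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  -- map the HBase row key and two columns to the Hive columns above
  "hbase.columns.mapping" = ":key,info:email,stats:login_count"
)
TBLPROPERTIES ("hbase.table.name" = "users");

-- Then query it like any other Hive table:
-- SELECT user_id, email FROM users WHERE login_count > 100;
```

With a mapping like this, append-only data can stay in plain HDFS-backed Hive tables while mutable account data lives in HBase, and both are queryable through the single Hive interface the original question asked about.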
