hadoop-common-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: Best practice for storage of data that changes
Date Fri, 30 Nov 2012 04:51:41 GMT
Please! There are lots of blogs etc. about the two, but very few head-to-head
comparisons for a real use case.

----- Original Message -----

| From: "anil gupta" <anilgupta84@gmail.com>
| To: "common-user@hadoop.apache.org" <user@hadoop.apache.org>
| Sent: Wednesday, November 28, 2012 11:01:55 AM
| Subject: Re: Best practice for storage of data that changes

| Hi Jeff,

| At my workplace, Intuit, we did a detailed study to evaluate
| HBase and Cassandra for our use case. I will see if I can post the
| comparative study on my public blog or on this mailing list.

| BTW, what is your use case? What bottleneck are you hitting with your
| current solutions? If you can share some details, the HBase
| community will try to help you out.

| Thanks,
| Anil Gupta

| On Wed, Nov 28, 2012 at 9:55 AM, jeff l <jeff.pubmail@gmail.com> wrote:

| | Hi,
| 

| | I have quite a bit of experience with RDBMSs (Oracle, Postgres,
| | MySQL) and MongoDB but don't feel any are quite right for this
| | problem. The amount of data being stored and the access requirements
| | just don't match up well.
| 

| | I was hoping to keep the stack as simple as possible and just use
| | HDFS, but everything I was seeing kept pointing to the need for some
| | other datastore. I'll check out both HBase and Cassandra.
| 

| | Thanks for the feedback.
| 

| | On Sun, Nov 25, 2012 at 1:11 PM, anil gupta <anilgupta84@gmail.com> wrote:
| 

| | | Hi Jeff,
| | 
| 

| | | My two cents below:
| | 
| 

| | | 1st use case: Append-only data - e.g. weblogs or user logins
| | 
| 
| | | As others have already mentioned, Hadoop is well suited to storing
| | | append-only data. If you want to analyze weblogs or user logins,
| | | Hadoop is a suitable solution for it.
| | 
| 

| | | 2nd use case: Account/User data
| | 
| 
| | | First of all, I would suggest you look at your use case and then
| | | analyze whether it really needs a NoSQL solution or not.
| | 
| 
| | | You were talking about maintaining user data in NoSQL. Why NoSQL
| | | instead of an RDBMS? What is the size of the data? Which NoSQL
| | | features are the selling points for you?
| | 
| 

| | | For real-time reads and writes you can have a look at Cassandra or
| | | HBase. But I would suggest you take a very close look at both of
| | | them, because each has its own advantages. So the choice will
| | | depend on your use case.
| | 
| 

| | | One added advantage of HBase is that it has deeper integration
| | | with the Hadoop ecosystem, so you can do a lot with HBase data
| | | using Hadoop tools. HBase also integrates with Hive for querying,
| | | but AFAIK it has some limitations.
| | 
| 

| | | HTH,
| | 
| 
| | | Anil Gupta
| | 
| 

| | | On Sun, Nov 25, 2012 at 4:52 AM, Mahesh Balija <balijamahesh.mca@gmail.com> wrote:
| | 
| 

| | | | Hi Jeff,
| | | 
| | 
| 

| | | | As the HDFS paradigm is "write once, read many", you cannot
| | | | update files on HDFS in place.
| | | 
| | 
| 
| | | | But for your problem, what you can do is keep the logs/user data
| | | | in HDFS with different timestamps.
| | | 
| | 
| 
| | | | Run MapReduce jobs at certain intervals to extract the required
| | | | data from those logs and put it into HBase/Cassandra/MongoDB.
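
A minimal sketch of the periodic extraction step described above, assuming each stored change record carries a user ID, a timestamp, and the fields that changed. The record layout and the name `latest_state` are hypothetical, and plain Python stands in for what would be a MapReduce job on a real cluster; the reduced result is what you would load into HBase, Cassandra, or MongoDB.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (user_id, timestamp, changed_fields).
Record = Tuple[str, int, Dict[str, str]]


def latest_state(records: Iterable[Record]) -> Dict[str, Dict[str, str]]:
    """Fold append-only change records into the newest value of each field per user."""
    state: Dict[str, Dict[str, str]] = {}
    newest: Dict[Tuple[str, str], int] = {}  # (user_id, field) -> timestamp kept
    for user_id, ts, fields in records:
        row = state.setdefault(user_id, {})
        for name, value in fields.items():
            key = (user_id, name)
            # Keep a value only if it is at least as new as the one we hold,
            # so the result does not depend on the order records arrive in.
            if key not in newest or ts >= newest[key]:
                newest[key] = ts
                row[name] = value
    return state


# Append-only log of user changes, as it might accumulate in HDFS.
log = [
    ("u1", 100, {"email": "a@example.com", "name": "Ann"}),
    ("u2", 105, {"email": "b@example.com"}),
    ("u1", 200, {"email": "ann@example.com"}),
]
current = latest_state(log)
print(current["u1"])
```

The append-only files stay untouched; only the derived "current" view is rewritten on each run.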
| | | 
| | 
| 

| | | | MongoDB read performance is quite fast, and it also supports
| | | | ad-hoc querying. You can also use the Hadoop-MongoDB connector to
| | | | read/write data to MongoDB through Hadoop MapReduce.
| | | 
| | 
| 

| | | | If you specifically need to update the HDFS files directly, then
| | | | you have to use a commercial Hadoop package like MapR, which
| | | | supports updating the files.
| | | 
| | 
| 

| | | | Best,
| | | 
| | 
| 
| | | | Mahesh Balija,
| | | 
| | 
| 
| | | | Calsoft Labs.
| | | 
| | 
| 

| | | | On Sun, Nov 25, 2012 at 9:40 AM, bharath vissapragada <bharathvissapragada1990@gmail.com> wrote:
| | | 
| | 
| 

| | | | | Hi Jeff,
| | | | 
| | | 
| | 
| 

| | | | | Please look at [1]. You can store your data in HBase tables and
| | | | | query them normally just by mapping them to Hive tables.
| | | | | Regarding Cassandra support, please follow JIRA [2]; it's not
| | | | | yet in the trunk, I suppose!
| | | | 
| | | 
| | 
| 

| | | | | [1] https://cwiki.apache.org/Hive/hbaseintegration.html
| | | | 
| | | 
| | 
| 
| | | | | [2] https://issues.apache.org/jira/browse/HIVE-1434
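
A hedged sketch of the HBase-to-Hive mapping mentioned above: a small Python helper that builds the HiveQL for an external table backed by an existing HBase table. The storage handler class and the `hbase.columns.mapping` / `hbase.table.name` properties come from the Hive HBase integration; the helper name `hive_over_hbase_ddl` and the table, column, and column-family names are made up for illustration.

```python
def hive_over_hbase_ddl(hive_table, hbase_table, columns):
    """Return HiveQL exposing an existing HBase table as an external Hive table.

    columns: list of (hive_column, hive_type, hbase_mapping) tuples, where
    hbase_mapping is ":key" for the row key or "family:qualifier" otherwise.
    """
    # Sanity check chosen for this sketch: exactly one column maps to the row key.
    if sum(1 for _, _, m in columns if m == ":key") != 1:
        raise ValueError("exactly one column must map to ':key'")
    col_defs = ", ".join(f"{name} {hive_type}" for name, hive_type, _ in columns)
    mapping = ",".join(m for _, _, m in columns)
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({col_defs})\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{mapping}')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}');"
    )


# Hypothetical user table with one column family named "info".
ddl = hive_over_hbase_ddl(
    "users_hive",
    "users",
    [("user_id", "string", ":key"), ("email", "string", "info:email")],
)
print(ddl)
```

Once such a table exists, Hive queries against `users_hive` read the live HBase rows, so frequently changing account data and append-only HDFS data can be queried through the same interface.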
| | | | 
| | | 
| | 
| 

| | | | | Thanks,
| | | | 
| | | 
| | 
| 

| | | | | On Sun, Nov 25, 2012 at 2:26 AM, jeff l <jeff.pubmail@gmail.com> wrote:
| | | | 
| | | 
| | 
| 

| | | | | | Hi All,
| | | | | 
| | | | 
| | | 
| | 
| 

| | | | | | I'm coming from the RDBMS world and am looking at HDFS for
| | | | | | long-term data storage and analysis.
| | | | | 
| | | | 
| | | 
| | 
| 

| | | | | | I've done some research and set up some smallish HDFS clusters
| | | | | | with Hive for testing, but I'm having a little trouble
| | | | | | understanding how everything fits together and was hoping
| | | | | | someone could point me in the right direction.
| | | | | 
| | | | 
| | | 
| | 
| 

| | | | | | I'm looking at storing two types of data:
| | | | | 
| | | | 
| | | 
| | 
| 

| | | | | | 1. Append-only data - e.g. weblogs or user logins
| | | | | 
| | | | 
| | | 
| | 
| 
| | | | | | 2. Account/User data
| | | | | 
| | | | 
| | | 
| | 
| 

| | | | | | HDFS seems to be perfect for append-only data like #1, but I'm
| | | | | | having trouble figuring out what to do with data that may
| | | | | | change frequently.
| | | | | 
| | | | 
| | | 
| | 
| 

| | | | | | A simple example would be user data where various bits of
| | | | | | information (email, etc.) may change from day to day. Would
| | | | | | HBase or Cassandra be the better way to go for this type of
| | | | | | data, and can I overlay Hive over all of them (HDFS, HBase,
| | | | | | Cassandra) so that I can query the data through a single
| | | | | | interface?
| | | | | 
| | | | 
| | | 
| | 
| 

| | | | | | Thanks in advance for any help.
| | | | | 
| | | | 
| | | 
| | 
| 
| | | | | --
| | | | 
| | | 
| | 
| 
| | | | | Regards,
| | | | 
| | | 
| | 
| 
| | | | | Bharath .V
| | | | 
| | | 
| | 
| 
| | | | | w: http://researchweb.iiit.ac.in/~bharath.v
| | | | 
| | | 
| | 
| 

| | | --
| | 
| 
| | | Thanks & Regards,
| | 
| 
| | | Anil Gupta
| | 
| 

| --
| Thanks & Regards,
| Anil Gupta
