hadoop-hdfs-user mailing list archives

From "ados1984@gmail.com" <ados1...@gmail.com>
Subject Use Cases for Structured Data
Date Wed, 12 Mar 2014 19:07:46 GMT
Hello Team,

I am starting out with the Hadoop ecosystem and first wanted to learn,
based on my use case, whether Hadoop is the right tool for me.

I have only structured data, and my goal is to save this data into Hadoop
and take advantage of the replication factor. I am using Microsoft tools
for analysis; they provide good drag-and-drop functionality for creating
different kinds of analysis, and they also have Hadoop drivers, so they
can use Hadoop as a data source for analysis.

My question here is: what benefits does the YARN architecture give me in
terms of analysis that my Microsoft, Netezza, or Tableau products are not
giving me? I am just trying to understand the value of introducing Hadoop
into my architecture in terms of analysis, apart from data replication.
Any insights would be very helpful.

Also, my goal for the POC is efficient data storage and retrieval, so I
have the following questions:

   1. How does data retrieval work in Hadoop?
   2. Do I always need some kind of data store on top of HDFS, such as
   HBase/Cassandra/MongoDB, or is there no need for one, so that I can
   store all my data in HDFS directly and retrieve it when needed using
   different analytic tools that have HDFS as a data source?
   3. Say I have a 3-node cluster, one master and 2 slaves, and I am
   trying to insert data into Hadoop. What cycle does the framework
   perform to store my data in HDFS? Does my process read all the metadata
   from the master node about where my slave nodes are and what data
   should go to which slave node, or is all the data sent to the master
   node, which then reads the metadata and decides which portion of the
   data should go to which node?
   4. Also, if I have a 3-node cluster with 1 master and 2 slaves, my
   data is equally distributed across the two nodes, and I have
   replication set to 2, then where and how will replication take place,
   as I do not have any vacant node for replication?
   5. Also, for the POC, does it make sense to go with a free Cloudera
   3-node cluster or a free Hortonworks 3-node cluster, or with the
   open-source Hadoop version? If we go with the open-source version,
   where do we define which node is the master and which is a slave? Also,
   can we have all 3 nodes on the same machine, or do they need to be on
   different machines?
   6. Also, what are the pros and cons of going with
   Hortonworks/Cloudera as opposed to Apache Hadoop from an initial POC
   point of view?
   7. Also, if we go with Hortonworks/Cloudera, which tools come bundled
   with the Hadoop framework? And if we go with Apache Hadoop, do we get
   any tools like Pig or Hive bundled, or do we have to install them
   separately?

Since I have only recently started on my Hadoop journey, I would really
appreciate it if the community could point me in the right direction.

Regards, Andy.
