From: "ados1984@gmail.com"
Date: Wed, 12 Mar 2014 15:37:22 -0400
Subject: Re: Use Cases for Structured Data
To: user@hadoop.apache.org

Thank you, Shahab, but it would be really nice if I could get some input
on my initial question, as it would really help.

On Wed, Mar 12, 2014 at 3:11 PM, Shahab Yunus <shahab.yunus@gmail.com> wrote:

> I would suggest that, given the level of detail you are looking for and
> the fundamental nature of your questions, you should get hold of books
> or online documentation. Basically, some reading/research.
>
> The latest edition of
> http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520
> is highly recommended to begin with.
>
> Regards,
> Shahab
>
> On Wed, Mar 12, 2014 at 3:07 PM, ados1984@gmail.com <ados1984@gmail.com> wrote:
>
>> Hello Team,
>>
>> I am starting off with the Hadoop ecosystem and first want to learn,
>> based on my use case, whether Hadoop is the right tool for me.
>>
>> I have only structured data, and my goal is to save this data into
>> Hadoop and take advantage of the replication factor. I am using
>> Microsoft tools for analysis; they give me good drag-and-drop
>> functionality for creating different kinds of analysis, and they also
>> have Hadoop drivers, so they can use Hadoop as a data source.
>>
>> My question is what benefits the YARN architecture gives me in terms
>> of analysis that my Microsoft, Netezza, or Tableau products are not
>> giving me. I am just trying to understand the value of introducing
>> Hadoop into my architecture in terms of analysis, apart from data
>> replication. Any insights would be very helpful.
>>
>> Also, my goal for the POC is related to efficient data
>> storage/retrieval, and so:
>>
>> 1. How does data retrieval work in Hadoop?
>> 2. Do I always need some kind of data store on top of HDFS, like
>>    HBase/Cassandra/Mongo, or is there no need for one, so that I can
>>    store all my data in HDFS directly and retrieve it when I need it
>>    using different analytic tools that have HDFS as a data source?
>> 3. Say I have a 3-node cluster, one master and 2 slaves, and I am
>>    trying to insert data into Hadoop. What is the cycle that the
>>    framework performs to insert my data into HDFS: does my process
>>    read all the metadata from the master node about where my slave
>>    nodes are and what data should go to which slave node, or is all
>>    the data sent to the master node, which then reads the metadata
>>    and decides what portion of the data should go to which node?
>> 4. Also, if I have a 3-node cluster with 1 master and 2 slaves, my
>>    data is equally distributed across the two slave nodes, and I have
>>    replication set to 2, then where and how will replication take
>>    place, as I do not have any vacant node for replication?
>> 5. Also, for the POC, does it make sense to go with the Cloudera
>>    3-node free cluster or the Hortonworks 3-node free cluster, or
>>    does it make sense to go with the open source Hadoop version? If
>>    we go with the open source version, where do we define which node
>>    is the master and which is a slave, and can we have all 3 nodes on
>>    the same machine, or do we need all 3 nodes on different machines?
>> 6. Also, what are the pros and cons of going with
>>    Hortonworks/Cloudera as opposed to Apache Hadoop from an initial
>>    POC point of view?
>> 7. Also, if we go with Hortonworks/Cloudera, which tools come bundled
>>    with the Hadoop framework? And if we go with Apache Hadoop, do we
>>    get any tools like Pig and Hive bundled with it, or do we have to
>>    install them separately?
>>
>> Since I am just starting off on my Hadoop journey, I would really
>> appreciate it if the community could point me in the right direction.
>>
>> Regards,
>> Andy