hadoop-user mailing list archives

From "ados1984@gmail.com" <ados1...@gmail.com>
Subject Re: Use Cases for Structured Data
Date Thu, 13 Mar 2014 13:50:45 GMT
okies, thank you D, i will start playing around with the Sandbox version.

On Thu, Mar 13, 2014 at 5:55 AM, Dieter De Witte <drdwitte@gmail.com> wrote:

> Sandbox is just meant to be a learning environment, I guess, to see what's
> possible and how things can be connected. The real distribution will have much
> higher performance and is the one you need when you want to investigate
> performance issues. The only real drawback of the real distributions is
> that they take more time to get started with when you sometimes just want to
> play around.
> 2014-03-12 21:23 GMT+01:00 ados1984@gmail.com <ados1984@gmail.com>:
>> Hey D,
>> Regarding your point 5: "For a proof of concept I would use a ready-made
>> virtual machine from one of the three big vendors: cloudera, mapR or hortonworks"
>> I want to understand how this virtual setup would work, how many
>> master and slave nodes I can have in it, and in general
>> what the differences are between an actual Hadoop distribution and these
>> ready-made virtual setups.
>> Regards, Andy.
>> On Wed, Mar 12, 2014 at 4:02 PM, Dieter De Witte <drdwitte@gmail.com> wrote:
>>> Hi,
>>> 1) HDFS is just a file system; it hides the fact that it is distributed.
>>> 2) MapReduce is the most low-level analytics tool, I think: you just
>>> specify an input and define in your map and reduce functions how to
>>> deal with this input. No need for HBase etc., although they
>>> can be extremely useful.
>>> 3) this is all in the hadoop reference: first the namenode finds a place
>>> to allocate your data, then it gets copied to the corresponding datanode 1,
>>> and from datanode 1 it is copied to datanode 2 (note the numbers have no
>>> special meaning)
>>> 4) Your data will be on both datanodes. Why would that be a problem?
>>> 5) For a proof of concept I would use a ready-made virtual machine from
>>> one of the three big vendors: cloudera, mapR or hortonworks
>>> 6) The Apache version is more basic; the commercial distributions have more
>>> built-in features and are easier to work with, I guess.
>>> 7) You have to install them separately; maybe that is the main reason to go
>>> for one of the vendors?
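The write path described in points 3 and 4 can be sketched in a few lines of plain Python. This is only a toy model under stated assumptions, not the Hadoop API: the `NameNode`, `DataNode` and `write_file` names are hypothetical, the block size is tiny on purpose, and real HDFS placement is rack-aware rather than a simple rotation. It shows the namenode handing out target datanodes, the client sending each block to the first datanode only, and that datanode forwarding the block to the next one.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Hypothetical stand-in for an HDFS datanode: stores block replicas."""
    name: str
    blocks: dict = field(default_factory=dict)

    def receive(self, block_id, data, downstream):
        # Store the replica locally, then forward it along the pipeline.
        self.blocks[block_id] = data
        if downstream:
            downstream[0].receive(block_id, data, downstream[1:])


@dataclass
class NameNode:
    """Hypothetical stand-in for the namenode: keeps metadata, no file data."""
    datanodes: list
    replication: int = 2
    block_map: dict = field(default_factory=dict)

    def allocate(self, block_id):
        # Choose `replication` distinct datanodes. Real HDFS placement is
        # rack-aware; this sketch simply rotates through the list and
        # refuses a replication factor it cannot satisfy.
        n = len(self.datanodes)
        if self.replication > n:
            raise ValueError("replication factor exceeds number of datanodes")
        start = len(self.block_map) % n
        targets = [self.datanodes[(start + i) % n] for i in range(self.replication)]
        self.block_map[block_id] = [dn.name for dn in targets]
        return targets


def write_file(namenode, file_name, data, block_size=4):
    """Split data into blocks and push each block through its pipeline."""
    for offset in range(0, len(data), block_size):
        block_id = f"{file_name}#{offset // block_size}"
        pipeline = namenode.allocate(block_id)
        # The client sends the block to the first datanode only.
        pipeline[0].receive(block_id, data[offset:offset + block_size], pipeline[1:])


dn1, dn2 = DataNode("datanode1"), DataNode("datanode2")
nn = NameNode([dn1, dn2], replication=2)
write_file(nn, "sales.csv", b"id,amount\n1,10\n")
```

With two datanodes and a replication factor of 2, every block ends up in both `dn1.blocks` and `dn2.blocks`, which is the situation point 4 describes: no spare node is needed, each datanode simply holds a full copy.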
>>> You should definitely have a look at the reference. You don't have to
>>> read it from A to Z, but it contains sections where every single sentence will
>>> answer one of your questions.
>>> Regards, D
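To make point 2 concrete, here is a minimal single-process sketch of the map and reduce idea applied to structured records. It is an illustration only and does not use Hadoop: `map_record`, `reduce_region`, `run_job` and the "region,amount" record layout are all assumptions made up for this example.

```python
from collections import defaultdict


def map_record(line):
    """Map: parse one 'region,amount' input line into (key, value) pairs."""
    region, amount = line.split(",")
    yield region, int(amount)


def reduce_region(region, amounts):
    """Reduce: combine all values that share a key into one result."""
    return region, sum(amounts)


def run_job(lines, mapper, reducer):
    """Tiny driver: map every line, group values by key, reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


records = ["north,10", "south,5", "north,7", "south,1"]
totals = run_job(records, map_record, reduce_region)
```

Here `totals` holds one summed amount per region. In a real cluster the grouping step is the shuffle that the framework performs between the map and reduce phases, across many machines instead of one dictionary.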
>>> 2014-03-12 20:37 GMT+01:00 ados1984@gmail.com <ados1984@gmail.com>:
>>>> Thank you Shahab, but it would be really nice if I could get some input on
>>>> my initial question, as it would really help.
>>>> On Wed, Mar 12, 2014 at 3:11 PM, Shahab Yunus <shahab.yunus@gmail.com> wrote:
>>>>> I would suggest that, given the level of detail you are looking
>>>>> for and the fundamental nature of your questions, you should get hold of
>>>>> a book or online documentation. Basically some reading/research.
>>>>> The latest edition of
>>>>> http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520 is
>>>>> highly recommended to begin with.
>>>>> Regards,
>>>>> Shahab
>>>>> On Wed, Mar 12, 2014 at 3:07 PM, ados1984@gmail.com <
>>>>> ados1984@gmail.com> wrote:
>>>>>> Hello Team,
>>>>>> I am starting off on the Hadoop ecosystem and first wanted to learn,
>>>>>> based on my use case, whether Hadoop is the right tool for me.
>>>>>> I have only structured data, and my goal is to save this data into
>>>>>> Hadoop and take advantage of the replication factor. I am using Microsoft
>>>>>> for analysis; it provides me with good drag-and-drop functionality
>>>>>> for creating different kinds of analysis, and it also has Hadoop drivers,
>>>>>> so it can use Hadoop as a data source for analysis.
>>>>>> My question here is what benefits the YARN architecture gives me in terms
>>>>>> of analysis that my Microsoft, Netezza or Tableau products are not giving
>>>>>> me.
>>>>>> I am just trying to understand the value of introducing Hadoop into my
>>>>>> architecture in terms of analysis, apart from data replication. Any
>>>>>> input would be very helpful.
>>>>>> Also, my goal for the POC is related to efficient data storage/retrieval,
>>>>>> and so:
>>>>>>    1. How does data retrieval work in Hadoop?
>>>>>>    2. Do I always need some kind of data source on top of HDFS, like
>>>>>>    HBase/Cassandra/Mongo, or is there no need for one, so that I can have
>>>>>>    all my data stored in HDFS directly and retrieve it when I need it
>>>>>>    using different analytic tools that have HDFS as a data source?
>>>>>>    3. Say I have a 3-node cluster, one master and 2 slaves, and I am
>>>>>>    trying to insert data into Hadoop. What is the cycle that Hadoop
>>>>>>    performs to store my data in HDFS? Does my process read all metadata
>>>>>>    information from the master node about where my slave nodes are and
>>>>>>    what kind of data should go to which slave node, or is all data sent
>>>>>>    to the master node, which then, depending on the metadata information,
>>>>>>    decides what portion of the data should go to which node?
>>>>>>    4. Also, if I have a 3-node cluster with 1 master and 2 slaves, my data
>>>>>>    is equally distributed across the two nodes, and replication is set to
>>>>>>    2, then where and how will replication take place, as I do not have
>>>>>>    any node vacant for replication?
>>>>>>    5. Also, for the POC, does it make sense to go with the Cloudera 3-node
>>>>>>    free cluster or the Hortonworks 3-node free cluster, or does it make
>>>>>>    sense to go with the open-source Hadoop version? If we go with
>>>>>>    open-source Hadoop, where can we define which is the master node and
>>>>>>    which is a slave node, and can we have all 3 nodes on the same machine
>>>>>>    or do we need to have all 3 nodes on different machines?
>>>>>>    6. Also, what are the pros and cons of going with
>>>>>>    Hortonworks/Cloudera as opposed to Apache Hadoop from an initial POC
>>>>>>    point of view?
>>>>>>    7. Also, if we go with Hortonworks/Cloudera, which tools come bundled
>>>>>>    with the Hadoop framework, and if we go with Apache Hadoop, do we get
>>>>>>    any tools like Pig and Hive bundled, or do we have to install them
>>>>>>    separately?
>>>>>> Since I started on my Hadoop journey only recently, I would really
>>>>>> appreciate it if the community could point me in the right direction.
>>>>>> Regards, Andy.
