hadoop-mapreduce-user mailing list archives

From Mike Wenzel <mwen...@proheris.de>
Subject Looking for documentation/guides on Hadoop 2.7.2
Date Thu, 09 Jun 2016 09:15:27 GMT
Hey everyone. I started learning about Hadoop a few weeks ago. My task is to understand
the Hadoop ecosystem and to be able to answer some questions about it. I began by reading
the O'Reilly book "Hadoop: The Definitive Guide". After reading it I had a first idea
of how the components work together, but the book didn't help me understand what is
actually going on. In my opinion it mostly gives general, in-depth details about the various
components, and that didn't help me understand the Hadoop ecosystem.

So I started to work with it. I installed a VM (SUSE Leap 42.1) and followed the guide at
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
After that I started to work with files on it. I wrote my first simple mapper and reducer
and analyzed my Apache log as a test. This has worked well so far.
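
For readers who want something concrete, here is a minimal sketch of what such a mapper and reducer could look like. It is not the code from this message: it assumes the Apache common log format, counts requests per HTTP status code, and simulates the shuffle step locally in plain Python instead of running on Hadoop. The sample lines and the names mapper, reducer and run_job are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample lines in Apache "common" log format (not from the original message).
SAMPLE_LOG = [
    '127.0.0.1 - - [09/Jun/2016:09:15:27 +0000] "GET /index.html HTTP/1.1" 200 1043',
    '127.0.0.1 - - [09/Jun/2016:09:15:28 +0000] "GET /missing HTTP/1.1" 404 209',
    '10.0.0.5 - - [09/Jun/2016:09:15:29 +0000] "GET /index.html HTTP/1.1" 200 1043',
]


def mapper(line):
    """Emit (status_code, 1) for one log line; lines that do not parse emit nothing."""
    parts = line.split('"')
    if len(parts) < 3:
        return
    fields = parts[2].split()  # text after the quoted request: "<status> <bytes>"
    if fields and fields[0].isdigit():
        yield fields[0], 1


def reducer(key, values):
    """Sum all counts that were emitted for one status code."""
    return key, sum(values)


def run_job(lines):
    """Local stand-in for map -> shuffle (group by key) -> reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    print(run_job(SAMPLE_LOG))
```

On a real cluster the grouping done in run_job is performed by the framework between the map and reduce phases; only the mapper and reducer logic would be supplied by the user.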

But here are my problems:
1) All I know about installing Hadoop right now is: unpack a .tar.gz. I ran
some shell scripts and everything was running fine. But I have no clue which components
are now installed on the VM, or where they are located.

2) Furthermore, I'm missing all kinds of information about setting those up. At some point
the Apache guide says "Now check that you can ssh to the localhost without a passphrase" "If
you cannot ssh to localhost without a passphrase, execute the following commands:". Well,
I'd like to know what I am doing here. I mean, WHY do I need ssh running on localhost, and
WHY does this have to be without a passphrase? Which other ways of configuring this exist?

3) The same goes for the next point: "The following instructions are to run a MapReduce job locally.
If you want to execute a job on YARN, see YARN on Single Node." "Format the filesystem: $
bin/hdfs namenode -format". I have no clue how HDFS works internally. To me, a filesystem is
something where I can set up partitions mounted on folders. So how am I supposed to explain HDFS
to someone? I understand storing the data, splitting files into blocks, spreading files around
the cluster, and storing metadata, but if someone asks me: "How can this be called a filesystem
if you install it by unpacking a .tar.gz?" I simply can't answer this question in any way.
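
The block and metadata ideas mentioned in point 3 can be illustrated with a toy model. This is only a sketch under stated assumptions, not how HDFS is implemented: it uses the Hadoop 2.x defaults of 128 MB blocks and 3 replicas, and it places replicas round-robin, which is a deliberate simplification of the real placement policy. The function names and datanode names are hypothetical.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # default dfs.blocksize in Hadoop 2.x (128 MB)
REPLICATION = 3                 # default dfs.replication


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the lengths of the blocks a file of file_size bytes is cut into."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full, rest = divmod(file_size, block_size)
    # Only the last block may be shorter than block_size.
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(path, file_size, datanodes, replication=REPLICATION):
    """Build toy metadata: path -> list of (block_id, length, [datanodes]).

    Assumption: round-robin placement, chosen only to keep the sketch short.
    """
    replicas = min(replication, len(datanodes))
    blocks = []
    for i, length in enumerate(split_into_blocks(file_size)):
        holders = [datanodes[(i + r) % len(datanodes)] for r in range(replicas)]
        blocks.append((f"{path}#blk_{i}", length, holders))
    return {path: blocks}


if __name__ == "__main__":
    meta = place_blocks("/logs/access.log", 300 * 1024 * 1024,
                        ["dn1", "dn2", "dn3", "dn4"])
    for block_id, length, holders in meta["/logs/access.log"]:
        print(block_id, length, holders)
```

In this picture the mapping from path to blocks plays the role of the metadata kept by the NameNode, while each DataNode stores its blocks as ordinary files on the local filesystem of its machine. That is why no dedicated partition is involved: HDFS runs as user-level processes on top of the existing filesystem.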

So I'm now looking for documentation or a guide that covers:
- What are the requirements?
-- Do I have to use a specific filesystem? If yes/no, why, and what would you recommend?
-- How should I partition my VM?
-- On which partition should I install which components?
- Setting up a VM with Hadoop
- Configuring Hadoop step by step
- Setting up all kinds of daemons/nodes manually, with an explanation of where they are located,
how they work, and how they should be configured

I'm currently reading: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html
but after a first read, this guide tells you what to write in which configuration file,
but not why you should or shouldn't do it. I feel "left alone in the dark" after
getting an idea of what Hadoop is. I hope some of you can show me some ways to get back on
track.
For me it's very important not just to write some configuration somewhere. I need to understand
what's going on, because once I have a running cluster I need to be sure I can handle
all of this before putting it into production use.

Best Regards
