From Anu Engineer <aengin...@hortonworks.com>
Subject Re: Looking for documentation/guides on Hadoop 2.7.2
Date Thu, 09 Jun 2016 18:33:39 GMT
Hi Mike,

I am sorry your experience with setting up Hadoop has been frustrating and mysterious. I will try to give partial answers and pointers to where you should be looking. Please be patient with me.


Ø  After reading the book I had a first idea of how components work together, but the book didn't help me to understand what's going on

I generally recommend this book to anyone starting off with Hadoop, and IMHO it is the best
book for an overview of Hadoop.



Ø  All my knowledge about installing Hadoop right now is: unpacking a .tar.gz. I ran some shell scripts and everything was running fine

I would have presumed that the book tells you about the various components – HDFS, MapReduce, YARN, etc. If you are using the fourth edition, please look at Chapters 2, 3, and 4.



Ø  Furthermore, I'm missing all kinds of information about setting those up. The Apache guide at some point says "Now check that you can ssh to the localhost without a passphrase"

Thank you for the feedback. Hadoop relies on an underlying operating system – for example, Hadoop generally runs on top of Linux. We assume that you understand these underlying layers.

SSH is used extensively in the Linux world, and when you run into a problem like this, Google is your friend. I just typed SSH into Google, and the first link was https://en.wikipedia.org/wiki/Secure_Shell; its section on public/private keys (another occasion to reach out to your friend Google if you don't understand how that works) explains how passwordless logins work. I understand your frustration, but explaining SSH in our documentation would just frustrate most of our users.
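As to the WHY: the start-dfs.sh and start-yarn.sh helper scripts use ssh to launch the daemons on every node in the configuration – including localhost in the single-node case – so the login has to work without an interactive prompt. The usual sequence (essentially what the SingleCluster guide itself shows; key type and paths assume a stock OpenSSH setup) is:

  $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
  $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  $ chmod 0600 ~/.ssh/authorized_keys
  $ ssh localhost   # should now log in without prompting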



Ø  The following instructions are to run a MapReduce job locally. If you want to execute a job on YARN, see YARN on Single Node.

Please take a look at Chapter 2 of the "Definitive Guide", and for YARN please look at Chapter 4. They have excellent explanations of both. If you are saying that Apache's documentation is not as great as these external resources – yes, we are aware of that. Would you like to help us address that?
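To make the local-versus-YARN distinction concrete: the same example job from the guide runs either way; only the configuration differs. With a stock 2.7.2 tarball (the jar path below matches that version):

  $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'

Run as-is, it executes locally. Set mapreduce.framework.name to yarn in mapred-site.xml and start sbin/start-yarn.sh first, and the identical command is submitted to YARN instead.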



Ø  "Format the filesystem: $ bin/hdfs namenode -format". I have no clue how HDFS works internally.

Isn't that the beauty of file systems and databases in general – that you don't have to master the intricate details of B+ trees or query optimization? Since we are open source, we encourage people like you who would like to understand more to read the source.
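That said, the format step itself is not magic: it just initializes the NameNode's metadata directory (dfs.namenode.name.dir, which with a default configuration lands under /tmp/hadoop-<user>). You can look at what it created – the exact file names vary a little between versions:

  $ bin/hdfs namenode -format
  $ ls /tmp/hadoop-$USER/dfs/name/current/
  fsimage_0000000000000000000  fsimage_0000000000000000000.md5  seen_txid  VERSION

Those few files are the entire on-disk state of a freshly formatted namespace.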



Ø  For me a filesystem is where I can set up partitions hooked on folders. So how am I supposed to explain HDFS to someone else?

Not to offend you, but I feel that is a very limited world view of file systems. There are a large number of file systems beyond the ones that live on partitions. If you are asking why HDFS should be treated as a file system, the simplest answer is that it offers POSIX-like file system semantics – directories, files, permissions – so it looks and acts like a file system, even though it relaxes a few POSIX requirements in exchange for streaming throughput.
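One way to see this is that the HDFS shell mirrors ordinary Unix file operations (the log file name here is just an example):

  $ bin/hdfs dfs -mkdir -p /user/mike
  $ bin/hdfs dfs -put access.log /user/mike/
  $ bin/hdfs dfs -ls /user/mike
  $ bin/hdfs dfs -cat /user/mike/access.log | head

Whether the bytes live on one local partition or are chunked into blocks across a cluster is hidden behind that same interface.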



Ø   "How can this be called filesystem if you install it by unpacking a .tar.gz?" I simply
can't answer this question in any way.

Again, Google is your friend: https://en.wikipedia.org/wiki/Clustered_file_system#Distributed_file_systems
Please take a look at the different distributed file systems to get an expanded perspective on file systems.




> So I'm now looking for documentation/guides for:
> - Which requirements do I have?

That is very difficult to answer given the current lack of advancement in mind reading ☺ (just kidding). You have a problem to solve, and most problems can be broken down into a pattern offered by the Hadoop ecosystem.

You might have a big-data storage problem – HDFS might be your solution. You might want to run computations on top of it – MapReduce, Spark, etc. might be your solution. You might want a scalable key-value store – HBase might help you. You might want to run SQL-like queries – Hive might offer a solution. If you have a specific problem, you can post the question to the user group and someone will generally answer it.

-- Do I have to use a specific filesystem? If yes/no, why, or what would you recommend?
Any file system that can store a large number of files works well. We have seen HDFS run on top of all kinds of file systems – ext4, XFS, etc. In other words, use the physical file system you like, and if you run into any issues please report them here or in the dev group.

-- How should I partition my VM?
Generally, we do not recommend running Hadoop in VMs. The book you were referring to has a section, Part 3: Hadoop Operations, whose first few chapters deal with this. Or reach out to our friend Google and search for "hadoop cluster hardware" – that gives many links and recommendations. I would read the blogs from Cloudera and Hortonworks.

-- On which partition should I install which components?
Again, this is a very specific question that requires us to understand your cluster configuration. The general answer: keep your Hadoop binaries, conf, and logs separate from your data files. Protect your data directories with physical file system permissions, and run Hadoop under dedicated users like hdfs, yarn, etc.
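A sketch of what that separation might look like in hdfs-site.xml – the /data mount points are made up for illustration, but the property names are the real ones:

  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/1/dfs/nn</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///data/1/dfs/dn,file:///data/2/dfs/dn</value>
  </property>

The DataNode spreads its block files across every directory you list, so pointing each entry at a separate physical disk is the usual pattern.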

- Setting up a VM with Hadoop
Apache does not recommend that you run it under VMs. If you really want to do this, you might
want to look at documentation provided by virtualization/cloud providers like VMware, Windows
Azure or Amazon EMR.

- Configure Hadoop step by step
Please let us know what is challenging for you in the current set of instructions. Are you able to set up a single instance, then a pseudo-distributed instance, and then progress to a cluster setup?
I can sense a great deal of frustration, but I am not able to help you unless I know specifically what is bothering you.
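For what it's worth, the entire pseudo-distributed configuration in the SingleCluster guide boils down to one property in each of two files:

  etc/hadoop/core-site.xml:
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>

  etc/hadoop/hdfs-site.xml:
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

If that step works, the jump to a real cluster is mostly pointing fs.defaultFS at the NameNode host and raising dfs.replication.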

- Set up all kinds of daemons/nodes manually and explain where they are located (how they work) and how they should be configured
Go to your Hadoop directory, then into sbin, and read the source of start-dfs.sh or start-all.sh. That gives you pointers to which services get started; or run start-dfs.sh and then run jps (or sudo jps if the daemons run as a different user) to see the running services.
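On a pseudo-distributed node, after start-dfs.sh you would typically see something like this – the PIDs are made up, but the daemon names are what start-dfs.sh launches by default:

  $ sbin/start-dfs.sh
  $ jps
  12345 NameNode
  12446 DataNode
  12587 SecondaryNameNode
  12700 Jps

Each of those is a separate JVM; their logs end up under the logs/ directory of your Hadoop installation, which is another good place to look when something misbehaves.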


Thanks
Anu

From: Mike Wenzel <mwenzel@proheris.de>
Date: Thursday, June 9, 2016 at 2:15 AM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Looking for documentation/guides on Hadoop 2.7.2

Hey everyone. I started learning about Hadoop a few weeks ago. I got the task of understanding the Hadoop ecosystem and being able to answer some questions. First of all I read the book "O'Reilly – Hadoop: The Definitive Guide". After reading the book I had a first idea of how components work together, but the book didn't help me to understand what's going on. In my opinion the book mostly describes general, in-depth details about the various components. This didn't help me to understand the Hadoop ecosystem as a whole.

I started to work with it. I installed a VM (SUSE Leap 42.1) and followed the https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html guide.
After doing this I started to work with files on it. I wrote my first simple mapper and reducer, and I analyzed my Apache log for some testing. This worked well so far.

But let's get to my problems:
1) All my knowledge about installing Hadoop right now is: unpacking a .tar.gz. I ran some shell scripts and everything was running fine. Well, I have no clue at all which components are now installed on the VM, or where they are located and installed.

2) Furthermore, I'm missing all kinds of information about setting those up. The Apache guide at some point says "Now check that you can ssh to the localhost without a passphrase" and "If you cannot ssh to localhost without a passphrase, execute the following commands:". Well, I'd like to know what I am doing here?! I mean WHY do I need ssh running on localhost, and WHY does this have to be without a passphrase? What other ways of configuring this exist?

3) Same with the next point: "The following instructions are to run a MapReduce job locally. If you want to execute a job on YARN, see YARN on Single Node." "Format the filesystem: $ bin/hdfs namenode -format". I have no clue how HDFS works internally. For me a filesystem is where I can set up partitions hooked on folders. So how am I supposed to explain HDFS to someone else?
I understood the storing of data, splitting files into blocks, spreading files around the cluster, and storing metadata, but if someone asks me "How can this be called a filesystem if you install it by unpacking a .tar.gz?" I simply can't answer the question in any way.

So I'm now looking for documentation/guides for:
- Which requirements do I have?
-- Do I have to use a specific filesystem? If yes/no, why, or what would you recommend?
-- How should I partition my VM?
-- On which partition should I install which components?
- Setting up a VM with Hadoop
- Configure Hadoop step by step
- Set up all kinds of daemons/nodes manually and explain where they are located (how they work) and how they should be configured

I'm right now reading https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html but after a first read, this guide tells you what to write in which configuration file, but not why you should do it. I feel "left alone in the dark" after getting an idea of what Hadoop is. I hope some of you can show me some ways to get back on the road.
For me it's very important not just to write some configuration somewhere. I need to understand what's going on, because once I have a running cluster and everything, I need to be sure I can handle all this before going into productive use with it.

Best Regards
Mike