hadoop-common-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "HadoopIsNot" by SteveLoughran
Date Tue, 21 Jul 2009 11:12:59 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by SteveLoughran:
http://wiki.apache.org/hadoop/HadoopIsNot

The comment on the change is:
new page: Hadoop is Not

New page:
= What Hadoop is Not =

We see a lot of emails where people hear about Hadoop, and think it will be the silver bullet
to solve all their application/datacentre problems. It is not. It solves some specific problems
for some companies and organisations, but only after they have understood the technology and
where it is appropriate. If you start using Hadoop in the belief it is a drop-in replacement
for your database or SAN filesystem, you will be disappointed.


== Hadoop is not a substitute for a database ==

Databases are wonderful. Issue an SQL SELECT call against an indexed/tuned database and the
response comes back in milliseconds. Want to change that data? SQL UPDATE and the change is
in. Hadoop does not do this.

Hadoop stores data in files, and does not index them. If you want to find something, you have
to run a MapReduce job going through all the data. This takes time, and means that you cannot
directly use Hadoop as a substitute for a database. Where Hadoop works is where the data is
too big for a database (i.e. you have reached the technical limits, not just that you don't
want to pay for a database license). With very large datasets, the cost of regenerating indexes
is so high you can't easily index changing data. With many machines trying to write to the
database, you can't get locks on it. Here the idea of vaguely-related files in a distributed
filesystem can work.

There is a project adding a column-table database on top of Hadoop - ["HBase"].

== MapReduce is not always the best algorithm ==

MapReduce is a profound idea: taking a simple functional programming operation and applying
it, in parallel, to gigabytes or terabytes of data. But there is a price. For that parallelism,
you need to have each MR operation independent from all the others. If you need to know everything
that has gone before, you have a problem. Such problems can be aided by

* Iteration: run multiple MR jobs, with the output of one being the input to the next.
* Shared state information. ["HBase"] is an option to consider here, otherwise something like
memcache is an option.

Do not try to remember things in shared variables, as they are only remembered in a single
JVM, for the life of that JVM. That is the wrong way to work in a massively parallel environment.

== Hadoop and MapReduce is not a place to learn Java programming ==

There are currently a lot of assumptions in the Hadoop APIs and documentation, assumptions
that you know the basics of Java programming, and of the common error messages you get when
things don't work. If you do not know about classpaths, how to compile and debug Java code,
step back from Hadoop and learn a bit more about Java before proceeding.

== Hadoop is not an ideal place to learn networking error messages ==

You will find things work a lot easier if you are already familiar with networking and the
common error messages -for example, what "Connection Refused" means, and how is different
from "No Route to Host".

A lot of people post onto the user list with problems related to "connection refused", "No
Route to Host" and other common TCP-IP level errors. These are usually signs of an invalid
cluster configuration, some parts of the cluster not running, or machines not being able to
talk to each other on the LAN. People on the mailing list cannot debug your network configuration
for you, as it is your network, not theirs. They can point you at some of the tools and tests
to try, but since it will take a day for every email round trip, you won't find this a very
fast way to get help.

Nobody in the Hadoop team are deliberately trying to make things hard, its just that when
things do not work in a large distributed system, you get some interesting error messages.
If you can help improve those network messages or diagnostics, we would love to have that
code.


== Hadoop clusters are not a place to learn Unix/Linux system administration ==

You need to know your way round a Unix/Linux system. How to install it, what the various files
in /etc/ are for, how to set up networking, a good hosts table, debug DNS problems, why to
keep logs on a separate disk from the root disk, etc. If you cannot look after a single machine,
you aren't going to be able to handle a cluster of 80 of them. That said, don't try maintaining
those 80+ boxes using the same technique of hand-editing files lile ["/etc/hosts"], because
it doesn't scale.

== Hadoop Filesystem is not a substitute for a High Availability SAN-hosted FS ==

There are some very high-end filesystems out there: GPFS, Lustre, which offer fantastic data
availability and performance, usually by requiring high end hardware (SAN and infiniband networking,
RAID storage). Hadoop HDFS cheats, delivering high local data access rates by running code
near the data, instead of being fast at shipping the data remotely. Instead of using RAID
controllers, it uses non-RAIDed storage across multiple machines.

* It is not (currently) Highly Available. The Namenode is a ["SPOF"].
* It does not (currently) offer real security. It is probably less secure than Sun's original
NFS filesystem.

Because of these limitations, if you want a secure filesystem that is always available, HDFS
is not yet there. You can run Hadoop MapReduce over other filesystems, however.

== HDFS is not a Posix filesystem ==

The Posix filesystem model has files that can appended too, seek() calls made, files locked.
Hadoop is only just adding (in July 2009) append() operations, and seek() operations throw
away a lot of performance. You cannot seamlessly map code that assumes that all filesystems
are Posix-compatible to HDFS.

Mime
View raw message