hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "HadoopIsNot" by SteveLoughran
Date Fri, 26 Oct 2012 17:30:55 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "HadoopIsNot" page has been changed by SteveLoughran:
http://wiki.apache.org/hadoop/HadoopIsNot?action=diff&rev1=8&rev2=9

  
  Hadoop stores data in files, and does not index them. If you want to find something, you
have to run a MapReduce job going through all the data. This takes time, and means that you
cannot directly use Hadoop as a substitute for a database. Where Hadoop works is where the
data is too big for a database (i.e. you have reached the technical limits, not just that
you don't want to pay for a database license). With very large datasets, the cost of regenerating
indexes is so high you can't easily index changing data. With many machines trying to write
to the database, you can't get locks on it. Here the idea of vaguely-related files in a distributed
filesystem can work.
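
For a flavour of what "finding something" means without an index, here is a minimal grep-style mapper
written against the org.apache.hadoop.mapreduce API: every line of the input has to be read and tested.
The class name and the grep.term configuration key are invented for this sketch, not part of any shipped example.

{{{#!java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Emits every input line containing a search term. With no index, the job
 *  has to read and check every record in the dataset to answer the query. */
public class GrepMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
  private String term;

  @Override
  protected void setup(Context context) {
    term = context.getConfiguration().get("grep.term", "");
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    if (line.toString().contains(term)) {
      context.write(offset, line);   // matched lines, keyed by byte offset
    }
  }
}
}}}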
  
- There is a project adding a column-table database on top of Hadoop - [[HBase]].
+ There is a high-performance column-table database that runs on top of Hadoop HDFS: Apache
[[HBase]]. This is a great place to keep the results extracted from your original data.
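
As a rough sketch of what that looks like with the HBase client API of this era (HTable, Put, Get):
the "results" table, the "d" column family, the "count" qualifier and the row key are all invented
for the example, and the table is assumed to already exist.

{{{#!java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class StoreExtractedResult {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "results");   // table with family "d" assumed to exist

    // Store one extracted value under a row key of your choosing.
    Put put = new Put(Bytes.toBytes("record-0001"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("count"), Bytes.toBytes("42"));
    table.put(put);

    // Random-access read by key -- the lookup that raw files in HDFS don't give you.
    Result r = table.get(new Get(Bytes.toBytes("record-0001")));
    System.out.println(Bytes.toString(
        r.getValue(Bytes.toBytes("d"), Bytes.toBytes("count"))));

    table.close();
  }
}
}}}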
  
  == MapReduce is not always the best algorithm ==
  
@@ -49, +49 @@

  
  This is important. If you don't know these, you are out of your depth and should not start
installing Hadoop until you have a couple of Linux systems up and running, can ssh into each
of them without entering a password, and have them know each other's hostnames, and so on.
The Hadoop installation documents all assume you can do these things, and aren't going to
explain them.
  
- == Hadoop Filesystem is not a substitute for a High Availability SAN-hosted FS ==
- 
- There are some very high-end filesystems out there: GPFS, Lustre, which offer fantastic
data availability and performance, usually by requiring high end hardware (SAN and infiniband
networking, RAID storage). Hadoop HDFS cheats, delivering high local data access rates by
running code near the data, instead of being fast at shipping the data remotely. Instead of
using RAID controllers, it uses non-RAIDed storage across multiple machines.
- 
- HDFS is not (currently) Highly Available. The Namenode is a [[SPOF]].  There is work underway
to fix this short-coming.  However, there is no realistic time frame as to when that work
will be available in a stable release.
- 
- Because of these limitations, if you want a  filesystem that is always available, HDFS is
not yet there. You can run Hadoop MapReduce over other filesystems, however.
  
  == HDFS is not a POSIX filesystem ==
  
- The Posix filesystem model has files that can appended too, seek() calls made, files locked.
Hadoop is only just adding (in July 2009) append() operations, and seek() operations throw
away a lot of performance. You cannot seamlessly map code that assumes that all filesystems
are Posix-compatible to HDFS.
+ The POSIX filesystem model has files that can be appended to, have seek() calls made on
them, and be locked. You cannot seamlessly map code that assumes that all filesystems are
POSIX-compatible to HDFS.
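
Code talks to HDFS through Hadoop's own FileSystem API rather than POSIX calls. The sketch below
shows a read with a seek(); the path is illustrative. Reads can seek (at a performance cost), while
random writes and file locking are not part of the model.

{{{#!java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeekingRead {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // HDFS is reached through this API, not through open()/lseek()/flock().
    FSDataInputStream in = fs.open(new Path("/data/example.txt"));
    in.seek(1024);                 // supported on reads, but costs performance
    byte[] buf = new byte[128];
    int n = in.read(buf);          // reads from the new position
    System.out.println("read " + n + " bytes after the seek");
    in.close();
  }
}
}}}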
  
