hadoop-common-user mailing list archives

From "Stu Hood" <stuh...@mailtrust.com>
Subject RE: [core-user] Move application to Map/Reduce architecture with Hadoop
Date Sat, 15 Mar 2008 01:53:32 GMT
Hello Alfonso,

A Hadoop namenode with a reasonable amount of memory (4-8GB) should be able to handle a few
million files (but I can't find a reference to back that statement up: anyone?).

Unfortunately, the Hadoop jobtracker is not currently capable of handling jobs with that many
inputs. See https://issues.apache.org/jira/browse/HADOOP-2119

The only approach I know of currently (without modifying your input files) is the following,
but it has the nasty side-effect of losing all input locality for the tasks:

Do all of your processing in the Map tasks, and implement them a lot like the distcp tool
is implemented: the input for your Map task is a file containing a list of your files to be
processed, one per line. You then use the Hadoop Filesystem API to read and process the files,
and write your outputs. You would then set the number of Reducers for the job to 0, since
you won't have any key->value output from the Mappers.
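A minimal sketch of that pattern, assuming the old `org.apache.hadoop.mapred` API; the class name, output directory, and `processFile` helper are placeholders for illustration, not part of distcp or of Hadoop itself:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Map-only job: each input record is one line naming a file to process.
public class FileListMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private JobConf conf;

  public void configure(JobConf conf) {
    this.conf = conf;
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<NullWritable, NullWritable> out,
                  Reporter reporter) throws IOException {
    Path in = new Path(value.toString());
    FileSystem fs = in.getFileSystem(conf);
    // Hypothetical 1:1 output layout: one output file per input file.
    Path dest = new Path("/processed", in.getName());

    FSDataInputStream is = fs.open(in);
    FSDataOutputStream os = fs.create(dest);
    try {
      // Placeholder for your per-file processing logic.
      processFile(is, os);
    } finally {
      is.close();
      os.close();
    }
    // Nothing is collected: the Mapper writes its results directly via the
    // Filesystem API, which is why the job needs no Reducers.
  }

  private void processFile(FSDataInputStream is, FSDataOutputStream os)
      throws IOException {
    byte[] buf = new byte[4096];
    int n;
    while ((n = is.read(buf)) > 0) {
      os.write(buf, 0, n);  // identity copy as a stand-in for real work
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(FileListMapper.class);
    job.setJobName("per-file-processing");
    job.setMapperClass(FileListMapper.class);
    job.setNumReduceTasks(0);  // map-only, as described above
    // args[0]: the file listing input paths, one per line; args[1]: job output dir.
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    JobClient.runJob(job);
  }
}
```

Because the real inputs are opened by hand inside map(), the framework only sees the small list file, so the job avoids the many-inputs limit above, at the cost of the locality loss I mentioned.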


-----Original Message-----
From: Alfonso Olias Sanz <alfonso.olias.sanz@gmail.com>
Sent: Friday, March 14, 2008 8:05pm
To: core-user@hadoop.apache.org
Subject: [core-user] Move application to Map/Reduce architecture with Hadoop


I have just started using Hadoop and HDFS.  I have done the WordCount
test application, which takes some input files, processes them, and
generates an output file.

I have a similar application, that has a million input files and has
to produce a million output files. The correlation between
input/output is 1:1.

This process is suitable for a Map/Reduce approach, but I have
several doubts that I hope somebody can answer.

*** What should the Reduce function be? There is no merging of data
from the running map processes.

*** I set up a 2-node cluster for the WordCount test: one node acts as
master+slave, the other as slave. Before launching the job I copied the
files using $HADOOP_HOME/bin/hadoop dfs -copyFromLocal <sourceFiles>

These files were replicated on both nodes. Is there any way to avoid
the files being replicated to all the nodes and instead have them
distributed among the nodes, with no replication?

*** During the WordCount test, 25 map tasks were launched!  For our
application that is overkill. We have done several performance tests
without Hadoop and we have seen that we can launch 1 application per
core.  So is there any way to configure the cluster to launch a given
number of tasks per core? Then, depending on whether the PCs are
dual-core or quad-core, the number of running processes will vary.

Thanks in advance.
