hadoop-common-user mailing list archives

From "Alfonso Olias Sanz" <alfonso.olias.s...@gmail.com>
Subject Re: [core-user] Move application to Map/Reduce architecture with Hadoop
Date Mon, 17 Mar 2008 09:48:56 GMT
Hi Stu

Thanks for the link.

Well, I should have been more precise. I do not think we will have 1
million files. What I wanted to say is that we need to set up the
cluster so that the data is distributed among all the data nodes
instead of replicated to every one of them. That's the first step; the
second is to think about fault tolerance/replication across the data nodes.
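
(A minimal sketch of the relevant setting, assuming site overrides go in
conf/hadoop-site.xml as in the 0.16-era docs; newer releases put it in
hdfs-site.xml:)

    <!-- conf/hadoop-site.xml: keep a single copy of each block, so data is
         spread across the datanodes rather than replicated to each of them -->
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>

Note that with replication set to 1, losing a datanode loses its blocks,
which is why the fault-tolerance step has to come next.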

Rui suggested having a look at distcp.
I will also have a look at the number of mappers per node.


On 15/03/2008, Stu Hood <stuhood@mailtrust.com> wrote:
> Hello Alfonso,
>  A Hadoop namenode with a reasonable amount of memory (4-8GB) should be able to handle a few million files (but I can't find a reference to back that statement up: anyone?).
>  Unfortunately, the Hadoop jobtracker is not currently capable of handling jobs with that many inputs. See https://issues.apache.org/jira/browse/HADOOP-2119
>  The only approach I know of currently (without modifying your input files) is the following, but it has the nasty side-effect of losing all input locality for the tasks:
>  Do all of your processing in the Map tasks, and implement them a lot like the distcp tool is implemented: the input for your Map task is a file containing a list of your files to be processed, one per line. You then use the Hadoop Filesystem API to read and process the files, and write your outputs. You would then set the number of Reducers for the job to 0, since you won't have any key->value output from the Mappers.
>  Thanks,
>  Stu
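
(A rough sketch of the map-only job described above, using the old
org.apache.hadoop.mapred API; the class names and the processFile step are
placeholders for illustration, not code from either poster, and exact method
names vary between Hadoop versions:)

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.lib.NullOutputFormat;

    public class FileListJob {

      // Each input record is one line of the driver file:
      // the path of one file to process.
      public static class FileListMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, NullWritable, NullWritable> {

        private FileSystem fs;

        public void configure(JobConf job) {
          try {
            fs = FileSystem.get(job);
          } catch (IOException e) {
            throw new RuntimeException(e);
          }
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<NullWritable, NullWritable> output,
                        Reporter reporter) throws IOException {
          Path in = new Path(value.toString());
          Path out = new Path(in.getParent(), in.getName() + ".out"); // 1:1 input/output
          FSDataInputStream is = fs.open(in);
          FSDataOutputStream os = fs.create(out);
          try {
            processFile(is, os);   // placeholder for the real per-file computation
          } finally {
            is.close();
            os.close();
          }
          reporter.setStatus("processed " + in);
        }

        private void processFile(FSDataInputStream is, FSDataOutputStream os)
            throws IOException {
          // ... application-specific work goes here ...
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(FileListJob.class);
        conf.setJobName("file-list-job");
        conf.setMapperClass(FileListMapper.class);
        conf.setNumReduceTasks(0);                      // map-only, as described above
        conf.setInputFormat(TextInputFormat.class);     // one file path per line
        conf.setOutputFormat(NullOutputFormat.class);   // outputs written directly via FileSystem
        FileInputFormat.setInputPaths(conf, new Path(args[0]));  // the list-of-files file
        JobClient.runJob(conf);
      }
    }

As noted above, this loses data locality: the task that reads a given line
may run on any node, far from the blocks of the file it names.
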
>  -----Original Message-----
>  From: Alfonso Olias Sanz <alfonso.olias.sanz@gmail.com>
>  Sent: Friday, March 14, 2008 8:05pm
>  To: core-user@hadoop.apache.org
>  Subject: [core-user] Move application to Map/Reduce architecture with Hadoop
>  Hi
>  I have just started using Hadoop and HDFS. I have done the WordCount
>  test application, which takes some input files, processes them, and
>  generates an output file.
> I have a similar application that has a million input files and has
>  to produce a million output files. The correlation between input and
>  output is 1:1.
> This process is suitable to run with a Map/Reduce approach, but I
>  have several doubts that I hope somebody can answer.
>  *** What should the Reduce function be? There is no merging of data
>  from the running map processes.
> *** I set up a 2-node cluster for the WordCount test. One node works as
>  master+slave, the other as slave. Before launching the job I copied the
>  files using $HADOOP_HOME/bin/hadoop dfs -copyFromLocal <sourceFiles>
>  <destination>
> These files were replicated on both nodes. Is there any way to avoid
>  the files being replicated to all the nodes and instead have them
>  distributed among the nodes, with no replication of files?
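
(One way to do this from the shell, as a sketch: the -D generic option and
the -setrep command are standard FsShell features, though exact behaviour
depends on the Hadoop version:)

    # copy with a single replica instead of the configured default
    bin/hadoop dfs -D dfs.replication=1 -copyFromLocal <sourceFiles> <destination>

    # or lower the replication of files already in HDFS
    bin/hadoop dfs -setrep -w 1 <destination>
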
> *** During the WordCount test, 25 map tasks were launched! For our
>  application that is overkill. We have done several performance tests
>  without using Hadoop, and we have seen that we can launch one
>  application per core. So is there any way to configure the cluster
>  to launch a given number of tasks per core, so that the number of
>  running processes differs between dual-core and quad-core PCs?
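
(The 25 map tasks come from the number of input splits; what can be capped
per node is the number of tasks that run concurrently on each tasktracker.
A sketch of that setting, assuming the classic property names in
conf/hadoop-site.xml, which may differ in other versions:)

    <!-- conf/hadoop-site.xml on each node: cap concurrent tasks per tasktracker,
         e.g. 2 for a dual-core machine, 4 for a quad-core machine -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>
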
> Thanks in advance.
>  alf.
