hadoop-hdfs-user mailing list archives

From xeon Mailinglist <xeonmailingl...@gmail.com>
Subject Where is the temp output data of a map or reduce tasks
Date Thu, 11 Aug 2016 16:38:23 GMT
With MapReduce v2 (YARN), the output of a map or reduce task only appears
on the local disk or in HDFS once all the tasks have finished.

Since tasks end at different times, I was expecting the data to be written
as each task finishes. For example, task 0 finishes and its output is
written while task 1 and task 2 are still running. Then task 2 finishes and
its output is written while task 1 is still running. Finally, task 1
finishes and the last output is written. But this does not happen. The
outputs only appear on the local disk or in HDFS when all the tasks finish.

I want to access the task output as the data is being produced. Where is
the output data before all the tasks finish?


I have set these parameters in `mapred-site.xml`:


<property><name>mapreduce.task.files.preserve.failedtasks</name><value>true</value></property>

<property><name>mapreduce.task.files.preserve.filepattern</name><value>*</value></property>

I still can't find where the intermediate output or the final output is
saved as it is produced by the tasks.

I have listed all directories with `hdfs dfs -ls -R /`, and in the `tmp` dir
I have found only the job configuration files:

    drwx------   - root supergroup          0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002
    -rw-r--r--   1 root supergroup          0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/COMMIT_STARTED
    -rw-r--r--   1 root supergroup          0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/COMMIT_SUCCESS
    -rw-r--r--  10 root supergroup     112872 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.jar
    -rw-r--r--  10 root supergroup       6641 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.split
    -rw-r--r--   1 root supergroup        797 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.splitmetainfo
    -rw-r--r--   1 root supergroup      88675 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.xml
    -rw-r--r--   1 root supergroup     439848 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job_1470912033891_0002_1.jhist
    -rw-r--r--   1 root supergroup     105176 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job_1470912033891_0002_1_conf.xml

Where is the output saved? I am talking about the output that is stored
as it is being produced by the tasks, not the final output that appears
when all map or reduce tasks finish.
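
One place that a listing taken after the job has committed would not show is the `_temporary` directory under the job's own output path. The sketch below only builds the paths worth listing while the job is still running. It assumes the job uses the default `FileOutputCommitter` with commit algorithm version 1 and that the application attempt id is 1; the output path `/user/root/output` and the helper names are hypothetical. Map-side intermediate output is a separate matter: as far as I understand, it stays on the NodeManager's local disks (under the directories configured in `yarn.nodemanager.local-dirs`) and is never written to HDFS.

```python
import posixpath


def committer_paths(output_dir, app_attempt_id=1):
    """Build the HDFS paths where in-flight task output is expected.

    Assumes the default FileOutputCommitter layout (algorithm version 1):
      <output_dir>/_temporary/<app_attempt_id>/             committed tasks
      <output_dir>/_temporary/<app_attempt_id>/_temporary/  running attempts
    """
    job_attempt = posixpath.join(output_dir, "_temporary", str(app_attempt_id))
    return {
        "committed_tasks": job_attempt,
        "running_attempts": posixpath.join(job_attempt, "_temporary"),
    }


def ls_command(path):
    """Return the recursive listing command for one path as an argv list."""
    return ["hdfs", "dfs", "-ls", "-R", path]


if __name__ == "__main__":
    # "/user/root/output" is a hypothetical job output directory.
    for label, path in committer_paths("/user/root/output").items():
        print(label + ":", " ".join(ls_command(path)))
```

Running the printed commands while tasks are still executing should show per-task directories if the assumptions hold; once the job commits, `_temporary` is removed and only the final output remains.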
