hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6249) Streaming task will not untar tgz uploaded with -archives
Date Tue, 10 Feb 2015 14:49:12 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14314254#comment-14314254 ]

Jason Lowe commented on MAPREDUCE-6249:
---------------------------------------

I should have clarified that /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/filecache/116/test.tgz
in the example above is actually a directory and not an archive.  Archives by default are
unpacked into a directory named after the archive name, although this can be renamed in the
task's working directory via the URI fragment (e.g.: '#test' in the example above).  So the
files in your example should be available via test/test.1.txt and test/test.2.txt from the
task's current working directory. 
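
A minimal local sketch of the layout described above may help show why `ls -l test` prints only a symlink. It does not run Hadoop: the `filecache/116` directory under a temporary folder is a hypothetical stand-in for the NodeManager local cache, and the `task` directory stands in for the task's working directory. The file names and contents are taken from the report.

```shell
#!/bin/sh
# Local simulation only: mimics the layout described above without Hadoop.
set -e
work=$(mktemp -d)
cd "$work"

# Build the archive the same way the reporter did.
echo "abcd" > test.1.txt
echo "efgh" > test.2.txt
tar zcf test.tgz test.1.txt test.2.txt
rm test.1.txt test.2.txt

# Stand-in for the local file cache (hypothetical path): the archive is
# unpacked into a DIRECTORY that carries the archive's name.
mkdir -p filecache/116/test.tgz
tar zxf test.tgz -C filecache/116/test.tgz

# Stand-in for the task working directory: the '#test' fragment
# shows up as a symlink named 'test' pointing at that directory.
mkdir task
cd task
ln -s "$work/filecache/116/test.tgz" test

ls -l test     # lists the symlink itself, as seen in the report
ls -l test/    # trailing slash follows the link and lists the unpacked files

# The files are readable through the link, relative to the working directory.
first=$(cat test/test.1.txt)
second=$(cat test/test.2.txt)
echo "$first $second"
```

Under the same assumption, changing the `ls -l test` line in mapper.sh to `ls -l test/` (or reading `test/test.1.txt` directly) should show the unpacked files rather than the link.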

> Streaming task will not untar tgz uploaded with -archives
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-6249
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6249
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: contrib/streaming
>    Affects Versions: 2.5.2
>         Environment: hadoop-2.5.2
> hadoop-streaming-2.5.2.jar
>            Reporter: Liu Xiao
>
> When writing a Hadoop streaming task, I used -archives to upload a tgz from the local machine
> to the HDFS task working directory, but it was not untarred as the documentation says. I've
> searched a lot without any luck.
> Here is the command that starts the Hadoop streaming task with hadoop-2.5.2:
> hadoop jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.5.2.jar \
>     -files mapper.sh \
>     -archives /home/hadoop/tmp/test.tgz#test \
>     -D mapreduce.job.maps=1 \
>     -D mapreduce.job.reduces=1 \
>     -input "/test/test.txt" \
>     -output "/res/" \
>     -mapper "sh mapper.sh" \
>     -reducer "cat"
> and the contents of "mapper.sh":
> cat > /dev/null
> ls -l test
> exit 0
> "test.tgz" contains two files, "test.1.txt" and "test.2.txt", created as follows:
> echo "abcd" > test.1.txt
> echo "efgh" > test.2.txt
> tar zcvf test.tgz test.1.txt test.2.txt
> The output from the above task:
> lrwxrwxrwx 1 hadoop hadoop     71 Feb  8 23:25 test -> /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/filecache/116/test.tgz
> but the desired output would be something like this:
> -rw-r--r-- 1 hadoop hadoop 5 Feb  8 23:25 test.1.txt
> -rw-r--r-- 1 hadoop hadoop 5 Feb  8 23:25 test.2.txt
> So, why has test.tgz not been untarred automatically as the documentation says, or is there
> actually another way to get the "tgz" untarred?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
