hive-dev mailing list archives

From "shanyu zhao (JIRA)" <>
Subject [jira] [Commented] (HIVE-7288) Enable support for -libjars and -archives in WebHcat for Streaming MapReduce jobs
Date Fri, 27 Jun 2014 00:18:26 GMT


shanyu zhao commented on HIVE-7288:

From reading the description and looking at the hadoop streaming documentation, I think we need
to add the following parameters to the mapreduce/streaming endpoint: libjars and archives.

FYI, this is the current list of parameters for the streaming endpoint:
@FormParam("output") String output,
@FormParam("mapper") String mapper,
@FormParam("reducer") String reducer,
@FormParam("combiner") String combiner,
@FormParam("file") List<String> fileList,
@FormParam("files") String files,
@FormParam("define") List<String> defines,
@FormParam("cmdenv") List<String> cmdenvs,
@FormParam("arg") List<String> args,
@FormParam("statusdir") String statusdir,
@FormParam("callback") String callback,
@FormParam("enablelog") boolean enablelog
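To make this concrete, the endpoint signature could gain two new parameters, e.g. @FormParam("libjars") String libjars and @FormParam("archives") String archives, analogous to the existing "files" parameter. Below is a minimal standalone sketch (hypothetical class and method names, not WebHCat's actual code) of how those two values could be translated into the corresponding hadoop-streaming generic options, mirroring the files -> -files mapping:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not WebHCat's actual implementation: translate the
// proposed "libjars" and "archives" form parameters into hadoop-streaming
// generic options, the same way the existing "files" parameter maps to -files.
public class StreamingArgsSketch {

    static List<String> buildArgs(String files, String libjars, String archives) {
        List<String> args = new ArrayList<>();
        // Generic options (-files, -libjars, -archives) must precede the
        // streaming-specific options (-mapper, -reducer, ...) on the command line.
        if (files != null && !files.isEmpty()) {
            args.add("-files");
            args.add(files);
        }
        if (libjars != null && !libjars.isEmpty()) {
            args.add("-libjars");
            args.add(libjars);
        }
        if (archives != null && !archives.isEmpty()) {
            args.add("-archives");
            args.add(archives);
        }
        return args;
    }

    public static void main(String[] argv) {
        // Example values taken from the use cases in this issue.
        List<String> args = buildArgs(
                null,
                "/path/to/custom-formats.jar",
                "wasb:///example/jars/r.jar");
        System.out.println(String.join(" ", args));
    }
}
```

A parameter left null or empty simply contributes nothing, so existing callers of the endpoint would be unaffected.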

> Enable support for -libjars and -archives in WebHcat for Streaming MapReduce jobs
> ---------------------------------------------------------------------------------
>                 Key: HIVE-7288
>                 URL:
>             Project: Hive
>          Issue Type: New Feature
>          Components: WebHCat
>    Affects Versions: 0.11.0, 0.12.0, 0.13.0, 0.13.1
>         Environment: HDInsight deploying HDP 2.1;  Also HDP 2.1 on Windows 
>            Reporter: Azim Uddin
>            Assignee: shanyu zhao
> Issue:
> ======
> Due to the lack of parameters equivalent to '-libjars' and '-archives' in the
> WebHcat REST API, we cannot use external Java JARs or archive files with a Streaming
> MapReduce job when the job is submitted via WebHcat/templeton.
> I am citing a few use cases here, but there can be plenty of scenarios like this-
> #1 
> (for -archives): In order to use R with a hadoop distribution like HDInsight or HDP on
> Windows, we could package the R directory up in a zip file, rename it to r.jar, and put
> it into HDFS or WASB. We can then do something like this from the hadoop command line
> (ignore the wasb syntax; the same command can be run with hdfs) -
> hadoop jar %HADOOP_HOME%\lib\hadoop-streaming.jar -archives wasb:///example/jars/r.jar
> -files "wasb:///example/apps/mapper.r,wasb:///example/apps/reducer.r"
> -mapper "./r.jar/bin/Rscript.exe mapper.r" -reducer "./r.jar/bin/Rscript.exe reducer.r"
> -input /example/data/gutenberg -output
> This works from the hadoop command line, but due to the lack of support for the
> '-archives' parameter in WebHcat, we can't submit the same streaming MR job via WebHcat.
> #2 (for -libjars):
> Consider a scenario where a user would like to use a custom InputFormat with a Streaming
> MapReduce job and has written their own custom InputFormat JAR. From the hadoop command
> line we can do something like this -
> hadoop jar /path/to/hadoop-streaming.jar \
>         -libjars /path/to/custom-formats.jar \
>         -D map.output.key.field.separator=, \
>         -D mapred.text.key.partitioner.options=-k1,1 \
>         -input my_data/ \
>         -output my_output/ \
>         -outputformat test.example.outputformat.DateFieldMultipleOutputFormat \
>         -mapper \
>         -reducer \
> But due to the lack of support for the '-libjars' parameter for streaming MapReduce jobs
> in WebHcat, we can't submit the above streaming MR job (that uses a custom Java JAR) via WebHcat.
> Impact:
> ========
> We think being able to submit jobs remotely is a vital feature for hadoop to be
> enterprise-ready, and WebHcat plays an important role there. Streaming MapReduce jobs are
> also very important for interoperability. So it would be very useful to keep WebHcat on
> par with the hadoop command line in terms of streaming MR job submission capability.
> Ask:
> ====
> Enable parameter support for 'libjars' and 'archives' in WebHcat for Hadoop streaming jobs.

This message was sent by Atlassian JIRA
