hive-issues mailing list archives

From "Sahil Takiar (JIRA)" <>
Subject [jira] [Commented] (HIVE-15121) Last MR job in Hive should be able to write to a different scratch directory
Date Fri, 04 Nov 2016 23:40:59 GMT


Sahil Takiar commented on HIVE-15121:

Cleaned up the patch, and tested it more thoroughly. Test failures should be resolved by the
next Hive QA run. Ready for review.


* The approach is to find all the places where the scratch dir is specified for the final MR /
Tez / Spark job, and to modify {{Context.getTempDirForPath}} to take an optional boolean
{{isFinalJob}} that indicates the scratch directory is being created for the final MR job
* Added a new config called {{}} that toggles
this behavior in case a user wants all intermediate data to be stored on HDFS
* Modified a few invocations of {{getTempDirForPath}} in {{SemanticAnalyzer.genFileSinkPlan}};
this method creates the final {{FileSinkDesc}} for the job
* Tested locally against an S3 bucket; the explain output of a Hive query with two MR jobs shows
that the first one writes to a local file, and the second writes to S3
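
To make the approach above concrete, here is a minimal, self-contained sketch of the idea. It is not the code from the patch and does not use Hive's real {{Context}} class: the class {{ScratchDirSketch}}, the field {{finalJobUsesDestFs}} (a stand-in for the new config toggle), and the {{.staging-sketch}} directory name are all hypothetical, chosen only for illustration.

```java
import java.util.Objects;

/**
 * Hypothetical sketch only; this is NOT Hive's Context class.
 * It illustrates an optional isFinalJob flag on a scratch-dir lookup.
 */
final class ScratchDirSketch {
    // Assumed scratch location for intermediate jobs (e.g. on HDFS).
    private final String intermediateScratchDir;
    // Assumed stand-in for the new config toggle described above.
    private final boolean finalJobUsesDestFs;

    ScratchDirSketch(String intermediateScratchDir, boolean finalJobUsesDestFs) {
        this.intermediateScratchDir = Objects.requireNonNull(intermediateScratchDir);
        this.finalJobUsesDestFs = finalJobUsesDestFs;
    }

    /** Existing call sites keep working: they are treated as intermediate jobs. */
    String getTempDirForPath(String destPath) {
        return getTempDirForPath(destPath, false);
    }

    /**
     * Returns a staging dir next to the destination only when the caller is the
     * final job and the toggle is on; otherwise returns the intermediate scratch dir.
     */
    String getTempDirForPath(String destPath, boolean isFinalJob) {
        Objects.requireNonNull(destPath);
        if (isFinalJob && finalJobUsesDestFs) {
            return stripTrailingSlash(destPath) + "/.staging-sketch";
        }
        return intermediateScratchDir;
    }

    private static String stripTrailingSlash(String path) {
        return path.endsWith("/") ? path.substring(0, path.length() - 1) : path;
    }
}
```

In this sketch, only call sites that build the final sink (the analogue of {{SemanticAnalyzer.genFileSinkPlan}}) would pass {{isFinalJob = true}}; every other caller keeps the one-argument form, so intermediate data stays on the intermediate scratch dir. Turning the toggle off sends the final job's scratch data there as well.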

Will do some more local validation, and write some unit tests + qtests.

> Last MR job in Hive should be able to write to a different scratch directory
> ----------------------------------------------------------------------------
>                 Key: HIVE-15121
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Hive
>            Reporter: Sahil Takiar
>         Attachments: HIVE-15121.WIP.1.patch, HIVE-15121.WIP.2.patch, HIVE-15121.WIP.patch,
> Hive should be able to configure all intermediate MR jobs to write to HDFS, but the final
MR job to write to S3.
> This will be useful for implementing parallel renames on S3. The idea is that for a multi-job
query, all intermediate MR jobs write to HDFS, and then the final job writes to S3. Writing
to HDFS should be faster than writing to S3, so it makes more sense to write intermediate
data to HDFS.

This message was sent by Atlassian JIRA