hive-issues mailing list archives

From "Sahil Takiar (JIRA)" <>
Subject [jira] [Commented] (HIVE-15121) Last MR job in Hive should be able to write to a different scratch directory
Date Fri, 18 Nov 2016 06:51:59 GMT


Sahil Takiar commented on HIVE-15121:

[~spena] unfortunately HIVE-15226 doesn't help much for this patch. HIVE-15226 basically replaces
the blobstore URI with {{### test.blobstore.path ###}}, but this URI mainly occurs when listing
out file names; for example, when listing out the staging directory of an MR job. The problem
is that the staging directory values get replaced by {{QTestUtil}}. The {{QTestUtil}} class
matches {{.*.hive-staging.*}} and replaces it with {{#### A masked pattern was here ####}}.
It does this for a good reason: the staging directory typically has some non-deterministic
id in the file path.
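
A minimal sketch of the masking behavior described above, to show why the blobstore URI never reaches the golden file. This is not the actual {{QTestUtil}} code: the class name {{StagingMaskSketch}}, the method {{maskLine}}, and the sample paths are hypothetical, and it assumes a matching line is replaced in full by the marker.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; not the actual QTestUtil implementation.
final class StagingMaskSketch {
    // Pattern and marker text as quoted in the comment above.
    private static final Pattern STAGING = Pattern.compile(".*.hive-staging.*");
    private static final String MASK = "#### A masked pattern was here ####";

    // Assumption: a line that matches is replaced in full by the marker.
    static String maskLine(String line) {
        return STAGING.matcher(line).matches() ? MASK : line;
    }

    public static void main(String[] args) {
        // Made-up staging paths, for illustration only.
        String onHdfs = "hdfs://nn/tmp/.hive-staging_hive_1/-ext-10002";
        String onS3 = "s3a://bucket/tbl/.hive-staging_hive_2/-ext-10002";
        // Both lines collapse to the same marker, so the scheme is lost.
        System.out.println(maskLine(onHdfs).equals(maskLine(onS3))); // prints true
    }
}
```

Under these assumptions, a staging path on HDFS and one on S3 produce identical masked output.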

For this specific patch, the {{EXPLAIN EXTENDED}} outputs for a multi-MR job query end up
being exactly the same whether this optimization is enabled or disabled, mainly because
of the masking behavior described above.

One easy way to fix this would be to match on {{.*s3a:.*}} and replace it with {{### test.blobstore.path
###}}; {{QTestUtil}} already does this for {{.*hdfs:.*}} and {{.*file:.*}}.
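
A rough sketch of what this could look like, under stated assumptions. The class name {{BlobstoreMaskSketch}}, the method {{maskLine}}, and the sample paths are hypothetical. It assumes whole-line replacement and that the {{s3a}} rule is checked before the staging rule; the ordering is my assumption and would need to be confirmed against the real {{QTestUtil}} logic.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the proposed rule; not the actual QTestUtil code.
final class BlobstoreMaskSketch {
    private static final Pattern S3A = Pattern.compile(".*s3a:.*");
    private static final Pattern STAGING = Pattern.compile(".*.hive-staging.*");
    private static final String BLOBSTORE_MASK = "### test.blobstore.path ###";
    private static final String STAGING_MASK = "#### A masked pattern was here ####";

    // Assumption: the s3a rule runs first, so an s3a staging path keeps a
    // marker that differs from the one used for other staging paths.
    static String maskLine(String line) {
        if (S3A.matcher(line).matches()) {
            return BLOBSTORE_MASK;
        }
        if (STAGING.matcher(line).matches()) {
            return STAGING_MASK;
        }
        return line;
    }

    public static void main(String[] args) {
        // Made-up staging paths, for illustration only.
        System.out.println(maskLine("hdfs://nn/tmp/.hive-staging_hive_1/-ext-10002"));
        System.out.println(maskLine("s3a://bucket/tbl/.hive-staging_hive_2/-ext-10002"));
    }
}
```

With this ordering the two lines mask to different markers, so the {{EXPLAIN EXTENDED}} output would differ depending on whether the final job writes to S3.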

Let me know what you think of this approach; I can add the changes to this patch.

> Last MR job in Hive should be able to write to a different scratch directory
> ----------------------------------------------------------------------------
>                 Key: HIVE-15121
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Hive
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>         Attachments: HIVE-15121.1.patch, HIVE-15121.WIP.1.patch, HIVE-15121.WIP.2.patch, HIVE-15121.WIP.patch, HIVE-15121.patch
> Hive should be able to configure all intermediate MR jobs to write to HDFS, but the final
> MR job to write to S3.
> This will be useful for implementing parallel renames on S3. The idea is that for a multi-job
> query, all intermediate MR jobs write to HDFS, and then the final job writes to S3. Writing
> to HDFS should be faster than writing to S3, so it makes more sense to write intermediate
> data to HDFS.
