hive-issues mailing list archives

From "Sahil Takiar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-14269) Performance optimizations for data on S3
Date Thu, 03 Nov 2016 21:12:59 GMT

    [ https://issues.apache.org/jira/browse/HIVE-14269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634285#comment-15634285 ]

Sahil Takiar commented on HIVE-14269:
-------------------------------------

For those of you following this work: the original approach in this JIRA was to improve write
performance of Hive-on-S3 by staging all data on HDFS and then uploading it to S3 at the end
of the job. This approach was implemented in HIVE-14270, and after spending some time on a
full-scale benchmark, a few issues were uncovered.

It turns out that when data is staged on HDFS, it is uploaded to S3 inside the {{MoveTask}}.
The {{MoveTask}} does a serial upload of the data from HDFS to S3. This upload can be pretty
inefficient, since the current {{S3AOutputStream}} requires buffering all the data on the
local fs and then issuing an S3 upload request to actually send the data to S3. This means
all data uploaded to S3 must go through HS2, which is not a scalable solution.
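
The serial upload boils down to something like the following (a minimal sketch using the
Hadoop {{FileSystem}} API; the paths are hypothetical and error handling is omitted):

{code:java}
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SerialHdfsToS3Copy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path src = new Path("hdfs:///tmp/hive-staging");     // hypothetical staging dir
    Path dst = new Path("s3a://my-bucket/warehouse/t1"); // hypothetical table location

    FileSystem hdfs = src.getFileSystem(conf);
    FileSystem s3 = dst.getFileSystem(conf);

    // Each file is read from HDFS and written through the S3A output stream,
    // one at a time, inside a single JVM (HS2 in the scenario above).
    for (FileStatus stat : hdfs.listStatus(src)) {
      try (InputStream in = hdfs.open(stat.getPath());
           OutputStream out = s3.create(new Path(dst, stat.getPath().getName()))) {
        IOUtils.copyBytes(in, out, conf, false);
      }
    }
  }
}
{code}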

The {{MoveTask}} has some logic to trigger a distcp job if the amount of data to be copied
is large, but there are a number of issues with this approach. Hive has a few bugs in the
logic that determines whether the distcp job is needed (see HIVE-14864). Distcp itself also
has a number of issues, since each Map Task will rename files on S3 a few times. Furthermore,
using distcp would only be useful if it were modified to implement a "direct write" approach
(similar to what was suggested in HIVE-14271). After talking to a few members of the community
(see HIVE-14271, HIVE-14535, and SPARK-10063), it seems that the direct write approach has
a large number of problems in production environments.
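
For context on why the renames matter: S3 has no native rename, so a rename through the S3A
connector amounts to a server-side copy followed by a delete of the source object. A rough
sketch of the two underlying calls, using the AWS SDK for Java (bucket and key names are
hypothetical):

{code:java}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;

public class S3RenameSketch {
  public static void main(String[] args) {
    AmazonS3 s3 = new AmazonS3Client(); // uses the default credential chain

    // A "rename" of tmp/part-00000 to final/part-00000 is really two calls:
    // a server-side COPY of the object, then a DELETE of the source.
    s3.copyObject("my-bucket", "tmp/part-00000",
                  "my-bucket", "final/part-00000");
    s3.deleteObject("my-bucket", "tmp/part-00000");
  }
}
{code}

Since the copy takes time proportional to the object size, a few renames per Map Task add up
quickly on large outputs.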

To summarize, if all data from a query is staged in HDFS, there isn't a great way to get
the data from HDFS to S3.

Instead, an alternative approach would be as follows:
* Modify Hive so that all intermediate MR jobs write to HDFS, but the last MR job writes to
a scratch directory on S3
* HiveServer2 copies the data from the scratch directory on S3 to the final table location
on S3
** These copies just move data from one S3 location to another, so they are done server-side,
within S3
** HS2 just needs to issue COPY requests to S3 to get the data copied; each copy is essentially
just an HTTP request, so it should be pretty lightweight (see the sketch below)
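
To make the last step concrete, here is a minimal sketch of what HS2 could do with the AWS
SDK for Java (the bucket and prefixes are hypothetical, and multipart copy for objects over
5 GB is omitted):

{code:java}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class ScratchToFinalCopy {
  public static void main(String[] args) {
    AmazonS3 s3 = new AmazonS3Client();
    String bucket = "my-bucket";              // hypothetical
    String scratch = "hive-scratch/query-1/"; // hypothetical scratch prefix
    String table = "warehouse/t1/";           // hypothetical final table prefix

    // List everything the final MR job wrote under the scratch prefix and
    // issue one server-side COPY per object. No data flows through HS2;
    // each copy is a single HTTP request handled entirely inside S3.
    ObjectListing listing = s3.listObjects(bucket, scratch);
    while (true) {
      for (S3ObjectSummary obj : listing.getObjectSummaries()) {
        String dstKey = table + obj.getKey().substring(scratch.length());
        s3.copyObject(bucket, obj.getKey(), bucket, dstKey);
      }
      if (!listing.isTruncated()) {
        break;
      }
      listing = s3.listNextBatchOfObjects(listing);
    }
  }
}
{code}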

The hope is that this revised approach will offer better scalability when writing to S3.

Any thoughts / comments / suggestions / feedback on this are greatly appreciated.

> Performance optimizations for data on S3
> ----------------------------------------
>
>                 Key: HIVE-14269
>                 URL: https://issues.apache.org/jira/browse/HIVE-14269
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 2.1.0
>            Reporter: Sergio Peña
>            Assignee: Sergio Peña
>
> Working with tables that reside on Amazon S3 (or any other object store) has several
> performance impacts when reading or writing data, and also consistency issues.
> This JIRA is an umbrella task to monitor all the performance improvements that can be
> done in Hive to work better with S3 data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
