hive-issues mailing list archives

From "Sahil Takiar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-16295) Add support for using Hadoop's S3A OutputCommitter
Date Thu, 31 May 2018 19:30:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-16295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16497043#comment-16497043
] 

Sahil Takiar commented on HIVE-16295:
-------------------------------------

Attached an updated patch with a few more bug fixes. Did some testing on an actual cluster
with Hive-on-MR and HoS, and things seem to be working as expected. What's next? More integration
testing; I haven't done much scale or concurrency testing yet.

[~stevel@apache.org] there are some APIs that I'm using in this patch that are marked as private.
Any objections to using them?
* {{PathOutputCommitterFactory}}: using this class to instantiate the correct {{PathOutputCommitter}}
* {{InternalCommitterConstants.FS_S3A_COMMITTER_STAGING_UUID}}: using this to set a unique
id for the staging directory; since Hive's write logic isn't integrated with MapReduce / Spark,
we need to supply our own staging UUID
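
To illustrate the second point, here is a minimal sketch of supplying a staging UUID, not the code in the patch. It uses a plain {{Map}} as a stand-in for a Hadoop {{Configuration}} so that it runs without Hadoop on the classpath. The class {{StagingUuidSketch}} and the helper {{ensureStagingUuid}} are hypothetical names, and the property string is my assumption about the value behind {{FS_S3A_COMMITTER_STAGING_UUID}}; it should be checked against the Hadoop release in use. The sketch also assumes that the id is generated once and then reused, rather than regenerated on each call.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class StagingUuidSketch {
    // Assumed value of InternalCommitterConstants.FS_S3A_COMMITTER_STAGING_UUID;
    // verify against the Hadoop version being used.
    static final String STAGING_UUID_KEY = "fs.s3a.committer.staging.uuid";

    // Hypothetical helper: sets the staging UUID if it is missing and returns it.
    // The Map stands in for a Hadoop Configuration / JobConf.
    static String ensureStagingUuid(Map<String, String> conf) {
        return conf.computeIfAbsent(STAGING_UUID_KEY, k -> UUID.randomUUID().toString());
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        String first = ensureStagingUuid(conf);   // generates and stores a new id
        String second = ensureStagingUuid(conf);  // returns the stored id unchanged
        System.out.println(STAGING_UUID_KEY + " = " + first);
        System.out.println("stable across calls: " + first.equals(second));
    }
}
```

With a real {{Configuration}}, the equivalent would be a {{conf.get}} / {{conf.set}} pair on the same key.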

> Add support for using Hadoop's S3A OutputCommitter
> --------------------------------------------------
>
>                 Key: HIVE-16295
>                 URL: https://issues.apache.org/jira/browse/HIVE-16295
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>         Attachments: HIVE-16295.1.WIP.patch, HIVE-16295.2.WIP.patch, HIVE-16295.3.WIP.patch,
HIVE-16295.4.patch, HIVE-16295.5.patch, HIVE-16295.6.patch
>
>
> Hive doesn't integrate with Hadoop's {{OutputCommitter}}; it uses a {{NullOutputCommitter}}
and its own commit logic, spread across {{FileSinkOperator}}, {{MoveTask}}, and {{Hive}}.
> The Hadoop community is building an {{OutputCommitter}} that integrates with S3Guard
and does a safe, coordinated commit of data on S3 inside individual tasks (HADOOP-13786). If
Hive can integrate with this new {{OutputCommitter}}, there would be a lot of benefits to Hive-on-S3:
> * Data is only written once; directly committing data at a task level means no renames
are necessary
> * The commit is done safely, in a coordinated manner; duplicate tasks (from task retries
or speculative execution) should not step on each other



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
