hive-dev mailing list archives

From "Sahil Takiar (JIRA)" <>
Subject [jira] [Created] (HIVE-16295) Add support for using Hadoop's OutputCommitter
Date Fri, 24 Mar 2017 20:21:41 GMT
Sahil Takiar created HIVE-16295:

             Summary: Add support for using Hadoop's OutputCommitter
                 Key: HIVE-16295
             Project: Hive
          Issue Type: Sub-task
            Reporter: Sahil Takiar
            Assignee: Sahil Takiar

Hive does not integrate with Hadoop's {{OutputCommitter}}. Instead, it uses a {{NullOutputCommitter}}
and implements its own commit logic, spread across {{FileSinkOperator}}, {{MoveTask}}, and {{Hive}}.

The Hadoop community is building an {{OutputCommitter}} that integrates with S3Guard and performs
a safe, coordinated commit of data on S3 inside individual tasks. If Hive can integrate with
this new {{OutputCommitter}}, Hive-on-S3 would benefit in several ways:

* Data is only written once; directly committing data at a task level means no renames are necessary
* The commit is done safely, in a coordinated manner; duplicate tasks (from task retries or
speculative execution) should not step on each other
* Data is written within each task, so everything is done in parallel
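
The three properties above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the Hadoop or Hive implementation: ToyObjectStore, stage, commit_task, run_attempt and run_job are hypothetical names, the store is an in-memory dictionary rather than S3, and the "first attempt to commit wins" rule is one simple way to model coordination between duplicate attempts.

```python
import threading


class ToyObjectStore:
    """In-memory stand-in for an object store (hypothetical, not an S3 client)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.visible = {}              # final key -> bytes; what readers can see
        self._pending = {}             # (task_id, attempt_id) -> (key, bytes)
        self._committed_tasks = set()  # tasks whose output is already published

    def stage(self, task_id, attempt_id, key, data):
        """Record the data once, addressed to its final key, but not yet visible."""
        with self._lock:
            self._pending[(task_id, attempt_id)] = (key, data)

    def commit_task(self, task_id, attempt_id):
        """Publish one attempt's output. Returns False if the task already committed."""
        with self._lock:
            if task_id in self._committed_tasks:
                # A duplicate attempt lost the race: discard its staged data.
                self._pending.pop((task_id, attempt_id), None)
                return False
            key, data = self._pending.pop((task_id, attempt_id))
            self.visible[key] = data  # no rename: the data is published in place
            self._committed_tasks.add(task_id)
            return True


def run_attempt(store, task_id, attempt_id, results):
    """One task attempt: stage the output, then try to commit inside the task."""
    key = f"table/part-{task_id:05d}"
    data = f"rows from task {task_id}".encode()
    store.stage(task_id, attempt_id, key, data)
    results[(task_id, attempt_id)] = store.commit_task(task_id, attempt_id)


def run_job(num_tasks=3, attempts_per_task=2):
    """Run every attempt of every task in parallel, as retries or speculation would."""
    store = ToyObjectStore()
    results = {}
    threads = [
        threading.Thread(target=run_attempt, args=(store, task, attempt, results))
        for task in range(num_tasks)
        for attempt in range(attempts_per_task)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return store, results
```

Calling run_job(3, 2) starts six attempts for three tasks. Exactly one attempt per task reports a successful commit, and the store ends up with one visible object per task, with no separate move or rename step after the tasks finish.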
