hive-issues mailing list archives

From "Steve Loughran (JIRA)" <>
Subject [jira] [Commented] (HIVE-16295) Add support for using Hadoop's S3A OutputCommitter
Date Wed, 25 Apr 2018 17:04:00 GMT


Steve Loughran commented on HIVE-16295:

bq. is there a reason PathOutputCommitterFactory doesn't provide a way to construct a PathOutputCommitter
using a JobContext rather than a TaskAttemptContext

I think it's because the only places in Hadoop & Spark where committers were being constructed
with a JobContext alone were in the v1 committers, which these committers don't (currently) support.
It just kept things simpler all round not to have to worry about two similar-but-slightly-different
constructors.

bq. does the DirectoryOutputCommitter work with Spark SQL or just Spark? I'

It should work as a drop-in replacement for a normal Hadoop FileOutputCommitter; it's not being
clever the way the partitioned one is.

Regarding dynamic partitioning: the S3A committers do know which files they've created, and that
list goes in the manifest. If you load the _SUCCESS file and read that section,
you can infer which partitions were written. If that works, create a Hadoop JIRA "stabilize _SUCCESS format" and
we'll think about what we can say "will always be retained".

Or is this file being created too late in your workflow?
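
A rough sketch of what "load the _SUCCESS file and read that section" could look like, purely for illustration. It assumes the marker is JSON and that the list of created files lives under a "filenames" key; since the format is not yet stabilized, that key, the sample paths and both helper names are assumptions rather than a documented contract. The marker text is passed in as a string, so fetching it from S3 is left out.

```python
import json


def files_from_success_marker(text):
    """Return the file list recorded in a _SUCCESS marker.

    Assumption: the marker is JSON with a "filenames" list. A zero-byte
    marker (as written by the classic FileOutputCommitter) yields [].
    """
    if not text.strip():
        return []
    manifest = json.loads(text)
    return list(manifest.get("filenames", []))


def partitions_of(filenames):
    """Infer partition directories from key=value segments in each path."""
    partitions = set()
    for name in filenames:
        # Drop the final segment (the file itself), keep key=value directories.
        segments = [seg for seg in name.split("/")[:-1] if "=" in seg]
        if segments:
            partitions.add("/".join(segments))
    return sorted(partitions)


# Hypothetical marker content, not taken from a real job.
sample = json.dumps({
    "committer": "directory",
    "filenames": [
        "/warehouse/t/year=2018/month=04/part-0000.orc",
        "/warehouse/t/year=2018/month=05/part-0001.orc",
    ],
})

created_files = files_from_success_marker(sample)
created_partitions = partitions_of(created_files)
```

Treating an empty marker as "no manifest" keeps the same code path usable when a job ran with a committer that writes only the zero-byte file.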

> Add support for using Hadoop's S3A OutputCommitter
> --------------------------------------------------
>                 Key: HIVE-16295
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>         Attachments: HIVE-16295.1.WIP.patch, HIVE-16295.2.WIP.patch
> Hive doesn't have integration with Hadoop's {{OutputCommitter}}; it uses a {{NullOutputCommitter}}
> and its own commit logic spread across {{FileSinkOperator}}, {{MoveTask}}, and {{Hive}}.
> The Hadoop community is building an {{OutputCommitter}} that integrates with S3Guard
> and does a safe, coordinated commit of data on S3 inside individual tasks (HADOOP-13786). If
> Hive can integrate with this new {{OutputCommitter}} there would be a lot of benefits to Hive-on-S3:
> * Data is only written once; directly committing data at a task level means no renames
> are necessary
> * The commit is done safely, in a coordinated manner; duplicate tasks (from task retries
> or speculative execution) should not step on each other

This message was sent by Atlassian JIRA
