hive-issues mailing list archives

From "Gopal V (JIRA)" <>
Subject [jira] [Commented] (HIVE-14535) add micromanaged tables to Hive (metastore keeps track of the files)
Date Thu, 20 Oct 2016 21:34:58 GMT


Gopal V commented on HIVE-14535:

> Do you think it would be reasonable to commit the changes to the FileSinkOperator without
> the rest of the MM tables support?

No. A direct output committer approach without query isolation has lost data for production
customers before, by accidentally forcing multiple tasks to write to the same file name. Because
of the way checksum-safety works, the first writer is not the winner under failure-tolerance.

We want to prevent users from making such expensive mistakes again, so this patch isolates
different queries from each other - without which you will stomp over files.
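
To illustrate the collision, here is a minimal sketch. It is not Hive's code: the
file-naming scheme, the query-ID prefix, and the helper names are assumptions made up
for this example. Two queries whose tasks write directly into the table directory
under the same task-derived name overwrite each other; tagging each file with its
query keeps the outputs apart.

```python
import os
import tempfile


def output_path(table_dir, task_id, query_id=None):
    """Return the file a task writes to directly inside the table directory."""
    name = f"{task_id:06d}_0"
    if query_id is not None:
        # Hypothetical isolation scheme: prefix the file name with the query ID.
        name = f"{query_id}_{name}"
    return os.path.join(table_dir, name)


def write_direct(path, rows):
    """Write rows straight to the final location (no scratch dir, no move)."""
    with open(path, "w") as f:
        f.write("\n".join(rows))


# Without isolation: task 0 of two different queries maps to the same file.
shared_dir = tempfile.mkdtemp()
write_direct(output_path(shared_dir, 0), ["rows from query 1"])
write_direct(output_path(shared_dir, 0), ["rows from query 2"])
unisolated_files = sorted(os.listdir(shared_dir))

# With isolation: each query gets its own file, so nothing is stomped on.
isolated_dir = tempfile.mkdtemp()
write_direct(output_path(isolated_dir, 0, "q1"), ["rows from query 1"])
write_direct(output_path(isolated_dir, 0, "q2"), ["rows from query 2"])
isolated_files = sorted(os.listdir(isolated_dir))
```

In the first case only one file survives and the output of one query is silently
gone; in the second case both files are present.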

> I know there are some concerns that this "direct output committer" approach could cause
> data corruption issues. Is this something that was considered explicitly in the design? If so,
> could you expand on why those data corruption issues would occur?

Without the isolation fix, the other parts are dangerous to use. 

With the isolation in place, the system moves away from the move model to a cleanup model
(the cleanup code already exists, it is just applied to the scratch dir today).
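
A rough sketch of what a cleanup model could look like, under stated assumptions:
ToyMetastore, its commit() method, and the file names are hypothetical and only stand
in for "the metastore keeps track of the files". Tasks write into the final directory,
a successful query registers its files, and cleanup deletes whatever was never
registered, instead of moving files into place on commit.

```python
import os
import tempfile


class ToyMetastore:
    """Hypothetical stand-in for a metastore that records committed files."""

    def __init__(self):
        self.committed = set()

    def commit(self, paths):
        # A query that finished successfully registers the files it wrote.
        self.committed.update(paths)


def write_file(directory, name, data):
    """Write directly into the final directory and return the full path."""
    path = os.path.join(directory, name)
    with open(path, "w") as f:
        f.write(data)
    return path


def cleanup(table_dir, metastore):
    """Delete files in table_dir that no committed query owns."""
    removed = []
    for name in sorted(os.listdir(table_dir)):
        path = os.path.join(table_dir, name)
        if path not in metastore.committed:
            os.remove(path)
            removed.append(name)
    return removed


table_dir = tempfile.mkdtemp()
store = ToyMetastore()

# Query q1 succeeds and commits its file.
store.commit([write_file(table_dir, "q1_000000_0", "good rows")])

# Query q2 fails after writing a partial file and never commits.
write_file(table_dir, "q2_000000_0", "partial rows")

removed = cleanup(table_dir, store)
remaining = sorted(os.listdir(table_dir))
```

This toy version ignores queries that are still running; a real cleanup would have to
avoid deleting files that belong to in-flight writers.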

> add micromanaged tables to Hive (metastore keeps track of the files)
> --------------------------------------------------------------------
>                 Key: HIVE-14535
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
> Design doc: 
> Feel free to comment.

This message was sent by Atlassian JIRA
