hive-dev mailing list archives

From "Elliot West (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-12860) Add WITH HEADER option to INSERT OVERWRITE DIRECTORY
Date Wed, 13 Jan 2016 10:33:39 GMT
Elliot West created HIVE-12860:
----------------------------------

             Summary: Add WITH HEADER option to INSERT OVERWRITE DIRECTORY
                 Key: HIVE-12860
                 URL: https://issues.apache.org/jira/browse/HIVE-12860
             Project: Hive
          Issue Type: New Feature
          Components: Hive
            Reporter: Elliot West
            Assignee: Elliot West


_As a Hive user_
_I'd like the option to seamlessly write out a header row to file-system-based result sets_
_So that I can generate reports whose specification mandates a header row._

h4. Motivations
There is a significant use-case where Hive is used to construct a scheduled data processing
pipeline that generates a report in HDFS for consumption by some third party (internal or
external). This report may then be transferred out of the system for consumption by other
tools or processes. It is not uncommon for the third party to specify that the report includes
a header row at the start of the file. The current options for adding headers are difficult
to use effectively and elegantly.

h4. Acceptance criteria
* {{INSERT OVERWRITE DIRECTORY}} commands can be invoked with an option to include a header
row at the start of the result set file.
* The header row will contain the column names derived from the accompanying {{SELECT}} query.
* It will likely be the case that multiple tasks will be writing the final file of the query
result set. In this event only the task writing the first chunk of the file should emit the
header row.

h4. Proposed HQL changes
{code}
1.  INSERT OVERWRITE [LOCAL] DIRECTORY directory1
2.    [ROW FORMAT row_format] [STORED AS file_format]
3.    [WITH HEADER]
4.    SELECT ... FROM ...
{code}
It is proposed that the {{WITH HEADER}} stanza at line 3 be introduced to enable this feature.
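
As an illustration only, an invocation of the proposed syntax might look like the sketch below. {{WITH HEADER}} is the option proposed in this issue and is not existing Hive syntax; the directory path, table name and column names are hypothetical.
{code}
-- Hypothetical: WITH HEADER is the option proposed in this issue, not existing Hive syntax.
-- The path '/reports/daily_sales' and the table sales_summary are made up for illustration.
INSERT OVERWRITE DIRECTORY '/reports/daily_sales'
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  WITH HEADER
SELECT region, total_sales
FROM sales_summary;
{code}
Under the acceptance criteria above, the first line of the resulting file would then carry the column names of the {{SELECT}} (here {{region}} and {{total_sales}}), followed by the data rows.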

h4. Current workarounds
* It is usually suggested that users set the CLI option {{hive.cli.print.header=true}} and
capture the result set from standard out. However, this does not work well in scheduled, headless
environments such as the Oozie Hive action. This can also push the file handling into shell
scripts and complicate the process of getting the report into HDFS.
* To keep report processing entirely within the domain of Hive, some users {{UNION}} the result
of their query with a tiny table of a single row containing the header names. A synthesised
rank column is used with an {{ORDER BY}} to ensure that the header is written to the very
start of the file. See [this example on Stack Overflow|http://stackoverflow.com/questions/15139561/adding-column-headers-to-hive-result-set/25214480#25214480].

h4. References
* HIVE-138: Original request for header functionality.
* [Hive Wiki: writing data into the file system from queries|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
