hive-dev mailing list archives

From "Roshan Naik (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-4196) Support for Streaming Partitions in Hive
Date Tue, 30 Apr 2013 02:36:16 GMT

     [ https://issues.apache.org/jira/browse/HIVE-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roshan Naik updated HIVE-4196:
------------------------------

    Attachment: HIVE-4196.v1.patch

Draft patch for review, based on the phase mentioned in the design doc. It deviates slightly:
1) Adds a couple of (temporary) REST calls to enable/disable streaming on a table. Later these
will be replaced with support in DDL.

2) All HTTP methods are GET for easy testing with a web browser

3) Authentication is disabled on the new streaming HTTP methods


Usage examples on a db named 'sdb' and a table named 'log':

1) *Set up db & table with a single partition column 'date':*
 hcat -e "create database sdb; use sdb; create table log(msg string, region string) partitioned by (date string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;"


2) *To check streaming status:*
 http://localhost:50111/templeton/v1/streaming/status?database=sdb&table=log

3) *Enable Streaming:*
 http://localhost:50111/templeton/v1/streaming/enable?database=sdb&table=log&col=date&value=1000

4) *Get Chunk File to write to:*
http://localhost:50111/templeton/v1/streaming/chunkget?database=sdb&table=log&schema=blah&format=blah&record_separator=blah&field_separator=blah

5) *Commit Chunk File:*
http://localhost:50111/templeton/v1/streaming/chunkcommit?database=sdb&table=log&chunkfile=/user/hive/streaming/tmp/sdb/log/2

6) *Abort Chunk File:*
http://localhost:50111/templeton/v1/streaming/chunkabort?database=sdb&table=log&chunkfile=/user/hive/streaming/tmp/sdb/log/3


7) *Roll Partition:*
http://localhost:50111/templeton/v1/streaming/partitionroll?database=sdb&table=log&partition_column=date&partition_value=3000
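The calls above can be assembled programmatically by a streaming client. A minimal sketch, assuming the Templeton host/port from the examples; the helper name and parameter values are illustrative, not from the patch:

```python
from urllib.parse import urlencode

# Base URL taken from the examples above; host/port are assumptions.
BASE = "http://localhost:50111/templeton/v1/streaming"

def streaming_url(action, **params):
    """Build the URL for one of the streaming GET endpoints."""
    return "%s/%s?%s" % (BASE, action, urlencode(params))

# Enable streaming on sdb.log, partitioned on 'date' starting at value 1000.
enable = streaming_url("enable", database="sdb", table="log",
                       col="date", value=1000)

# Ask for a chunk file to write to, then commit it once written.
get_chunk = streaming_url("chunkget", database="sdb", table="log")
commit = streaming_url("chunkcommit", database="sdb", table="log",
                       chunkfile="/user/hive/streaming/tmp/sdb/log/2")
```

Each URL would then be issued as a plain HTTP GET (e.g. with curl or a browser), since the patch exposes all methods as GET.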
                
> Support for Streaming Partitions in Hive
> ----------------------------------------
>
>                 Key: HIVE-4196
>                 URL: https://issues.apache.org/jira/browse/HIVE-4196
>             Project: Hive
>          Issue Type: New Feature
>          Components: Database/Schema, HCatalog
>    Affects Versions: 0.10.1
>            Reporter: Roshan Naik
>            Assignee: Roshan Naik
>         Attachments: HCatalogStreamingIngestFunctionalSpecificationandDesign.docx, HIVE-4196.v1.patch
>
>
> Motivation: Allow Hive users to immediately query data streaming in through clients such
as Flume.
> Currently Hive partitions must be created after all the data for the partition is available.
Thereafter, data in the partitions is considered immutable. 
> This proposal introduces the notion of a streaming partition into which new files can
be committed periodically and made available for queries before the partition is closed and
converted into a standard partition.
> The admin enables a streaming partition on a table using DDL, providing the following
pieces of information:
> - Name of the partition in the table on which streaming is enabled
> - Frequency at which the streaming partition should be closed and converted into a standard
partition.
> Tables with streaming partition enabled will be partitioned by one and only one column. It
is assumed that this column will contain a timestamp.
> Closing the current streaming partition converts it into a standard partition. Based
on the specified frequency, the current streaming partition  is closed and a new one created
for future writes. This is referred to as 'rolling the partition'.
> A streaming partition's life cycle is as follows:
>  - A new streaming partition is instantiated for writes
>  - Streaming clients request (via webhcat) for a HDFS file name into which they can write
a chunk of records for a specific table.
>  - Streaming clients write a chunk (via webhdfs) to that file and commit it (via webhcat).
Committing merely indicates that the chunk has been written completely and is ready for serving
queries.
>  - When the partition is rolled, all committed chunks are swept into a single directory
and a standard partition pointing to that directory is created. The streaming partition is
closed and a new streaming partition is created. Rolling the partition is atomic. Streaming
clients are agnostic of partition rolling.
>  - Hive queries will be able to query the partition that is currently open for streaming.
Only committed chunks will be visible. Read consistency will be ensured so that repeated reads
of the same partition will be idempotent for the lifespan of the query.
> Partition rolling requires an active agent/thread running to check when it is time to
roll and trigger the roll. This could be achieved either by using an external agent such
as Oozie (preferably) or an internal agent.
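The rolling check the agent would run can be sketched as a pure function; a minimal sketch only, since the function name and frequency handling are illustrative and not part of the patch:

```python
from datetime import datetime, timedelta

def should_roll(opened_at, now, frequency):
    """Return True when the current streaming partition is due to be
    closed and a new one opened, per the configured roll frequency."""
    return now - opened_at >= frequency

# An external agent (e.g. Oozie) would evaluate this periodically and,
# when it returns True, issue the partitionroll call shown above.
opened = datetime(2013, 4, 30, 2, 0)
freq = timedelta(hours=1)
```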

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
