hive-dev mailing list archives

From "Avram Aelony (JIRA)" <>
Subject [jira] Commented: (HIVE-951) Selectively include EXTERNAL TABLE source files via REGEX
Date Wed, 25 Nov 2009 17:44:39 GMT


Avram Aelony commented on HIVE-951:

I think the filename can contain important information (e.g. a datestamp, or the name of the type of
data it represents) that would be useful to parse out and then group by.

Imagine a few years' worth of data where 4 or more filetypes (each filetype having
a different set of columns) are output to a bucket every day (e.g. 20091125_type_A.gz, 20091125_type_B.gz,
20091125_type_C.gz, 20091125_type_D.gz). In fact, each day can contain 20 or more large files
per filetype (e.g. 20091125_type_A_01.gz, 20091125_type_A_02.gz, 20091125_type_A_03.gz, ...,
20091125_type_A_20.gz, and likewise for B, C, D, etc.).

It would be nice to be able to parse out new variables for date, type, and type_number (e.g.
01, 02, ..., 20) and to compute various aggregated metrics via a group by on these
variables parsed from the filenames. Hopefully this parsing would not be too much of a
performance bottleneck (?).
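
As a rough illustration of the parsing and grouping described above, here is a minimal Python sketch. It is not Hive code: the regex, the `parse_filename` and `rows_per_date_and_type` helpers, and the per-file row counts are all hypothetical. Only the filename layout (date, type, optional two-digit number) comes from the examples above.

```python
import re
from collections import defaultdict

# Assumed filename layout, taken from the examples above:
#   <date>_type_<type>[_<number>].gz
FILENAME_RE = re.compile(
    r"^(?P<date>\d{8})_type_(?P<type>[A-Z])(?:_(?P<type_number>\d{2}))?\.gz$"
)


def parse_filename(name):
    """Return (date, type, type_number), or None if the name does not match."""
    m = FILENAME_RE.match(name)
    if m is None:
        return None
    return m.group("date"), m.group("type"), m.group("type_number")


def rows_per_date_and_type(file_row_counts):
    """Sum a per-file metric, grouped by the (date, type) parsed from each name."""
    totals = defaultdict(int)
    for name, rows in file_row_counts.items():
        parsed = parse_filename(name)
        if parsed is None:
            continue  # skip files that do not follow the naming scheme
        date, file_type, _ = parsed
        totals[(date, file_type)] += rows
    return dict(totals)


# Made-up row counts, purely for illustration.
example = {
    "20091125_type_A_01.gz": 100,
    "20091125_type_A_02.gz": 150,
    "20091125_type_B_01.gz": 70,
    "README.txt": 1,
}
print(rows_per_date_and_type(example))
```

In this sketch the regex runs once per file rather than once per row, so the parsing cost stays small relative to reading the data. Whether the same would hold inside Hive is an open question.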

So, I think there is a need both for a way to select the files in an S3 bucket that match
a regex, and for a way to capture filename information so that it is subsequently
available for parsing and grouping. It may be possible to meet both needs with one
feature, but I don't know enough about Hive/Hadoop internals to judge that myself.
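
To make the "both needs at once" idea concrete, here is a minimal sketch in which a single regex both filters a file listing and captures filename fields. It is plain Python over an in-memory list of keys, not Hive or S3 code. The `select_files` helper, the key names, and the pattern are hypothetical.

```python
import re


def select_files(keys, file_regex):
    """Yield (key, captured_fields) for every key the regex matches.

    Named groups in the regex become the captured fields, so the same
    pattern both filters the listing and extracts values to group by.
    """
    pattern = re.compile(file_regex)
    for key in keys:
        m = pattern.search(key)
        if m is not None:
            yield key, m.groupdict()


# Hypothetical bucket listing.
keys = [
    "folder/20091125_type_A_01.gz",
    "folder/20091125_type_B_01.gz",
    "folder/20081125_type_A_01.gz",
    "other/notes.txt",
]

# Keep only 2009 files of type A, capturing date and type_number.
regex = r"(?P<date>2009\d{4})_type_A_(?P<type_number>\d{2})\.gz$"
selected = list(select_files(keys, regex))
print(selected)
```

The captured fields could then be exposed as extra columns alongside each file's rows, which is the part that would need support from Hive itself.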

> Selectively include EXTERNAL TABLE source files via REGEX
> ---------------------------------------------------------
>                 Key: HIVE-951
>                 URL:
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Carl Steinbach
> CREATE EXTERNAL TABLE should allow users to cherry-pick files via regular expression.

> CREATE EXTERNAL TABLE was designed to allow users to access data that exists outside of Hive, and
> currently makes the assumption that all of the files located under the supplied path should be included
> in the new table. Users frequently encounter directories containing multiple
> datasets, or directories that contain data in heterogeneous schemas, and it's often
> impractical or impossible to adjust the layout of the directory to meet the requirements of
> CREATE EXTERNAL TABLE. A good example of this problem is creating an external table based
> on the contents of an S3 bucket.
> One way to solve this problem is to extend the syntax of CREATE EXTERNAL TABLE
> as follows:
> ...
> LOCATION path [file_regex]
> ...
> For example:
> {code:sql}
> CREATE EXTERNAL TABLE mytable1 ( a string, b string, c string )
> LOCATION 's3://my.bucket/' 'folder/2009.*\.bz2$';
> {code}
> Creates mytable1, which includes all files in s3://my.bucket/ with a filename matching 'folder/2009*.bz2'
> {code:sql}
> CREATE EXTERNAL TABLE mytable2 ( d string, e int, f int, g int )
> LOCATION 'hdfs://data/' 'xyz.*2009.{4}\.bz2$';
> {code}
> Creates mytable2 including all files matching 'xyz*2009????.bz2' located under hdfs://data/
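
Note that the descriptions above use glob-style wildcards (`*`, `?`), while the proposed `file_regex` argument is a regular expression, and the two notations are not interchangeable. The following minimal sketch assumes Python's `fnmatch` and `re` modules as stand-ins for whatever matcher Hive would use, and shows how the second glob maps onto a regex. The file names are made up.

```python
import fnmatch
import re

glob_pattern = "xyz*2009????.bz2"

# fnmatch.translate turns a shell-style glob into an equivalent regex string.
glob_as_regex = re.compile(fnmatch.translate(glob_pattern))

# A hand-written regex with the same intent: '*' becomes '.*',
# each '?' becomes '.', and the literal dot is escaped.
hand_written = re.compile(r"^xyz.*2009.{4}\.bz2$")

names = ["xyz_20091125.bz2", "xyz_2009112.bz2", "abc_20091125.bz2"]
for name in names:
    print(name, bool(glob_as_regex.match(name)), bool(hand_written.match(name)))
```

Only the first name matches under either pattern: the second has three characters after `2009` instead of four, and the third lacks the `xyz` prefix.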

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
