hive-dev mailing list archives

From "Avram Aelony (JIRA)" <>
Subject [jira] Commented: (HIVE-951) Selectively include EXTERNAL TABLE source files via REGEX
Date Wed, 25 Nov 2009 19:04:39 GMT


Avram Aelony commented on HIVE-951:

Another consideration that motivates selecting files by regex is that different file types
within a bucket do not necessarily share the same field delimiter within a row.

I have seen cases where some file types are tab-delimited and others are comma-delimited,
and even cases where files of a given type are not columnar at all but consist of key-value
pairs, requiring a map column at Hive table-creation time.

Selecting files by regex would be quite powerful: it is flexible enough to read each
file type into its own table, one table per file type.

> Selectively include EXTERNAL TABLE source files via REGEX
> ---------------------------------------------------------
>                 Key: HIVE-951
>                 URL:
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Carl Steinbach
> CREATE EXTERNAL TABLE should allow users to cherry-pick files via regular expression.

> CREATE EXTERNAL TABLE was designed to allow users to access data that exists outside of Hive, and
> currently makes the assumption that all of the files located under the supplied path should be included
> in the new table. Users frequently encounter directories containing multiple
> datasets, or directories that contain data in heterogeneous schemas, and it's often
> impractical or impossible to adjust the layout of the directory to meet the requirements of
> CREATE EXTERNAL TABLE. A good example of this problem is creating an external table based
> on the contents of an S3 bucket. 
> One way to solve this problem is to extend the syntax of CREATE EXTERNAL TABLE
> as follows:
> ...
> LOCATION path [file_regex]
> ...
> For example:
> {code:sql}
> CREATE EXTERNAL TABLE mytable1 ( a string, b string, c string )
> LOCATION 's3://my.bucket/' 'folder/2009.*\.bz2$';
> {code}
> Creates mytable1, which includes all files in s3://my.bucket/ with a filename matching 'folder/2009*.bz2'
> {code:sql}
> CREATE EXTERNAL TABLE mytable2 ( d string, e int, f int, g int )
> LOCATION 'hdfs://data/' 'xyz.*2009.{4}\.bz2$';
> {code}
> Creates mytable2 including all files matching 'xyz*2009????.bz2' located under hdfs://data/
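
To make the intended selection behavior concrete, here is a minimal Python sketch of how a file_regex might filter the files under a LOCATION. It is not Hive code and not the proposed implementation. It assumes the regex is searched against each path relative to the location, which the issue does not specify. The select_files helper and the file names in the listing are hypothetical. The second pattern writes the glob-style '????' from the mytable2 example as '.{4}' (any four characters), which is an assumption about the intent.

```python
import re


def select_files(relative_paths, file_regex):
    """Return the paths (relative to LOCATION) that file_regex matches.

    Assumption: the pattern is searched anywhere in the relative path,
    so anchors such as '$' must be written explicitly in the pattern.
    """
    pattern = re.compile(file_regex)
    return [path for path in relative_paths if pattern.search(path)]


# Hypothetical directory listing, for illustration only.
listing = [
    "folder/2009-11-25.bz2",
    "folder/2008-01-01.bz2",
    "folder/2009-11-25.txt",
    "xyz_report_20091125.bz2",
    "xyz_report_2009.bz2",
]

# Files that would back mytable1 under the stated assumptions.
print(select_files(listing, r"folder/2009.*\.bz2$"))

# Files that would back mytable2; '.{4}' stands in for the glob '????'.
print(select_files(listing, r"xyz.*2009.{4}\.bz2$"))
```

Because each table names its own pattern, tab-delimited, comma-delimited, and key-value files in the same bucket could each be given a separate table definition with a matching row format.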

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
