hive-user mailing list archives

From Avram Aelony <AvramAel...@eharmony.com>
Subject RE: adding filenames as new columns via Hive
Date Wed, 16 Sep 2009 18:23:45 GMT
Sounds great, Prasad.

As long as I can further parse the filename field to pull out new derived fields, I will
be happy... :)
For example, in a later query I'd like to be able to do something like:

select
  substr(filename, 4, 7)  as class_A,
  substr(filename, 8, 10) as class_B,
  count(x)                as cnt
from FOO
group by
  substr(filename, 4, 7),
  substr(filename, 8, 10);
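
To be concrete about the substr semantics I have in mind (1-based start position, with a
length as the third argument), a made-up filename would break apart like this; the name and
offsets below are purely illustrative:

  substr('log_CLASSA_CLASSB.gz', 5, 6)   -- yields 'CLASSA' (6 characters starting at position 5)
  substr('log_CLASSA_CLASSB.gz', 12, 6)  -- yields 'CLASSB' (6 characters starting at position 12)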


thanks,
-A


From: Prasad Chakka [mailto:pchakka@facebook.com]
Sent: Wednesday, September 16, 2009 11:10 AM
To: hive-user@hadoop.apache.org
Subject: Re: adding filenames as new columns via Hive

I think this could be a good feature, though I would like the filename to be a partition column
(one of possibly several) rather than a separate kind of column. Would that work?

create external table FOO (  <list of fields and types> )
partitioned by (file_name FILENAME)
row format delimited fields terminated by ','
stored as textfile location 's3://somebucket/';

Or a table partitioned by both datestamp and filename:

create external table FOO (  <list of fields and types> )
partitioned by (ds STRING, file_name FILENAME)
row format delimited fields terminated by ','
stored as textfile location 's3://somebucket/';


So FILENAME becomes a new type. I like this because partition columns are virtual columns,
just as the filename column would be, and do not exist alongside the data on disk.
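
To sketch how that would look on the query side (a sketch only, assuming the proposed FILENAME
partition column exists and using made-up partition values), the filename would behave like any
other partition column:

select ds, file_name, count(1) as cnt
from FOO
where ds = '2009-09-16'          -- ordinary partition pruning on the datestamp
  and file_name like 'log_%'     -- and, ideally, pruning on the file names as well
group by ds, file_name;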

Prasad
________________________________
From: Avram Aelony <AvramAelony@eharmony.com>
Reply-To: <hive-user@hadoop.apache.org>
Date: Wed, 16 Sep 2009 10:48:33 -0700
To: <hive-user@hadoop.apache.org>
Subject: adding filenames as new columns via Hive

Dear Hive list,

I am processing with Hive a large volume of files (many files, roughly 500M compressed) that
reside in an S3 bucket.  Although the files share the same schema, their individual filenames
carry useful information that is not otherwise captured and does not exist as a column within
each file's data.  As a general problem, I'd like to be able to add a new column via Hive that
contains the filename of each file read in from the bucket.

My Hive CREATE EXTERNAL TABLE command points to the S3 container bucket, and I am thinking
that at some point Hadoop or Hive must hold a file handle with the filenames, which could
perhaps be put to use.  My hope is that this information could be added in (upon request) via
Hive.  Perhaps this could be a new Hive feature request, if the capability does not already exist?

Ideally, the syntax would look something like this:

create external table FOO (  <list of fields and types> )
row format delimited fields terminated by ','
add_filename as 'filename'
stored as textfile location 's3://somebucket/';


Has anyone thought of this?  Is there a way to add a new column within Hive that contains
the filename?
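
The closest workaround I have come up with so far is an untested sketch (the names and bucket
layout below are made up, and it assumes each file can sit in its own subdirectory under the
bucket): declare the filename as an ordinary string partition column and register each file as
its own partition:

create external table FOO (  <list of fields and types> )
partitioned by (file_name STRING)
row format delimited fields terminated by ','
stored as textfile location 's3://somebucket/';

-- one statement per file, pointing at that file's own subdirectory
alter table FOO add partition (file_name = 'log_20090916_a.gz')
location 's3://somebucket/log_20090916_a/';

That requires one ADD PARTITION statement per file, though, which is why a built-in option
would be much nicer.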



Many thanks in advance!!
-Avram



Avram Aelony
Senior Analyst, Matching
eHarmony.com

