hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Namit Jain (JIRA)" <>
Subject [jira] Commented: (HIVE-837) virtual column support (filename) in hive
Date Thu, 17 Sep 2009 17:15:57 GMT


Namit Jain commented on HIVE-837:

yesterday, i was having a offline conversation with Raghu, and we were thinking that this
is similar to the concept of buckets that exists currently.
So, do we enhance the tablesample clause to include filenames also, and not expose it as a
virtual column at all ?

I think it is more intuitive to have filenames in the where clause - maybe, we should have
some virtual columns for buckets also and leave the current
syntax for buckets as is for backward compatibility.

File pruning is must - so, having the filename as udf might be more difficult. The udf filename()
will return the same value at compile time.
So, I would prefer virtual columns instead of filenames. 

SELECT * FROM weblogs DATAFILE ('log1.txt', 'log2.txt') WHERE col1='..' and col2= ...
would solve the pruning problem since the file names are part of the syntax, but how do you
propose to select the filename in that case ?

So, I think the original syntax:
select * FROM logs where FILENAME LIKE(WEB1*)
might be easier

> virtual column support (filename) in hive
> -----------------------------------------
>                 Key: HIVE-837
>                 URL:
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
> Copying from some mails:
> I am dumping files into a hive partion on five minute intervals. I am using LOAD DATA
into a partition.
> weblogs
> web1.00
> web1.05
> web1.10
> ...
> web2.00
> web2.05
> web1.10
> ....
> Things that would be useful..
> Select files from the folder with a regex or exact name
> select * FROM logs where FILENAME LIKE(WEB1*)
> select * FROM LOGS WHERE FILENAME=web2.00
> Also it would be nice to be able to select offsets in a file, this would make sense with
> select * from logs WHERE FILENAME=web2.00 FROMOFFSET=454644 [TOOFFSET=]
> select  
> substr(filename, 4, 7) as  class_A, 
> substr(filename,  8, 10) as class_B
> count( x ) as cnt
> from FOO
> group by
> substr(filename, 4, 7), 
> substr(filename,  8, 10) ;
> Hive should support virtual columns

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message