hive-dev mailing list archives

From Avram Aelony <>
Subject Filtering out files in a bucket (update on HIVE-951)
Date Mon, 24 Jan 2011 20:09:51 GMT


I really like the virtual column feature in 0.7 that allows me to request INPUT__FILE__NAME
and see the names of files that are being acted on.  

Because I can see which files are being read, I can tell that I am spending time querying many
very large files, most of which I do not need to process; these extra files sit in the same
S3 bucket location as the files I do need.

The files I do need to process represent only a subset of all the files in the bucket. Nevertheless,
the files I am interested in are quite large, large enough to make copying them to HDFS unwieldy.

Since I know, by name, which files I want to process before the scan of all files begins, can I be
more efficient and process only a selection of files from the bucket, avoiding those I don't need?
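For reference, a minimal sketch of what I am doing today, assuming a hypothetical table `events` whose location is the S3 bucket (the table and file-name pattern are made up for illustration). Filtering on the virtual column restricts which rows come back, but as far as I can tell Hive still scans every file in the location, which is exactly the cost I want to avoid:

```sql
-- Hive 0.7+ exposes the INPUT__FILE__NAME virtual column.
-- This prunes rows *after* the scan; it does not stop Hive from
-- reading the unwanted files in the same bucket location.
SELECT *
FROM events
WHERE INPUT__FILE__NAME LIKE '%events-2011-01-24%';
```

What I would like instead is a way to apply that kind of file-name predicate before the input splits are computed, so the unwanted files are never opened.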

I guess I am still looking for something along the lines of HIVE-951.

Any suggestions, or an update on HIVE-951?
