hive-user mailing list archives

From Sam William <sa...@stumbleupon.com>
Subject Re: Ignore subdirectories when querying external table
Date Fri, 19 Aug 2011 23:54:01 GMT
On similar lines, I want to have Hive include subdirs. That is:

I have an external table partitioned by month (data for each month under a folder). Under
the current month I want to keep adding folders daily. Is this possible without having
to subclass InputFormat?




On Aug 19, 2011, at 1:22 PM, Dave wrote:

> I solved my own problem. For anyone who's curious:
> 
> It turns out that subclassing an InputFormat allows one to override the listStatus method,
> which returns the list of files for Hive (or MapReduce in general) to process. All I had to
> do was subclass org.apache.hadoop.mapred.TextInputFormat and override the listStatus method,
> and voila: I was able to make it ignore directories. Here's the Java code that I used:
> 
> package com.example.mapreduce.input;
> 
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
> 
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapred.TextInputFormat;
> 
> // Behaves like TextInputFormat, but skips any subdirectories in the input paths.
> public class TextFileInputFormatIgnoreSubDir extends TextInputFormat {
>     @Override
>     protected FileStatus[] listStatus(JobConf job) throws IOException {
>         // The parent listing may include subdirectories; keep only plain files.
>         List<FileStatus> files = new ArrayList<FileStatus>();
>         for (FileStatus file : super.listStatus(job)) {
>             if (!file.isDir()) {
>                 files.add(file);
>             }
>         }
>         return files.toArray(new FileStatus[files.size()]);
>     }
> }
> 
> And the HiveQL code I used to define the table:
> 
> CREATE EXTERNAL TABLE users (id STRING, user_name STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS INPUTFORMAT 'com.example.mapreduce.input.TextFileInputFormatIgnoreSubDir'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION '/data/test/users';
> 
> Hope this saves someone else the trouble of figuring it out...
> 
> -Dave
> 
> On Thu, Aug 18, 2011 at 3:53 PM, Dave <driver13@gmail.com> wrote:
> Hi,
> 
> I have a partitioned external table in Hive, and in the partition directories there are
> other subdirectories that are not related to the table itself. Hive seems to want to scan
> those directories, as I am getting an error message when trying to do a SELECT on the table:
> 
> Failed with exception java.io.IOException:java.io.IOException: Not a file: hdfs://path/to/partition/path/to/subdir
> 
> Also, it seems to ignore directories prefixed by an underscore (_directory).
> 
> I am using Hive 0.7.1 on Hadoop 0.20.2.
> 
> Is there a way to force Hive to ignore all subdirectories in external tables and only
> look at files?
> 
> Thanks in advance,
> -Dave
> 

Sam William
sampd@stumbleupon.com



