From: Sam William
To: user@hive.apache.org
Subject: Re: Ignore subdirectories when querying external table
Date: Fri, 19 Aug 2011 16:54:01 -0700
Message-Id: <0FD83F2E-DF59-48AA-82B0-79ACACDF162A@stumbleupon.com>

Along similar lines, I want to have Hive include subdirectories. That is, I have an external table partitioned by month (the data for each month sits under its own folder). Under the current month I want to keep adding folders daily. Is this possible without having to subclass InputFormat?

On Aug 19, 2011, at 1:22 PM, Dave wrote:

> I solved my own problem. For anyone who's curious:
>
> It turns out that subclassing an InputFormat allows one to override the listStatus method, which returns the list of files for Hive (or mapreduce in general) to process. All I had to do was subclass org.apache.hadoop.mapred.TextInputFormat and override the listStatus method and voila; I was able to make it ignore directories.
> Here's the Java code that I used:
>
> package com.example.mapreduce.input;
>
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
>
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapred.TextInputFormat;
>
> public class TextFileInputFormatIgnoreSubDir extends TextInputFormat {
>     // Drop directories from the listing so that only plain files become input.
>     @Override
>     protected FileStatus[] listStatus(JobConf job) throws IOException {
>         FileStatus[] files = super.listStatus(job);
>         List<FileStatus> newFiles = new ArrayList<FileStatus>();
>         for (FileStatus file : files) {
>             if (!file.isDir()) {
>                 newFiles.add(file);
>             }
>         }
>         return newFiles.toArray(new FileStatus[newFiles.size()]);
>     }
> }
>
> And the HiveQL code I used to define the table:
>
> CREATE EXTERNAL TABLE users (id STRING, user_name STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS INPUTFORMAT 'com.example.mapreduce.input.TextFileInputFormatIgnoreSubDir'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION '/data/test/users';
>
> Hope this saves someone else the trouble of figuring it out...
>
> -Dave
>
> On Thu, Aug 18, 2011 at 3:53 PM, Dave wrote:
> Hi,
>
> I have a partitioned external table in Hive, and in the partition directories there are other subdirectories that are not related to the table itself. Hive seems to want to scan those directories, as I am getting an error message when trying to do a SELECT on the table:
>
> Failed with exception java.io.IOException:java.io.IOException: Not a file: hdfs://path/to/partition/path/to/subdir
>
> Also, it seems to ignore directories prefixed by an underscore (_directory).
>
> I am using Hive 0.7.1 on Hadoop 0.20.2.
>
> Is there a way to force Hive to ignore all subdirectories in external tables and only look at files?
>
> Thanks in advance,
> -Dave

Sam William
sampd@stumbleupon.com

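For anyone who wants to compare the two listing policies discussed in this thread without a cluster, here is a minimal, self-contained Java sketch. It is not Hadoop or Hive code: Entry is a hypothetical stand-in for FileStatus, the in-memory map plays the role of HDFS, and the paths are made up for illustration. filesOnly mirrors the filtering done in the listStatus override quoted in this thread; filesRecursive shows the "include subdirs" behavior asked about, which a real InputFormat would have to implement against the FileSystem API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ListingPolicies {
    // Hypothetical stand-in for Hadoop's FileStatus: a path plus a directory flag.
    static final class Entry {
        final String path;
        final boolean dir;
        Entry(String path, boolean dir) { this.path = path; this.dir = dir; }
    }

    // "Ignore subdirectories": keep plain files, drop directories.
    static List<String> filesOnly(List<Entry> listing) {
        List<String> out = new ArrayList<>();
        for (Entry e : listing) {
            if (!e.dir) {
                out.add(e.path);
            }
        }
        return out;
    }

    // "Include subdirectories": descend into each directory and collect its files.
    // 'children' maps a directory path to its direct entries (a fake file system).
    static List<String> filesRecursive(String dir, Map<String, List<Entry>> children) {
        List<String> out = new ArrayList<>();
        for (Entry e : children.getOrDefault(dir, Collections.<Entry>emptyList())) {
            if (e.dir) {
                out.addAll(filesRecursive(e.path, children));
            } else {
                out.add(e.path);
            }
        }
        return out;
    }

    // Made-up layout: a month folder holding one file and one daily subfolder.
    static Map<String, List<Entry>> sampleTree() {
        Map<String, List<Entry>> tree = new HashMap<>();
        tree.put("/data/2011-08", Arrays.asList(
                new Entry("/data/2011-08/part-0", false),
                new Entry("/data/2011-08/19", true)));
        tree.put("/data/2011-08/19", Arrays.asList(
                new Entry("/data/2011-08/19/part-0", false)));
        return tree;
    }

    public static void main(String[] args) {
        Map<String, List<Entry>> tree = sampleTree();
        // prints [/data/2011-08/part-0]
        System.out.println(filesOnly(tree.get("/data/2011-08")));
        // prints [/data/2011-08/part-0, /data/2011-08/19/part-0]
        System.out.println(filesRecursive("/data/2011-08", tree));
    }
}
```

The only difference between the two policies is what happens when an entry is a directory: one skips it, the other recurses into it. Whether Hive can be configured to do the latter without a custom InputFormat is the open question in this message.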