Date: Sun, 8 Nov 2015 23:21:54 -0500
Subject: Pointing Hive external table partition to multiple locations?
From: TJ Tech
To: user@hive.apache.org

Hi,

I need to process a few hundred thousand files (1-2 GB each) scattered across thousands of different directories.

I'd like to partition/group them based on my custom logic so I can benefit from partition pruning. Each partition will contain a few hundred files from hundreds of different directories.

Is this supported? According to the Hive Language Manual (DDL), a partition can point to only one location. If I add one partition for each file I plan to process, I'd end up with a few hundred or even thousands of partitions. I suspect this might result in hundreds to thousands of MR tasks in Hadoop.

I noticed that a feature was added to support pointing an external table to multiple locations listed in a symlink file: https://issues.apache.org/jira/browse/HIVE-1272 (for TextInputFormat only)

Is there a similar feature in the works for partitions? If so, would it support other formats (Avro, Parquet, etc.)?

Thanks
Tao
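
To make the grouping idea concrete, here is a minimal Python sketch of what I have in mind. It only builds strings: it groups file paths by a custom key, renders one symlink-style manifest (one path per line) per group, and produces an ALTER TABLE ... ADD PARTITION statement that points the partition at the directory holding that manifest. The table name, column name, paths, and the helper names (group_by_partition, manifest_text, add_partition_ddl, parent_dir_name) are all hypothetical. Whether Hive actually resolves a symlink manifest at the partition level, and for which input formats, is exactly the open question above, so treat that part as an unverified assumption.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_partition(paths: Iterable[str],
                       key_of: Callable[[str], str]) -> Dict[str, List[str]]:
    """Group file paths by a caller-supplied partition key (the 'custom logic')."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        groups[key_of(path)].append(path)
    # Sort each group so the generated manifests are deterministic.
    return {key: sorted(files) for key, files in groups.items()}


def manifest_text(paths: List[str]) -> str:
    """Render a symlink-style manifest: one target path per line."""
    return "\n".join(paths) + "\n"


def add_partition_ddl(table: str, column: str, key: str, manifest_dir: str) -> str:
    """Build the DDL that points one partition at the directory holding its manifest.

    ASSUMPTION: the table is declared with a symlink-aware input format
    (HIVE-1272) and Hive honors the manifest per partition; not verified here.
    """
    return (f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION ({column}='{key}') LOCATION '{manifest_dir}';")


def parent_dir_name(path: str) -> str:
    """Hypothetical grouping rule: use the name of the file's parent directory."""
    return path.rsplit("/", 2)[-2]


if __name__ == "__main__":
    # Hypothetical input paths, for illustration only.
    files = [
        "hdfs:///data/a/2015-11-01/f1.txt",
        "hdfs:///data/b/2015-11-01/f2.txt",
        "hdfs:///data/a/2015-11-02/f3.txt",
    ]
    for key, members in group_by_partition(files, parent_dir_name).items():
        print(add_partition_ddl("logs", "grp", key, f"hdfs:///manifests/grp={key}"))
        print(manifest_text(members), end="")
```

With this layout the number of partitions equals the number of groups rather than the number of files, which is the point of the exercise. The manifests would still have to be written to the partition directories by some other means, which the sketch does not do.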