Mailing-List: contact user-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hive.apache.org
MIME-Version: 1.0
Date: Sun, 8 Nov 2015 23:21:54 -0500
Message-ID: 
 <CABeTNM-4=e6AUWaPbijKTRcrkDu-VDZ4Ki3Fw8vhnce_SxOEAA@mail.gmail.com>
Subject: Pointing Hive external table partition to multiple locations?
From: TJ Tech <tjtechweb@gmail.com>
To: user@hive.apache.org
Content-Type: multipart/alternative; boundary=001a11353aaaea1b42052413ee89

--001a11353aaaea1b42052413ee89
Content-Type: text/plain; charset=UTF-8

Hi,


I need to process a few hundred thousand files (1-2 GB each) scattered
across thousands of different directories.

I'd like to partition/group them based on my own custom logic so I can
benefit from partition pruning. Each partition will contain a few hundred
files drawn from hundreds of different directories.

Is this supported? According to the DDL section of the Hive Language
Manual, a partition can point to only one location. If I add one partition
for each file I plan to process, I'd end up with hundreds or even thousands
of partitions. I suspect this might result in hundreds to thousands of MR
tasks in Hadoop.
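
For concreteness, the one-location-per-partition approach would mean issuing
DDL along the lines of what the sketch below prints. This is only an
illustration: the table name `events`, the partition column `grp`, and the
HDFS paths are hypothetical placeholders, not names from my actual setup.

```python
def add_partition_statements(table, column, locations):
    """Build one ALTER TABLE ... ADD PARTITION statement per (value, location) pair."""
    statements = []
    for value, location in locations:
        statements.append(
            f"ALTER TABLE {table} ADD PARTITION ({column}='{value}') "
            f"LOCATION '{location}';"
        )
    return statements


# Hypothetical partition values and directories, for illustration only.
locations = [
    ("g0001", "hdfs:///data/dir_a"),
    ("g0002", "hdfs:///data/dir_b"),
]

for stmt in add_partition_statements("events", "grp", locations):
    print(stmt)
```

Each statement binds exactly one partition value to exactly one location,
which is why the number of partitions grows with the number of locations.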

I noticed there is a feature added to support pointing an external table to
multiple locations listed in a symlink file:
https://issues.apache.org/jira/browse/HIVE-1272 (for TextInputFormat only)
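
As I understand that feature, the table's location holds a text file that
lists the real data file paths, one per line, and the table is declared with
SymlinkTextInputFormat. A rough sketch of what I have in mind, assuming that
reading is right; the table name, column, and all paths are hypothetical:

```python
def build_symlink_manifest(paths):
    """Return manifest text: one target data file path per line."""
    return "".join(f"{path}\n" for path in paths)


# Hypothetical data files living in different directories.
data_files = [
    "hdfs:///data/dir_a/part-00000.txt",
    "hdfs:///data/dir_b/part-00017.txt",
]

# The manifest would be uploaded into the table's LOCATION directory.
manifest = build_symlink_manifest(data_files)

# DDL for an external table that reads through the manifest (text data only).
ddl = """CREATE EXTERNAL TABLE events_symlink (line STRING)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs:///tables/events_symlink';"""

print(manifest, end="")
print(ddl)
```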

Is a similar feature in the works for partitions? If so, would it support
other formats (Avro, Parquet, etc.)?


Thanks

Tao


--001a11353aaaea1b42052413ee89--