hive-user mailing list archives

From Furcy Pin <pin.fu...@gmail.com>
Subject Re: Optimal approach for changing file format of a partitioned table
Date Mon, 06 Aug 2018 07:43:19 GMT
Hi Elliot,

From your description of the problem, I'm assuming that you are doing an
INSERT OVERWRITE table PARTITION(p1, p2) SELECT * FROM table

or something close, like a CREATE TABLE AS ... maybe.

If this is the case, I suspect that your shuffle phase comes from dynamic
partitioning, and in particular from this option (quote from the doc)

> hive.optimize.sort.dynamic.partition
>
>    - Default Value: true in Hive 0.13.0 and 0.13.1; false in Hive 0.14.0
>    and later (HIVE-8151 <https://issues.apache.org/jira/browse/HIVE-8151>)
>    - Added In: Hive 0.13.0 with HIVE-6455
>    <https://issues.apache.org/jira/browse/HIVE-6455>
>
> When enabled, dynamic partitioning column will be globally sorted. This
> way we can keep only one record writer open for each partition value in the
> reducer thereby reducing the memory pressure on reducers.


This option was added to avoid OOM exceptions during dynamically
partitioned inserts. However, it performs disastrously for table copy
operations, where a map phase alone should suffice: the global sort forces
a shuffle through the reducers. Disabling this option before running your
query should be enough.
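
Something along these lines is what I have in mind. It is only a sketch:
the table names source_json and target_orc are made up for illustration,
p1 and p2 stand in for your real partition columns, and I am assuming the
partition columns come last in the source table so that SELECT * lines up
with the PARTITION clause.

```sql
-- Turn off the sorted dynamic-partition optimization for this session only,
-- so the insert is not forced through a reduce phase for the global sort.
SET hive.optimize.sort.dynamic.partition=false;

-- Allow every partition column to be resolved dynamically.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical names: source_json (existing table), target_orc (new ORC
-- table with the same columns and the same partitioning scheme).
INSERT OVERWRITE TABLE target_orc PARTITION (p1, p2)
SELECT * FROM source_json;
```

Running EXPLAIN on the INSERT before and after changing the setting should
show whether the reduce stage has disappeared from the plan.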

Also, beware that reading from and inserting to the same partitioned table
may create deadlock issues: https://issues.apache.org/jira/browse/HIVE-12258

Regards,

Furcy


On Sat, 4 Aug 2018 at 13:28, Elliot West <teabot@gmail.com> wrote:

> Hi,
>
> I’m trying to simply change the format of a very large partitioned table
> from Json to ORC. I’m finding that it is unexpectedly resource intensive,
> primarily due to a shuffle phase with the partition key. I end up running
> out of disk space in what looks like a spill to disk in the reducers.
> However, the partitioning scheme is identical on both the source and the
> destination so my expectation is a map only job that simply re-encodes each
> file.
>
> I’m using INSERT OVERWRITE TABLE with dynamic partitioning. I suspect I
> could resolve my issue by allocating more storage to the task nodes.
> However, can anyone advise a more resource and time efficient approach?
>
> Cheers,
>
> Elliot.
>
