hive-user mailing list archives

From Chris Roblee <chr...@unity3d.com>
Subject Re: [Hive] Slow Loading Data Process with Parquet over 30k Partitions
Date Fri, 17 Apr 2015 20:34:57 GMT
Hi Slava,

We would be interested in reviewing your patch.  Can you please provide more details?

Is there any other way to disable the partition creation step?
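
[Editor's note: for readers hitting the same issue, the workaround Slava describes below can be sketched in stock HiveQL without a custom flag, by writing files directly into the table's partition directory layout and registering partitions afterwards. A minimal sketch, assuming the `table_2` definition quoted below; the paths and partition values are illustrative:

```sql
-- Data is written to HDFS under the table's partition directory layout
-- (e.g. .../table_2/yearmonth=201504/partition2=abc/...), bypassing the
-- per-partition metastore calls during the insert. Afterwards, register
-- all discovered partitions in one pass:
MSCK REPAIR TABLE table_2;

-- Or register a single partition explicitly:
ALTER TABLE table_2 ADD IF NOT EXISTS
  PARTITION (yearmonth = 201504, partition2 = 'abc');
```

MSCK REPAIR TABLE over ~40k directories is not free either, but it moves the metastore cost out of the insert job and can be run once at the end.]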

Thanks,
Chris

On 4/13/15 10:59 PM, Slava Markeyev wrote:
> This is something I've encountered when doing ETL with Hive and having it create tens of
> thousands of partitions. The issue is that each partition needs to be added to the metastore,
> which is an expensive operation to perform. My workaround was adding a flag to Hive that
> optionally disables the metastore partition-creation step. This may not be a solution for
> everyone, since the table then has no partitions and you would have to run MSCK REPAIR, but
> depending on your use case you may just want the data in HDFS.
>
> If there is interest in having this be an option I'll make a ticket and submit the patch.
>
> -Slava
>
> On Mon, Apr 13, 2015 at 10:40 PM, Xu, Cheng A <cheng.a.xu@intel.com <mailto:cheng.a.xu@intel.com>> wrote:
>
>     Hi Tianqi,
>
>     Can you attach hive.log for more detailed information?
>
>     +Sergio
>
>     Yours,
>
>     Ferdinand Xu
>
>     *From:* Tianqi Tong [mailto:ttong@brightedge.com <mailto:ttong@brightedge.com>]
>     *Sent:* Friday, April 10, 2015 1:34 AM
>     *To:* user@hive.apache.org <mailto:user@hive.apache.org>
>     *Subject:* [Hive] Slow Loading Data Process with Parquet over 30k Partitions
>
>     Hello Hive,
>
>     I'm a developer using Hive to process TB-level data, and I'm having some difficulty
>     loading the data into a table.
>
>     I have 2 tables now:
>
>     -- table_1:
>     CREATE EXTERNAL TABLE `table_1` (
>       `keyword` string,
>       `domain` string,
>       `url` string
>     )
>     PARTITIONED BY (yearmonth INT, partition1 STRING)
>     STORED AS RCFILE;
>
>     -- table_2:
>     CREATE EXTERNAL TABLE `table_2` (
>       `keyword` string,
>       `domain` string,
>       `url` string
>     )
>     PARTITIONED BY (yearmonth INT, partition2 STRING)
>     STORED AS PARQUET;
>
>     I'm doing an INSERT OVERWRITE into table_2 from a SELECT on table_1 with dynamic
>     partitioning, and the number of partitions grows dramatically from 1500 to 40k
>     (because I want to partition on a different column).
>
>     The MapReduce job was fine.
>
>     Somehow the process got stuck at "Loading data to table default.table_2
>     (yearmonth=null, domain_prefix=null)", and I've been waiting for hours.
>
>     Is this expected when we have 40k partitions?
>
>     --------------------------------------------------------------
>     Refs - Here are the parameters that I used:
>
>     export HADOOP_HEAPSIZE=16384
>
>     set PARQUET_FILE_SIZE=268435456;
>     set parquet.block.size=268435456;
>     set dfs.blocksize=268435456;
>     set parquet.compression=SNAPPY;
>     SET hive.exec.dynamic.partition.mode=nonstrict;
>     SET hive.exec.max.dynamic.partitions=500000;
>     SET hive.exec.max.dynamic.partitions.pernode=50000;
>     SET hive.exec.max.created.files=1000000;
>
>     Thank you very much!
>
>     Tianqi Tong
>
>
>
>
> --
>
> Slava Markeyev | Engineering | Upsight
>
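
[Editor's note: for reference, the dynamic-partition insert described in the thread can be sketched as follows. Table and column names come from the quoted DDL; the expression producing `partition2` is not shown in the thread, so `domain` is used here purely as an illustrative stand-in:

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition columns are taken from the trailing SELECT expressions,
-- in declaration order, so every distinct (yearmonth, partition2) pair
-- in the input becomes its own partition:
INSERT OVERWRITE TABLE table_2 PARTITION (yearmonth, partition2)
SELECT keyword, domain, url,
       yearmonth,
       domain AS partition2  -- illustrative; any high-cardinality expression
FROM table_1;
```

Each distinct partition-value pair produces at least one output file plus a metastore registration, which is why the final "Loading data to table" step, not the MapReduce job, dominates at 40k partitions.]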

