hive-user mailing list archives

From Nandakumar Ambilikumar <nandaku...@saavn.com>
Subject Create single file inside partition while inserting
Date Wed, 08 Mar 2017 11:23:46 GMT
Hi,
We are facing an issue with Hive jobs while inserting into tables: Hive creates one very large
file (a single file of hundreds of GB) per partition. On the older Hive (v0.11, AWS EMR 2.4.2)
we were able to get multiple files of ~200-300 MB each, but after upgrading to the new version
(Hive 2.1.1, emr-5.3.1) it writes a single file of hundreds of GB inside each partition. We
have tried the following Hive settings and still cannot figure out how to split the single
file into multiple smaller files. Please find below the settings we have tried.

1. Set the number of reducers explicitly
set hive.exec.dynamic.partition=true;
set mapred.reduce.tasks = 1000;
set hive.exec.max.dynamic.partitions.pernode=20000;
set hive.exec.max.dynamic.partitions=20000;
set hive.merge.mapredfiles=false;
set mapred.max.split.size=1048576;
set mapred.min.split.size=1048576;
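For context, the inserts we run follow this general shape; the table, column, and partition names below are placeholders, not our actual schema:

```sql
-- Illustrative only: a dynamic-partition insert of the kind described above,
-- run in the same session after applying the settings from item 1.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set mapred.reduce.tasks=1000;

INSERT OVERWRITE TABLE events_partitioned PARTITION (dt)
SELECT col1, col2, dt
FROM events_staging;
```

Even with mapred.reduce.tasks set, each partition still ends up as one large file.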

2. Compression using gzip
set hive.exec.orc.default.compress = gzip;
set mapreduce.input.fileinputformat.split.maxsize=1048576;
set mapreduce.input.fileinputformat.split.minsize=1048576;
set hive.exec.compress.output=true;
set hive.mapjoin.smalltable.filesize = 2500;
set mapred.output.compression.type=BLOCK;
set hive.merge.size.per.task=10;
set hive.merge.smallfiles.avgsize=10000;
set hive.merge.mapredfiles=false;
set hive.merge.mapfiles =true;

3. Compress with Snappy and try to split into 100 MB files
set hive.exec.compress.output=true;
set mapred.max.split.size=100000000;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=100000000;
set hive.merge.smallfiles.avgsize=100000000;

We have tried all of these options and none worked as desired. We also found one more Hive
setting, hive.hadoop.supports.splittable.combineinputformat, whose default value is false and
which was added in Hive 0.6. But we got an error (Query returned non-zero code: 1, cause: hive
configuration hive.hadoop.supports.splittable.combineinputformat does not exist.) when trying
it on emr 5.3.1. Could you please let us know whether there is a corresponding setting in the
newer Hadoop version, since the above one appears to be deprecated in Hadoop 2.x?
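If it helps in diagnosing, our understanding (which may be wrong) is that the error comes from Hive's configuration-name validation, which rejects any key the running build does not recognize. What we ran was simply:

```sql
-- On Hive 2.1.1 / emr-5.3.1 this fails with:
--   Query returned non-zero code: 1, cause: hive configuration
--   hive.hadoop.supports.splittable.combineinputformat does not exist.
set hive.hadoop.supports.splittable.combineinputformat=true;
```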

Apart from these settings, we also have the following global settings, which we set for all scripts.

set hive.execution.engine=tez;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.metastore.client.socket.timeout=2h;
set hive.support.sql11.reserved.keywords=false;
set hive.msck.path.validation=ignore;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions.pernode=20000;
set hive.exec.max.dynamic.partitions=20000;

Please have a look at the above settings and let us know how we can resolve the issue, or
suggest any other workaround. Any light on this issue would be greatly appreciated.
Thanks,
Nandan