hive-user mailing list archives

From reveen joe <impdocs2...@gmail.com>
Subject Small files under SequenceFile table partition directories
Date Wed, 11 Nov 2015 05:26:29 GMT
Hi,

Most of our Hive tables are SequenceFile tables, and there are currently
many small files, ranging from *1-4 MB*, under the partition directories
(created by insert-overwrite). I assume this is due to two reasons:

1. Some of our tables are bucketed, so an individual file is created for
each bucket of data in a given partition.

2. Where we have set the number of reducers explicitly, the job produces
one file per reducer.

So, the dir structure looks like below.

/PATH/TO/TABLE/DIR/partition_column=2015-11-01//000000_0
/PATH/TO/TABLE/DIR/partition_column=2015-11-01//000001_0
/PATH/TO/TABLE/DIR/partition_column=2015-11-01//000002_0
/PATH/TO/TABLE/DIR/partition_column=2015-11-01//000003_0
...
/PATH/TO/TABLE/DIR/partition_column=2015-11-01//000379_0

The block size of the cluster is 128 MB. I know that a SequenceFile can
pack small files together by storing the file name as the key and the file
content as the value, but in this case each of these is an independent file.

Am I right that this adds overhead to any further processing of the data,
since each file would need to spin up a JVM to start a map task against
it, and also because of the disk I/O overhead?
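
To put rough numbers on this, here is a minimal back-of-the-envelope sketch.
It assumes the listing above is contiguous (000000_0 through 000379_0, so 380
files), that every file is uniformly 1 MB or 4 MB, and that a non-combining
input format schedules at least one map task per file. The function name and
its parameters are hypothetical, chosen only for illustration.

```python
import math

MB = 1024 * 1024


def map_task_estimate(num_files, file_size_bytes, block_size_bytes):
    """Return (tasks with one split per file, tasks if data were packed into blocks)."""
    # Assumption: without a combining input format, every file yields at
    # least one split, and therefore at least one map task.
    splits_per_file = max(1, math.ceil(file_size_bytes / block_size_bytes))
    per_file_tasks = num_files * splits_per_file

    # If the same bytes were packed into block-sized files, the number of
    # splits would be roughly total size divided by block size.
    total_bytes = num_files * file_size_bytes
    combined_tasks = max(1, math.ceil(total_bytes / block_size_bytes))
    return per_file_tasks, combined_tasks


num_files = 380            # 000000_0 .. 000379_0, assuming no gaps
block_size = 128 * MB      # cluster block size stated above

low = map_task_estimate(num_files, 1 * MB, block_size)
high = map_task_estimate(num_files, 4 * MB, block_size)

print("1 MB files:", low)   # (380, 3)
print("4 MB files:", high)  # (380, 12)
```

Under these assumptions the partition holds only about 380 MB to 1.5 GB,
which would fit in 3 to 12 block-sized files, yet it would be read by 380
map tasks.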

If so, what would be the best way to combine the small files under a
partition directory? Thank you.
