Mailing-List: contact user-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hive.apache.org
MIME-Version: 1.0
Date: Tue, 10 Nov 2015 21:26:29 -0800
Message-ID: 
 <CAPukpS9aiLQpNy5qRd+tPD2-JF4zU=u7L5VAe0ocNqkqo6PUnA@mail.gmail.com>
Subject: Small files under SequenceFile table partition directories
From: reveen joe <impdocs2008@gmail.com>
To: user@hive.apache.org
Content-Type: multipart/alternative; boundary=001a114a6a7a8de38605243d11de

--001a114a6a7a8de38605243d11de
Content-Type: text/plain; charset=UTF-8

Hi,

Most of our Hive tables are SequenceFile tables, and there are currently
many small files, ranging from *1-4 MB*, under the partition directories
(created by insert-overwrite). I am assuming this is due to two reasons:

1. Some of our tables are Bucketed and so individual files are created for
each bucket of data for a given partition.

2. Where we have set the number of reducers, the job produces one file per
reducer.

So the directory structure looks like this:

/PATH/TO/TABLE/DIR/partition_column=2015-11-01//000000_0
/PATH/TO/TABLE/DIR/partition_column=2015-11-01//000001_0
/PATH/TO/TABLE/DIR/partition_column=2015-11-01//000002_0
/PATH/TO/TABLE/DIR/partition_column=2015-11-01//000003_0
............................................................................................
............................................................................................
............................................................................................
/PATH/TO/TABLE/DIR/partition_column=2015-11-01//000379_0

The block size of the cluster is 128 MB. I know that a SequenceFile can
pack small files together by storing the file name as the key and the file
content as the value, but in this case they are independent files.
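
To put rough numbers on this, here is a back-of-the-envelope sketch. It
assumes the file numbering 000000_0 through 000379_0 is contiguous (380
files) and that every file sits at one end of the 1-4 MB range; the
function name blocks_needed is only illustrative.

```python
import math

BLOCK_SIZE_MB = 128   # cluster block size stated above
NUM_FILES = 380       # assumption: 000000_0 .. 000379_0 with no gaps


def blocks_needed(total_mb, block_mb=BLOCK_SIZE_MB):
    """Minimum number of block-sized files that could hold total_mb of data."""
    return math.ceil(total_mb / block_mb)


# Evaluate both ends of the observed 1-4 MB per-file range.
for per_file_mb in (1, 4):
    total_mb = NUM_FILES * per_file_mb
    print(f"{per_file_mb} MB/file: {total_mb} MB total, "
          f"at least {blocks_needed(total_mb)} block-sized files")
```

If those assumptions hold, the data in one partition would fit in roughly
3 to 12 block-sized files instead of 380.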

Am I right that this adds overhead to any further processing of this data,
since each file needs its own Map task (and a JVM to run it), and also
because of the extra disk I/O?

If so, what would be the best way to combine the small files under a
partition directory? Thank you.

--001a114a6a7a8de38605243d11de--