hive-user mailing list archives

From Prasanth Jayachandran
Subject Re: Hive 12 - CDH 5.0.1 - many small files when using ORC table
Date Tue, 18 Aug 2015 18:34:58 GMT
That metadata is required to identify the file as a valid ORC file. Each file holds just
a few bytes: the ORC header and the postscript information (compression, version, buffer size,
etc., as specified via table properties). It's not completely safe to delete those empty
bucket files, as there are some known issues related to joins.
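
For anyone who wants to check what one of these stub files holds, here is a minimal Python sketch. It assumes the file has already been copied to local disk (for example with `hdfs dfs -get`) and relies on two properties of the ORC format: the file begins with the 3-byte magic `ORC`, and the last byte of the file stores the length of the postscript. The function name `inspect_orc_stub` is made up for illustration and is not part of Hive or ORC.

```python
ORC_MAGIC = b"ORC"  # every ORC file starts with these three bytes


def inspect_orc_stub(data: bytes) -> dict:
    """Summarize the raw bytes of a (possibly empty) ORC file.

    Hypothetical helper: it only looks at the magic and the final byte,
    and does not decode the protobuf footer or postscript.
    """
    if len(data) < len(ORC_MAGIC) + 1:
        raise ValueError("too short to be an ORC file")
    return {
        "size": len(data),
        "has_magic": data.startswith(ORC_MAGIC),
        # The final byte of an ORC file is the serialized postscript length.
        "postscript_length": data[-1],
    }
```

Calling it on the bytes read with `open(path, "rb").read()` shows how much of a 43-byte file is postscript; whatever sits between the header and the postscript is the footer.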

On Aug 18, 2015, at 8:46 AM, Juraj jiv wrote:

Hi, yes, somewhere in the SQL scripts I saw bucketing enabled ad hoc via the set command:
"hive.enforce.bucketing" and "hive.optimize.bucketmapjoin". So that metadata is required?
I can't just delete those 43-byte files?


On Tue, Aug 18, 2015 at 5:35 PM, Prasanth Jayachandran wrote:
Are you using bucketing? If so, those are empty ORC files that contain no data, only
metadata.

From: Juraj jiv
Sent: Tuesday, August 18, 2015 8:28 AM
Subject: Hive 12 - CDH 5.0.1 - many small files when using ORC table

Hello all,

I have a question about the ORC table format. We use it for our datastore tables, but during
maintenance I noticed there are many small files inside the tables which I presume don't contain
any data. They are only 43 bytes in size and they make up around 70% of all files inside the
table folder.

For example (counting the files that are 43 bytes in size, then all the others):

hadoop@hadoopnn:~$ hdfs dfs -du -h /user/hive/warehouse/dwh.db/<table>/date_report_start_part=2015-07-30 | grep "^43 " | wc -l
hadoop@hadoopnn:~$ hdfs dfs -du -h /user/hive/warehouse/dwh.db/<table>/date_report_start_part=2015-07-30 | grep -v "^43 " | wc -l
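
The same two counts can be sketched in Python in a single pass. This is only an illustration under a few assumptions: the input is the text printed by `hdfs dfs -du` run without `-h`, so the first column is a plain byte count, and the last column is the path (some Hadoop versions print an extra column in between, which is ignored here). The function name `count_by_size` is hypothetical.

```python
def count_by_size(du_output: str, target_bytes: int = 43) -> tuple:
    """Return (files of exactly target_bytes, all other files).

    Assumes each line looks like "<size in bytes> ... <path>", as printed
    by `hdfs dfs -du` without -h. Lines that do not fit are skipped.
    """
    matching = 0
    other = 0
    for line in du_output.splitlines():
        fields = line.split()
        # Need at least a size and a path, and the size must be numeric.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        if int(fields[0]) == target_bytes:
            matching += 1
        else:
            other += 1
    return matching, other
```

Feeding it the captured command output (for example `count_by_size(sys.stdin.read())` with the `hdfs dfs -du` output piped in) gives both numbers at once, without relying on the `"^43 "` prefix match.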

Why is that? Why are there so many 43-byte files?

The ASCII content of the files, which I guess is just the ORC header, is:

Hive version:
0.12.0+cdh5.0.1+315     1.cdh5.0.1.p0.31     CDH 5

