hive-user mailing list archives

From Raj Hadoop <hadoop...@yahoo.com>
Subject part-m-00000 files and their size - Hive table
Date Wed, 26 Feb 2014 01:42:20 GMT
Hi,

I am loading data into HDFS through sqoop and creating a Hive table that points to these
files.

The files written by the sqoop mappers look like this:

part-m-00000
part-m-00001
part-m-00002

My question is:
1) For Hive query performance, how significant is the distribution of the file sizes
above?

part_m_0 say 1 GB
part_m_1 say 3 GB
part_m_2 say 0.25 GB

Vs

part_m_0 say 1.4 GB
part_m_1 say 1.4 GB
part_m_2 say 1.45 GB
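
To put a number on the difference between the two layouts, here is a minimal sketch in Python. It assumes the sample sizes above (reading the last figure as 1.45 GB) and uses a hypothetical helper, skew_ratio, which is not part of sqoop or Hive. It only measures how far the largest file is from the average size and says nothing by itself about query time.

```python
def skew_ratio(sizes_gb):
    """Return the largest file size divided by the mean file size.

    A value near 1.0 means the files are roughly equal in size;
    larger values mean one file dominates.
    """
    mean = sum(sizes_gb) / len(sizes_gb)
    return max(sizes_gb) / mean


# Sample sizes in GB, taken from the two layouts above (illustrative only).
uneven = [1.0, 3.0, 0.25]
even = [1.4, 1.4, 1.45]  # assumption: the last file is 1.45 GB

print("uneven layout:", round(skew_ratio(uneven), 2))
print("even layout:  ", round(skew_ratio(even), 2))
```

Both layouts hold the same total (4.25 GB), so the ratio isolates the imbalance rather than the data volume.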


NOTE: The sizes and number of files are just a sample. The real numbers are far bigger.


I am assuming the uniform distribution has a performance benefit.

If so, what is the reason, and could someone explain the technical details?
