hive-user mailing list archives

From Raj Hadoop <hadoop...@yahoo.com>
Subject Re: part-m-00000 files and their size - Hive table
Date Wed, 26 Feb 2014 03:33:05 GMT
Thanks for the detailed explanation Yong. It helps.

Regards,
Raj





On Tuesday, February 25, 2014 9:18 PM, java8964 <java8964@hotmail.com> wrote:
 
Yes, it is good for the file sizes to be roughly even, but it is not very important unless some
files are very small (compared to the block size).

The reasons are:

Your files should be splittable to be used in Hadoop (or in Hive; it is the same thing). If
they are splittable, then a 1G file will use 8 blocks (assuming the block size is 128M), and a 256M
file will take 2 blocks. So these 2 files will generate 10 mapper tasks, which will be spread
evenly across your cluster. From a performance point of view, each of those 10 mapper tasks
processes about the same amount of data, so one 1G file plus one 256M file is not a big deal. But
if one of your files is very small, like 10M, that file will also consume one mapper
task, and that is bad for performance: starting and stopping a task uses quite some resources,
yet the task only processes 10M of data.
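
To check the arithmetic, here is a minimal Python sketch. It assumes a simplified model in which a splittable file gets one mapper per 128M block; real input formats can differ slightly. The name mapper_count is made up for this illustration and is not a Hadoop API.

```python
import math

BLOCK_SIZE_MB = 128  # assumed HDFS block size, as in the example above


def mapper_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Approximate number of map tasks for one splittable file.

    Simplification: one mapper per block, and at least one mapper
    for any non-empty file, however small.
    """
    return max(1, math.ceil(file_size_mb / block_size_mb))


sizes_mb = {"1G file": 1024, "256M file": 256, "10M file": 10}
for name, size in sizes_mb.items():
    n = mapper_count(size)
    print(f"{name}: {n} mapper(s), about {size / n:.0f}M each")
```

The 10M file still costs a full mapper, which is why many tiny files hurt more than a few files of different sizes.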

The reason you see uneven file sizes in the output of sqoop is that it is hard for sqoop
to split your source data evenly. For example, if you dump table A from a DB to hive, sqoop
will do the following:

1) Identify the primary/unique keys of the table.
2) Find the min/max values of the keys; let's say they are 1 and 1,000,000.
3) Split that range based on the number of mapper tasks. If you run sqoop with 4 mappers, then the data
will be split into 4 groups: (1, 250,000) (250,001, 500,000) (500,001, 750,000) (750,001, 1,000,000).
As you can imagine, your data is most likely not evenly distributed by primary key across those
4 groups, so you will get unevenly sized part-m-xxx output files.
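
The steps above can be sketched in Python as follows. This is only an illustration of the description in this email, not sqoop's actual splitter code; the function names and the skewed key set are hypothetical.

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the inclusive key range [lo, hi] into contiguous, near-equal ranges."""
    total = hi - lo + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = lo
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)  # spread any remainder over the first ranges
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def rows_per_range(keys, ranges):
    """Count how many keys fall into each range, i.e. rows per part-m file."""
    return [sum(lo <= k <= hi for k in keys) for lo, hi in ranges]


# Hypothetical table: ids 1..200,000 are dense, after that only every 100th id exists.
keys = list(range(1, 200_001)) + list(range(200_100, 1_000_001, 100))

ranges = split_ranges(min(keys), max(keys), 4)
for (lo, hi), rows in zip(ranges, rows_per_range(keys, ranges)):
    print(f"keys {lo:>9,} to {hi:>9,}: {rows:>7,} rows")
```

The key ranges are all the same width, but almost all of the rows land in the first one, so the first part-m file would be far larger than the rest.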

Keep in mind that it is not required to use primary keys or unique keys as the split column.
So you can choose whatever column in your table makes sense. Pick whatever makes the
split more even.
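
One rough way to compare candidate split columns (in sqoop the column is chosen with the --split-by option) is to bucket a sample of each column's values into equal-width ranges and see how lopsided the counts are. The sketch below assumes numeric columns; the column names order_id and customer_id, the sample data, and the skew measure are all made up for illustration.

```python
def bucket_counts(values, num_mappers):
    """Count values per equal-width bucket between min(values) and max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo + 1) / num_mappers
    counts = [0] * num_mappers
    for v in values:
        idx = min(int((v - lo) / width), num_mappers - 1)
        counts[idx] += 1
    return counts


def skew(values, num_mappers):
    """Largest bucket divided by the average bucket; 1.0 means perfectly even."""
    counts = bucket_counts(values, num_mappers)
    return max(counts) / (sum(counts) / num_mappers)


# Hypothetical sample of 10,000 rows with two candidate split columns.
order_id = list(range(1, 9_001)) + [1_000_000 - i for i in range(1_000)]  # clustered at both ends
customer_id = [i % 1_000 + 1 for i in range(10_000)]                      # spread over 1..1000

for name, column in [("order_id", order_id), ("customer_id", customer_id)]:
    print(f"{name}: buckets={bucket_counts(column, 4)} skew={skew(column, 4):.2f}")
```

In this made-up sample, customer_id would give four equal outputs while order_id would put most rows in one file, so customer_id would be the better split column here.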

Yong



________________________________
Date: Tue, 25 Feb 2014 17:42:20 -0800
From: hadoopraj@yahoo.com
Subject: part-m-00000 files and their size - Hive table
To: user@hive.apache.org


Hi,

I am loading data into HDFS files through sqoop and creating a Hive table to point to these
files.

The mapper output files generated by sqoop look like this:

part-m-00000

part-m-00001

part-m-00002

My question is:
1) For Hive query performance, how important or significant is the distribution of the file
sizes above?

part_m_0 say 1 GB
part_m_1 say 3 GB
part_m_2 say 0.25 GB

Vs

part_m_0 say 1.4 GB
part_m_1 say 1.4 GB
part_m_2 say 1.45 GB


NOTE: The sizes and number of files are just a sample. The real numbers are far bigger.


I am assuming the uniform distribution has a performance benefit.

If so, what is the reason, and can I know the technical details?