hadoop-mapreduce-user mailing list archives

From java8964 java8964 <java8...@hotmail.com>
Subject Will different files in HDFS trigger different mapper
Date Wed, 02 Oct 2013 20:22:01 GMT
Hi, I have a question about how mappers are generated for input files from HDFS. I
understand the split and block concepts in HDFS, but my original understanding was that
one mapper will only process data from one file in HDFS, no matter how small that file
is. Is that correct?
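To make my assumption concrete, here is a back-of-the-envelope sketch of what I mean: with a plain, non-combining FileInputFormat, splits never cross file boundaries, so each file contributes at least one split (and thus one map task). The helper name and the file sizes below are hypothetical, just for illustration:

```java
import java.util.List;

public class SplitCount {
    // Hypothetical helper: classic per-file splitting. Each file yields
    // ceil(size / blockSize) splits, and never fewer than one, so a split
    // (and hence a map task) never mixes data from two files.
    static long perFileSplits(List<Long> fileSizes, long blockSize) {
        long splits = 0;
        for (long size : fileSizes) {
            splits += Math.max(1, (size + blockSize - 1) / blockSize);
        }
        return splits;
    }

    public static void main(String[] args) {
        long block = 64L * 1024 * 1024; // 64 MB, the Hadoop 1.x default
        // 100 small 1 MB files plus one 200 MB file:
        // 100 * 1 split + ceil(200/64) = 100 + 4 = 104 splits
        List<Long> sizes = new java.util.ArrayList<>();
        for (int i = 0; i < 100; i++) sizes.add(1L * 1024 * 1024);
        sizes.add(200L * 1024 * 1024);
        System.out.println(SplitCount.perFileSplits(sizes, block)); // prints 104
    }
}
```

Under this model the mapper count can never be smaller than the file count, which is the behavior I am asking about.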
The reason I ask is that in some ETL jobs I have seen logic that interprets the data set based
on a file-name convention. In the mapper, before processing the first key/value pair, we can
get the file name of the current input and initialize some logic from it. After that, we don't
need to worry that later data could come from another file, since one map task will only handle
data from one file, even when the file is very small. So small files not only cause trouble
for NameNode memory, they also waste map tasks, as each map task may consume too little
data.
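The pattern I mean looks roughly like this (a sketch against the Hadoop 1.x new API; the class name and the ETL branching are hypothetical, and it assumes a non-combining input format so each split covers exactly one file):

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private String inputFileName; // the file this map task is reading

    @Override
    protected void setup(Context context) {
        // With a plain (non-combining) FileInputFormat, each split covers a
        // single file, so this name is stable for the whole task.
        inputFileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        // ... initialize file-name-specific ETL logic here (hypothetical) ...
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        // inputFileName is assumed valid for every record in this split
    }
}
```

With the old mapred API (which Hadoop 1.0.4 also supports), the equivalent is reading the "map.input.file" property from the JobConf in configure().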
But today, when I ran the following Hive query (Hadoop 1.0.4 and Hive 0.9.1):
select partition_column, count(*) from test_table group by partition_column
it generated only 2 mappers in the MR job. This is an external Hive table, and the input for
this MR job is only 338 MB, but there are more than 100 data files in HDFS for this table,
many of them very small. This is a one-node cluster, but it is configured as a full
(distributed) one-node cluster, not local mode. Shouldn't the MR job generated here trigger at
least 100 mappers? Or is it that in Hive my original assumption no longer holds?
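My guess, which I have not verified, is that Hive's default input format in this version combines small files into fewer splits, unlike a plain per-file FileInputFormat. If that is right, something like the following in the Hive CLI should show the current setting and let me fall back to the non-combining input format (whether that fully restores per-file splits is my assumption):

```
-- Show which input format Hive is currently using
set hive.input.format;
-- Fall back to the non-combining input format (one split per file/block)
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
```

If someone can confirm whether this is what is happening, I would appreciate it.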