hadoop-mapreduce-user mailing list archives

From Majid Azimi <majid.merk...@gmail.com>
Subject HDFS Block size vs Input Split Size
Date Sun, 18 Nov 2012 05:55:25 GMT
Hi guys,

I want to confirm that I have understood this topic correctly. The HDFS
block size is the number of bytes per chunk that HDFS uses when it splits a
large file into blocks. The input split size is the number of bytes each
mapper will actually process. It may be smaller or larger than the HDFS
block size. Am I right?

Suppose we want to load a 110 MB text file into HDFS. The HDFS block size
and the input split size are both set to 64 MB.

1. The number of mappers is based on the number of input splits, not the
number of HDFS blocks. Right?
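
To make question 1 concrete, here is a rough sketch of the split arithmetic
as I understand it from FileInputFormat: the split size is
max(minSize, min(maxSize, blockSize)), and the last split is allowed to be up
to 10% larger than the split size before another split is cut. This is a
Python model, not Hadoop code; the function names are made up for
illustration.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # assumed: the last split may exceed the split size by up to 10%


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    # Split size follows the block size unless min/max settings override it.
    return max(min_size, min(max_size, block_size))


def compute_splits(file_len, split_size):
    # Return a list of (offset, length) pairs covering the whole file.
    splits = []
    remaining = file_len
    while remaining / split_size > SPLIT_SLOP:
        splits.append((file_len - remaining, split_size))
        remaining -= split_size
    if remaining != 0:
        splits.append((file_len - remaining, remaining))
    return splits


split_size = compute_split_size(64 * MB)
print(compute_splits(110 * MB, split_size))  # two splits: 64 MB and 46 MB
print(compute_splits(70 * MB, split_size))   # one split of 70 MB, larger than a block
```

Under these assumptions the 110 MB file gives two splits and therefore two
map tasks, while a 70 MB file would give a single split that is larger than
one block.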

2. When we set the HDFS block size to 64 MB, is this exactly 67108864
(64*1024*1024) bytes? I mean, does it not matter that the file will be
split in the middle of a line?
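
The arithmetic behind question 2 can be checked with a few lines of Python.
This sketch assumes blocks are cut purely by byte count, with the last block
holding only the remaining bytes; block_lengths is a hypothetical helper,
not part of any HDFS API.

```python
MB = 1024 * 1024


def block_lengths(file_len, block_size):
    # Full blocks of block_size bytes, then one shorter tail block if needed.
    full, tail = divmod(file_len, block_size)
    return [block_size] * full + ([tail] if tail else [])


print(64 * MB)                           # 67108864
print(block_lengths(110 * MB, 64 * MB))  # [67108864, 48234496]
```

The boundary at byte 67108864 ignores line endings, so a line can start in
the first block and finish in the second.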

3. Now we have 2 input splits (so two map tasks). The last line of the
first block and the first line of the second block are not meaningful on
their own. TextInputFormat is responsible for reading complete lines and
giving them to the map tasks. What TextInputFormat does is:

   - In the second block, it will seek to the second line, which is a
   complete line, read from there, and give the lines to the second mapper.
   - The first mapper will read until the end of the first block and will
   also process the line that straddles the boundary (last incomplete line
   of the first block + first incomplete line of the second block).
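
The two rules above can be modelled with a small simulation. This is a
simplified sketch of the behaviour described, not Hadoop's actual
LineRecordReader: read_split is a hypothetical function, the sample data is
invented, and the splits are tiny so the effect is visible. A reader whose
split does not start at offset 0 skips up to and including the first newline,
and every reader keeps reading whole lines as long as the line starts at or
before the end of its split.

```python
def read_split(data: bytes, start: int, length: int) -> list:
    # Return the complete lines that this split's reader is responsible for.
    end = start + length
    pos = start
    if start != 0:
        # Skip the (possibly partial) first line; the previous reader owns it.
        nl = data.find(b"\n", pos)
        pos = len(data) if nl == -1 else nl + 1
    records = []
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        nxt = len(data) if nl == -1 else nl + 1
        # The final line may run past `end`; it is still read in full.
        records.append(data[pos:nxt].rstrip(b"\n"))
        pos = nxt
    return records


data = b"alpha\nbravo\ncharlie\ndelta\n"
# Split boundaries at bytes 10 and 20 fall inside "bravo" and right before "delta".
for start, length in [(0, 10), (10, 10), (20, 6)]:
    print(start, read_split(data, start, length))
```

In this model every line is read exactly once, and the first reader consumes
a few bytes beyond its nominal split so that it can finish the line that
crosses the boundary.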

So the input split size of the first mapper is not exactly 64 MB; it is a
bit more than that (by the first incomplete line of the second block). Also,
the input split size of the second mapper is a bit less than its nominal
size (about 46 MB here, since 110 MB - 64 MB = 46 MB). Am I right?
So the HDFS block size is an exact number, but the input split size depends
on our data's logic and may differ a little from the configured number?
