hadoop-mapreduce-user mailing list archives

From 周杰 <zhoujie...@126.com>
Subject A question about "input split"
Date Mon, 26 Sep 2011 09:02:06 GMT
Hello, everyone!
While reading the Hadoop source code, I ran into something I do not understand.
As we all know, when mapred.max.split.size >= blocksize is set in the conf, splitSize == blocksize (with the default minimum split size).
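For reference, as far as I can tell the split size comes from computeSplitSize(), which reduces to max(minSize, min(maxSize, blockSize)). Here is a minimal standalone sketch of that formula. The class name SplitSizeSketch and the sample values are made up for illustration, and the sizes are in arbitrary units:

```java
public class SplitSizeSketch {
    // Assumed to mirror FileInputFormat.computeSplitSize():
    // the block size, capped by maxSize and floored by minSize.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64; // illustrative value, not a real HDFS block size
        long minSize = 1;    // illustrative minimum split size

        // max split size >= block size: the split size equals the block size
        System.out.println(computeSplitSize(blockSize, minSize, 128)); // prints 64

        // max split size < block size: the split size is smaller than a block
        System.out.println(computeSplitSize(blockSize, minSize, 50));  // prints 50
    }
}
```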
My question concerns the case mapred.max.split.size < blocksize, where splitSize is smaller than blocksize. Look at the function getSplits() of class FileInputFormat:
  public List<InputSplit> getSplits(JobContext job
                                    ) throws IOException {
 ......
    for (FileStatus file: listStatus(job)) {
      Path path = file.getPath();
      FileSystem fs = path.getFileSystem(job.getConfiguration());
      long length = file.getLen();
      BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
      if ((length != 0) && isSplitable(job, path)) { 
        long blockSize = file.getBlockSize();
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);


        long bytesRemaining = length;
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(new FileSplit(path, length-bytesRemaining, splitSize, 
                                   blkLocations[blkIndex].getHosts()));
          bytesRemaining -= splitSize;
        }
......
  }


Notice the while loop: if splitSize is smaller than blocksize, the result is confusing.
For example, with blocksize = 64, splitSize = 50, and file length = 200:
            bytesRemaining   splitSize   bytesRemaining/splitSize   length-bytesRemaining
1st loop:   200              50          4                          0
2nd loop:   150              50          3                          50
That means that in the 2nd loop iteration, the call

new FileSplit(path, length-bytesRemaining, splitSize,
              blkLocations[blkIndex].getHosts());

has start = length-bytesRemaining = 50 and length = splitSize = 50, so the 2nd split covers two blocks (blocks: 0-64, 64-128, 128-192, ...). With start = 50 and length = 50, it covers 50-64 and 64-100.
But the FileSplit constructor is given the host information of only one block (blkLocations[blkIndex].getHosts()).
I do not understand this.
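
To make the example concrete, here is a small standalone sketch that replays only the while loop quoted above with blocksize = 64, splitSize = 50, and file length = 200. It does not use any Hadoop classes. The names SplitLoopSketch, Split, hostBlock, and lastBlock are made up for illustration, and I assume that getBlockIndex() returns the block containing the split's start offset and that SPLIT_SLOP is 1.1. The elided code after the loop is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitLoopSketch {
    // Assumed to match the SPLIT_SLOP constant in FileInputFormat.
    static final double SPLIT_SLOP = 1.1;

    /** One simulated split (hypothetical helper type, not Hadoop's FileSplit). */
    static final class Split {
        final long start;
        final long length;
        final long hostBlock; // block containing the start offset (hosts come from here)
        final long lastBlock; // block containing the last byte of the split

        Split(long start, long length, long blockSize) {
            this.start = start;
            this.length = length;
            this.hostBlock = start / blockSize;
            this.lastBlock = (start + length - 1) / blockSize;
        }
    }

    /** Replays the quoted while loop for a single file with fixed-size blocks. */
    static List<Split> simulate(long fileLength, long blockSize, long splitSize) {
        List<Split> splits = new ArrayList<>();
        long bytesRemaining = fileLength;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new Split(fileLength - bytesRemaining, splitSize, blockSize));
            bytesRemaining -= splitSize;
        }
        return splits;
    }

    public static void main(String[] args) {
        for (Split s : simulate(200, 64, 50)) {
            System.out.println("start=" + s.start
                    + " length=" + s.length
                    + " hostBlock=" + s.hostBlock
                    + " lastBlock=" + s.lastBlock);
        }
    }
}
```

Under these assumptions, the split starting at offset 50 has hostBlock = 0 but lastBlock = 1, which is exactly the case I am asking about: the split reaches into a second block, yet only the first block's hosts are passed to the constructor.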



