hadoop-common-user mailing list archives

From: Owen O'Malley <omal...@apache.org>
Subject: Re: Implementing own InputFormat and RecordReader
Date: Mon, 15 Sep 2008 16:43:26 GMT
On Sep 15, 2008, at 6:13 AM, Juho Mäkinen wrote:

> 1) The FileInputFormat.getSplits() returns InputSplit[] array. If my
> input file is 128MB and my HDFS block size is 64MB, will it return one
> InputSplit or two InputSplits?

Your InputFormat needs to override:

// Prevent FileInputFormat.getSplits() from splitting files at
// block boundaries.
protected boolean isSplitable(FileSystem fs, Path filename) {
   return false;
}

which tells FileInputFormat.getSplits() not to split files. You will
end up with a single split for each file.
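
For concreteness, here is a minimal sketch of such an InputFormat
(untested, using the old org.apache.hadoop.mapred API; the names
WholeFileInputFormat and WholeFileRecordReader are just placeholders
for your own classes, not anything that ships with Hadoop):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.*;

// Illustrative only: treats each input file as a single record.
public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  // Never split a file across map tasks.
  protected boolean isSplitable(FileSystem fs, Path filename) {
    return false;
  }

  public RecordReader<NullWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter)
      throws IOException {
    // WholeFileRecordReader is sketched further down in this message.
    return new WholeFileRecordReader((FileSplit) split, job);
  }
}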

> 2) If my file is split into two or more filesystem blocks, how will
> hadoop handle the reading of those blocks? As the file must be read in
> sequence, will hadoop first copy every block to a machine (if the
> blocks aren't already there) and then start the mapper on that
> machine? Do I need to handle opening and reading multiple blocks,
> or will hadoop provide me a simple stream interface which I can use to
> read the entire file without worrying whether the file is larger than
> the HDFS block size?

HDFS transparently handles the data motion for you. You can just use
FileSystem.open(path) and HDFS will pull each block from the closest
replica. It doesn't copy the blocks to your local disk first; it
streams the data straight to the application. Basically, you don't
need to worry about it.
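
As an illustration (again untested, and the class name is my
placeholder, not part of Hadoop), the matching RecordReader for the
format sketched above can open the file as one stream and let HDFS
fetch the underlying blocks:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.*;

// Illustrative only: delivers the whole file as one record.
// The int cast assumes files under 2 GB.
class WholeFileRecordReader
    implements RecordReader<NullWritable, BytesWritable> {
  private final FileSplit split;
  private final JobConf conf;
  private boolean processed = false;

  WholeFileRecordReader(FileSplit split, JobConf conf) {
    this.split = split;
    this.conf = conf;
  }

  public boolean next(NullWritable key, BytesWritable value)
      throws IOException {
    if (processed) return false;
    byte[] contents = new byte[(int) split.getLength()];
    Path file = split.getPath();
    FileSystem fs = file.getFileSystem(conf);
    // A single open(); HDFS reads each block from the nearest replica.
    FSDataInputStream in = fs.open(file);
    try {
      IOUtils.readFully(in, contents, 0, contents.length);
    } finally {
      IOUtils.closeStream(in);
    }
    value.set(contents, 0, contents.length);
    processed = true;
    return true;
  }

  public NullWritable createKey() { return NullWritable.get(); }
  public BytesWritable createValue() { return new BytesWritable(); }
  public long getPos() { return processed ? split.getLength() : 0; }
  public float getProgress() { return processed ? 1.0f : 0.0f; }
  public void close() throws IOException { }
}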

There are two downsides to unsplittable files. The first is that if
they are large, the map times can be very long. The second is that the
map/reduce scheduler tries to place the tasks close to the data, which
it can't do very well if the data spans blocks. Of course, if the data
isn't splittable, you don't have a choice.

-- Owen