hadoop-common-user mailing list archives

From "Owen O'Malley" <...@yahoo-inc.com>
Subject Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base
Date Fri, 19 Oct 2007 04:44:05 GMT

On Oct 18, 2007, at 5:04 PM, Lance Amundsen wrote:

> You said arbitrary.. maybe I missed something.  Can I construct a
> getSplits() method that chunks up the file however I want?

Yes. The application specifies an InputFormat class, which has a
getSplits method that returns a list of InputSplits. The "standard"
input formats extend FileInputFormat, which has the behavior we have
been describing. However, your InputFormat can generate InputSplits
however it wants. For an example of an unusual variation, look at the
RandomWriter example. It creates input splits that aren't based on
any files at all. It just creates a split for each map that it wants.

>   I assumed I
> needed to return a split map that corresponded to key, value  
> boundaries,

SequenceFileInputFormat and TextInputFormat don't need the splits to
match the record boundaries. They both start at the first record
after the split's start offset and continue to the next record after
the split's end. TextInputFormat always treats "\n" as the record
delimiter, and SequenceFile uses constant blocks of bytes ("sync
markers") to find record boundaries.
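
To illustrate the boundary rule for newline-delimited text, here is a
minimal sketch in plain Java. It is not Hadoop's actual record reader;
the class name SplitLineReader and the method readSplit are
hypothetical, and the whole file is assumed to fit in a byte array. A
split that does not start at offset 0 skips forward to the first record
that begins at or after its start, and the last record it reads may run
past its end.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the Hadoop API.
class SplitLineReader {

    /** Returns the "\n"-delimited records that begin inside [start, end). */
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start > 0) {
            // Back up one byte so a record that begins exactly at `start`
            // is kept, then skip to just past the next newline.
            pos = start - 1;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++;
        }
        List<String> records = new ArrayList<>();
        // Read every record that starts before `end`, even if it finishes
        // after `end`; the following split will skip that partial record.
        while (pos < end && pos < data.length) {
            int eol = pos;
            while (eol < data.length && data[eol] != '\n') {
                eol++;
            }
            records.add(new String(data, pos, eol - pos, StandardCharsets.UTF_8));
            pos = eol + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        int splitSize = 8; // deliberately not aligned with record boundaries
        for (int start = 0; start < data.length; start += splitSize) {
            int end = Math.min(start + splitSize, data.length);
            System.out.println("[" + start + "," + end + ") -> " + readSplit(data, start, end));
        }
    }
}
```

With 8-byte splits over this 26-byte input, the four splits yield
[alpha, bravo], [charlie], [delta], and [] respectively, so every record
is read exactly once even though no split boundary falls on a newline.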

> 1 file, 1000 records, 1000 maps requested yields 43 actual maps
> 1 file, 10,000 records,  10,000 maps requested yields 430 actual maps

I don't understand how this is happening. What is the data size,
block size, and minimum split size in your job?

> In all of these cases I can only get 2 tasks/node running at the same
> time.... once in a while 3 run.... even though I have specified a
> higher number to be allowed.

Are your maps finishing quickly (< 20 seconds)?

> I want 1 map per record, from one file, for any number of records,
> and I want it guaranteed.  Later I may want 10 records, or 100, but
> right now I want to force a one-record-per-mapper relationship, and
> I do not want to pay the file creation overhead of, say, 1000 files
> just to get 1000 maps.

That is completely doable, although to make it perform well you need
either an index from row number to file offset or fixed-width
records. In any case, you'll need to write your own InputFormat.
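
As a rough sketch of the two options, the following plain Java computes
one byte range per record, first for fixed-width records and then from a
precomputed row-offset index. It does not use the Hadoop API: ByteRange
is a hypothetical stand-in for an InputSplit, and OneRecordSplits,
fixedWidthSplits, and indexedSplits are illustrative names. A real
InputFormat would return ranges like these from getSplits.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the Hadoop API.
class OneRecordSplits {

    /** Stand-in for an InputSplit: a byte range within a single file. */
    static final class ByteRange {
        final long offset;
        final long length;

        ByteRange(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }

        @Override
        public String toString() {
            return "offset=" + offset + " length=" + length;
        }
    }

    /** Fixed-width records: record i starts at i * recordWidth. */
    static List<ByteRange> fixedWidthSplits(long fileLength, long recordWidth) {
        if (recordWidth <= 0 || fileLength % recordWidth != 0) {
            throw new IllegalArgumentException("file length must be a multiple of the record width");
        }
        List<ByteRange> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += recordWidth) {
            splits.add(new ByteRange(offset, recordWidth));
        }
        return splits;
    }

    /**
     * Variable-width records: rowOffsets[i] is the byte offset of row i,
     * taken from an index built ahead of time. Each record ends where the
     * next one begins; the last record ends at fileLength.
     */
    static List<ByteRange> indexedSplits(long[] rowOffsets, long fileLength) {
        List<ByteRange> splits = new ArrayList<>();
        for (int i = 0; i < rowOffsets.length; i++) {
            long next = (i + 1 < rowOffsets.length) ? rowOffsets[i + 1] : fileLength;
            splits.add(new ByteRange(rowOffsets[i], next - rowOffsets[i]));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Four 100-byte records in a 400-byte file.
        fixedWidthSplits(400, 100).forEach(System.out::println);
        // Three variable-width records in a 30-byte file.
        indexedSplits(new long[] {0, 7, 19}, 30).forEach(System.out::println);
    }
}
```

Either way, no split has to scan the file to find its record, which is
what keeps one map per record cheap to set up.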

-- Owen
