hadoop-common-user mailing list archives

From Lance Amundsen <lc...@us.ibm.com>
Subject Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base
Date Tue, 23 Oct 2007 00:53:59 GMT
OK, I spent forever playing with overriding SequenceFileInputFormat
behavior, and attempting my own completely different input format
(extending SFIF)... but I finally just decided to download the Hadoop
source and see exactly what the heck it is doing.  It turns out that there
is a constant in SequenceFile, SYNC_INTERVAL, and that the SFIF
constructor calls setMinSplitSize with this value (2000).  So getting a
split size less than 2000 was impossible... so I just hard coded a
split size equal to my record size in FileInputFormat, and now I am getting
exactly what I want: 1 map invocation per record per "one" input file.
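The arithmetic behind that workaround can be sketched in plain Java. This is
not Lance's actual code; RecordSplit is a hypothetical stand-in for Hadoop's
FileSplit (start offset plus length), and it assumes fixed-width records, so
forcing the split size down to one record width yields exactly one split
(and hence one map) per record:

```java
import java.util.ArrayList;
import java.util.List;

class PerRecordSplitter {

    // Hypothetical stand-in for Hadoop's FileSplit: a byte range in one file.
    static final class RecordSplit {
        final long start;
        final long length;
        RecordSplit(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    // One split per fixed-width record: split i covers bytes
    // [i * recordSize, (i + 1) * recordSize), clipped to the file length.
    static List<RecordSplit> splitPerRecord(long fileLength, long recordSize) {
        List<RecordSplit> splits = new ArrayList<>();
        for (long off = 0; off < fileLength; off += recordSize) {
            splits.add(new RecordSplit(off, Math.min(recordSize, fileLength - off)));
        }
        return splits;
    }
}
```

For example, a file of 1000 records at 64 bytes each produces 1000 splits, which
is what defeating the 2000-byte minimum split size makes possible.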

Next I want to increase the concurrent # of tasks being executed for each
node... currently it seems like 2 or 3 is the upper limit (at least on the
earlier binaries I was running).

Any comments appreciated.... searching the code now.
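For reference, the per-node cap observed above is consistent with the
TaskTracker's configured task maximum, which defaulted to 2 in Hadoop of that
era. A sketch of the hadoop-site.xml override, assuming the 0.14-era property
name (later versions split it into separate map and reduce properties):

```xml
<!-- hadoop-site.xml: raise the number of tasks a TaskTracker runs at once.
     Property name is version-dependent. -->
<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>8</value>
</property>
```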


IBM Software Group - Strategy
Performance Architect
High-Performance On Demand Solutions (HiPODS)

650-678-8425 cell

From "Owen O'Malley"
Date 10/18/2007 09:44
Subject Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base
On Oct 18, 2007, at 5:04 PM, Lance Amundsen wrote:

> You said arbitrary.. maybe I missed something.  Can I construct a
> getSplits() method that chunks up the file however I want?

Yes. The application specifies an InputFormat class, which has a
getSplits method that returns a list of InputSplits. The "standard"
input formats extend FileInputFormat, which has the behavior we have
been describing. However, your InputFormat can generate InputSplits
however it wants. For an example of an unusual variation, look at the
RandomWriter example. It creates input splits that aren't based on
any files at all; it just creates a split for each map that it wants.
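The RandomWriter-style idea can be sketched as follows. This is illustrative
only, not the actual RandomWriter code: SyntheticSplit is a hypothetical
stand-in for a custom InputSplit subclass, and each split simply tells a map
task how many synthetic records to produce, with no input file involved:

```java
import java.util.ArrayList;
import java.util.List;

class SyntheticSplits {

    // Hypothetical split carrying no file reference at all, only a
    // description of the work its map task should generate.
    static final class SyntheticSplit {
        final int mapIndex;
        final long recordsToGenerate;
        SyntheticSplit(int mapIndex, long recordsToGenerate) {
            this.mapIndex = mapIndex;
            this.recordsToGenerate = recordsToGenerate;
        }
    }

    // One synthetic split per requested map task.
    static List<SyntheticSplit> getSplits(int numMaps, long recordsPerMap) {
        List<SyntheticSplit> splits = new ArrayList<>();
        for (int i = 0; i < numMaps; i++) {
            splits.add(new SyntheticSplit(i, recordsPerMap));
        }
        return splits;
    }
}
```

Because the framework launches one map per split, asking for N splits here
guarantees N maps regardless of any file layout.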

>   I assumed I
> needed to return a split map that corresponded to key, value
> boundaries,

SequenceFileInputFormat and TextInputFormat don't need the splits to
match the record boundaries. They both start at the first record
after the split's start offset and continue to the first record past
the split's end. TextInputFormat treats "\n" as the record delimiter,
and SequenceFile uses constant byte sequences ("sync markers") to find
record boundaries.

> 1 file, 1000 records, 1000 maps requested yields 43 actual maps
> 1 file, 10,000 records,  10,000 maps requested yields 430 actual maps

I don't understand how this is happening. What are the data size,
block size, and minimum split size in your job?

> In all of these cases I can only get 2 task/node running at the same
> time.... once in a while 3 run.... even though I have specified a
> higher
> number to be allowed.

Are your maps finishing quickly (< 20 seconds)?

> I want 1 map per record, from one file, for any number of records, and I
> want it guaranteed.  Later I may want 10 records, or 100, but right now
> I want to force a one record per mapper relationship, and I do not want
> to pay the file creation overhead of, say, 1000 files, just to get 1000
> maps.

That is completely doable, although to make it perform well you
either need an index from row number to file offset or fixed-width
records. In any case, you'll need to write your own InputFormat.
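The index-based case mentioned above can be sketched like this. Names are
illustrative, not Hadoop API: OffsetSplit stands in for FileSplit, and the
offsets array is assumed to have been built in a preprocessing pass (for
variable-width records) mapping record number to byte offset:

```java
import java.util.ArrayList;
import java.util.List;

class IndexedRecordSplits {

    // Hypothetical stand-in for Hadoop's FileSplit: a byte range in one file.
    static final class OffsetSplit {
        final long start;
        final long length;
        OffsetSplit(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    // offsets[i] is the byte offset of record i; fileLength bounds the last
    // record. Each returned split covers exactly one record.
    static List<OffsetSplit> splitPerRecord(long[] offsets, long fileLength) {
        List<OffsetSplit> splits = new ArrayList<>();
        for (int i = 0; i < offsets.length; i++) {
            long end = (i + 1 < offsets.length) ? offsets[i + 1] : fileLength;
            splits.add(new OffsetSplit(offsets[i], end - offsets[i]));
        }
        return splits;
    }
}
```

With fixed-width records the index is unnecessary, since record n simply
starts at byte n * recordWidth.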

-- Owen
