hadoop-common-user mailing list archives

From "Patterson, Josh" <jpatters...@tva.gov>
Subject Small Test Data Sets
Date Tue, 24 Mar 2009 18:04:04 GMT
I want to confirm with the list something that I'm seeing:
 
I needed to confirm that my Reader was reading our file format
correctly, so I created an MR job that simply output each K/V pair to the
reducer, which then just wrote out each one to the output file. This
allows me to check by hand that all K/V points of data from our file
format are getting pulled out of the file correctly. I have set up our
InputFormat, RecordReader, and Reader subclasses for our specific file
format.
 
While running some basic tests on a small (1 MB) single file I noticed
something odd: I was getting 2 copies of each data point in the
output file. Initially I thought my Reader was just somehow reading the
data point and not moving the read head, but I verified that was not the
case through a series of tests.
 
I then went on to reason that since I had 2 mappers by default on my
job, and only 1 input file, that each mapper must be reading the file
independently. I then set the -m flag to 1, and I got the proper output.
Is it safe to assume in testing on a file that is smaller than the block
size that I should always use -m 1 in order to get proper block->mapper
mapping? Also, should I assume that if you have more mappers than disk
blocks involved that you will get duplicate values? I may have set
something wrong; I just wanted to check. Thanks.
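
One way to reason about the behavior described above is to simulate it
without Hadoop. The sketch below is only an illustration under stated
assumptions, not the actual Reader from this job: it assumes a
hypothetical fixed-length record format (64 bytes per record), a 64 MB
block size, and the split-size rule max(minSize, min(goalSize,
blockSize)) with goalSize = fileSize / requested maps, which is how the
old-API FileInputFormat sizes splits as far as I know (the real code
also lets the last split run slightly long, which is ignored here). The
names SplitSketch, naiveRead, and splitAwareRead are made up for the
example. It compares a reader that ignores its split bounds with one
that only emits records starting inside its split.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Assumed split-size rule: max(minSize, min(goalSize, blockSize)),
    // where goalSize is the file size divided by the requested map count.
    static long splitSize(long fileSize, int requestedMaps, long minSize, long blockSize) {
        long goalSize = fileSize / Math.max(requestedMaps, 1);
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Cut [0, fileSize) into {start, length} byte ranges of at most splitSize bytes.
    static List<long[]> splits(long fileSize, long splitSize) {
        List<long[]> out = new ArrayList<>();
        for (long start = 0; start < fileSize; start += splitSize) {
            out.add(new long[] {start, Math.min(splitSize, fileSize - start)});
        }
        return out;
    }

    // Hypothetical reader that ignores its split and scans the whole file.
    static List<Long> naiveRead(long fileSize, long recordLen, long[] split) {
        List<Long> offsets = new ArrayList<>();
        for (long off = 0; off + recordLen <= fileSize; off += recordLen) {
            offsets.add(off);
        }
        return offsets;
    }

    // Hypothetical reader that emits only records whose first byte lies in its split.
    static List<Long> splitAwareRead(long fileSize, long recordLen, long[] split) {
        List<Long> offsets = new ArrayList<>();
        long end = split[0] + split[1];
        // First record boundary at or after the start of the split.
        long first = ((split[0] + recordLen - 1) / recordLen) * recordLen;
        for (long off = first; off < end && off + recordLen <= fileSize; off += recordLen) {
            offsets.add(off);
        }
        return offsets;
    }

    // Total number of records emitted across all map tasks for one job.
    static long totalRecords(long fileSize, int maps, long blockSize,
                             long recordLen, boolean splitAware) {
        long total = 0;
        for (long[] s : splits(fileSize, splitSize(fileSize, maps, 1, blockSize))) {
            total += splitAware
                    ? splitAwareRead(fileSize, recordLen, s).size()
                    : naiveRead(fileSize, recordLen, s).size();
        }
        return total;
    }

    public static void main(String[] args) {
        long fileSize = 1L << 20;   // 1 MB test file
        long blockSize = 64L << 20; // assumed 64 MB block size
        long recordLen = 64;        // assumed fixed record length
        for (int maps : new int[] {1, 2}) {
            System.out.println(maps + " map(s): naive="
                    + totalRecords(fileSize, maps, blockSize, recordLen, false)
                    + " splitAware="
                    + totalRecords(fileSize, maps, blockSize, recordLen, true));
        }
    }
}
```

Under these assumptions, requesting 2 maps on a 1 MB file yields two
splits even though the file fits in one block, so a reader that ignores
the split start and length would emit every record twice, while a
split-aware reader would emit each record once regardless of the map
count. If that is what is happening here, -m 1 would hide the problem
rather than fix it.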
 
Josh Patterson
TVA
 
