hadoop-common-user mailing list archives

From Anh Nguyen <nguyenminh...@gmail.com>
Subject NLineInputFormat - Map always left out one split
Date Sun, 23 Aug 2009 20:39:41 GMT
I am using Hadoop for a research project. My map phase uses NLineInputFormat,
which takes a few lines as one split. Each line specifies a filename. So if I
have 10 input files 1..10 in my HDFS home directory, my input file list looks
like this:

~/1
~/2
.
.
.
~/10
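
As a rough illustration of the splitting described above, the sketch below groups the lines of such a file list into chunks of N lines. It is a plain-Java model, not Hadoop's own code: the class name `NLineSplitSketch` and the choice of 5 lines per split are assumptions made for the example. In the old `mapred` API the lines-per-split value is, as far as I know, read from the `mapred.line.input.format.linespermap` property.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a plain-Java model of N-line splitting,
// not the actual NLineInputFormat implementation.
class NLineSplitSketch {

    // Groups the lines of a file list into consecutive chunks of at most n lines.
    static List<List<String>> split(List<String> lines, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += n) {
            int end = Math.min(start + n, lines.size());
            splits.add(new ArrayList<>(lines.subList(start, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Build the 10-line file list ~/1 .. ~/10 from the example.
        List<String> fileList = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            fileList.add("~/" + i);
        }
        // Assumed value: 5 lines per split, so the list yields 2 splits.
        List<List<String>> splits = split(fileList, 5);
        for (int i = 0; i < splits.size(); i++) {
            System.out.println("split " + i + ": " + splits.get(i));
        }
    }
}
```

Under these assumptions every line of the list lands in exactly one split, so each filename should reach exactly one map task.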

It used to work fine, but recently I ran into this problem: the map phase
cannot finish because it always leaves out one split. For example, with 2
splits:

09/08/23 15:32:02 INFO mapred.FileInputFormat: Total input paths to process : 1
09/08/23 15:32:03 INFO mapred.JobClient: Running job: job_200908101504_0075
09/08/23 15:32:04 INFO mapred.JobClient:  map 0% reduce 0%
09/08/23 15:32:10 INFO mapred.JobClient:  map 50% reduce 0%
09/08/23 15:32:20 INFO mapred.JobClient:  map 50% reduce 8%

And then everything is stuck there. I don't know why reduce gets to 8% even
though the map phase has not finished. I am using Hadoop 0.19.1.

I think this is a Hadoop problem because at the very beginning of each map task
I print the input value, which is the name of the file to be processed. When I
look at the logs of all the mappers, many of these lines are missing, meaning
that some file names are never sent to a mapper.
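
One way to make this check systematic is to compare the file list against the names collected from the mapper logs. The sketch below is plain Java with hypothetical data: the class name `MissingInputCheck` and the sample names are assumptions, and gathering the logged names from the task logs is left out.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: finds file names that were listed as input
// but never printed by any mapper.
class MissingInputCheck {

    // Returns the expected names absent from the logged names, in list order.
    static Set<String> missing(Collection<String> expected, Collection<String> logged) {
        Set<String> result = new LinkedHashSet<>(expected);
        result.removeAll(logged);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical data: four listed files, two of which show up in the logs.
        List<String> expected = Arrays.asList("~/1", "~/2", "~/3", "~/4");
        List<String> logged = Arrays.asList("~/1", "~/3");
        System.out.println("missing: " + missing(expected, logged)); // prints "missing: [~/2, ~/4]"
    }
}
```

If the missing names all fall within one contiguous block of the file list, that would point at a single split never being processed rather than at individual records being dropped.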

Any comments or suggestions on how to fix this are welcome.

Another related question: is there a better way to split the map inputs so
that each raw binary file is one split, with key = path of the file?
SequenceFileInputFormat seems to require that both <K,V> are stored within
the file.

Thanks,

-- 
----------------------------
Anh Nguyen
http://www.im-nguyen.com
