hadoop-common-user mailing list archives

From <Maryanne.DellaSa...@gdc4s.com>
Subject Simple change to WordCount either times out or runs 18+ hrs with little progress
Date Tue, 24 May 2011 14:29:18 GMT
I am trying to familiarize myself with Hadoop and to use MapReduce to
process system log files.  I started small with a simple MapReduce
program similar to the provided WordCount example.  For each line read
in, I wanted to emit the 5th word as my output key and the constant 1 as
my output value.  This seemed simple enough, but the job consistently
timed out during mapping.  I then ran the unmodified WordCount example
on my data to see whether the data was the problem.  It was not: the
WordCount example finished quickly with accurate results.  I then took
the WordCount example and added a counter to the map so that it would
output only the 5th word in the line.  When I ran this, it ran for 18+
hrs with little to no progress.  I tried what should be an equivalent
way of getting the 5th word, and it once again timed out.  Any help
would be appreciated.

I am running in the pseudo-distributed mode described in the Quickstart,
on a Windows XP machine running Cygwin.  I am using hadoop-0.21.0.  I
have verified that I can run the provided examples and that my nodes and
trackers are running properly.

I took the WordCount example code described here:

http://code.google.com/p/hop/source/browse/trunk/src/examples/org/apache/hadoop/examples/WordCount.java?r=1027

and changed the Map function to:
  public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
   
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
   
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      int count = 0;
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        if (count == 5)
        {
          word.set(itr.nextToken());
          output.collect(word, one);
        }
        count++;
      }
    }
  }

After 18 hrs 35 min, this job had reached only 0.55% map completion.
There were no errors in the logs or on the command line.  Running the
program without the count variable finishes mapping in less than a
minute on the same data.  When I changed it to call itr.nextToken() 4
times before calling it a 5th time to set the word, it timed out.  I had
previously verified that the data always has more than 5 tokens per
line.  My similar program, which regularly timed out, used the split
function on my delimiter to pull out the 5th word.
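
For reference, the intended extraction can be sketched outside Hadoop
as follows.  This is only a minimal sketch under stated assumptions:
tokens are whitespace-delimited, "5th word" means position 5 counting
from 1, and the names FifthToken, fifthToken and fifthField are
hypothetical.  Unlike the map loop above, which calls itr.nextToken()
only when count == 5, the tokenizer version here advances on every
iteration.  The split version assumes a whitespace delimiter, which may
not match the actual log format.

```java
import java.util.StringTokenizer;

public class FifthToken {

    // Returns the 5th whitespace-delimited token (1-based),
    // or null if the line has fewer than 5 tokens.
    static String fifthToken(String line) {
        StringTokenizer itr = new StringTokenizer(line);
        int position = 0;
        while (itr.hasMoreTokens()) {
            String token = itr.nextToken(); // consume a token on every pass
            position++;
            if (position == 5) {
                return token;
            }
        }
        return null;
    }

    // Same result using String.split; the "\\s+" delimiter is an assumption.
    static String fifthField(String line) {
        String[] fields = line.trim().split("\\s+");
        return fields.length >= 5 ? fields[4] : null;
    }

    public static void main(String[] args) {
        String line = "May 24 14:29:18 host sshd session opened";
        System.out.println(fifthToken(line)); // prints sshd
        System.out.println(fifthField(line)); // prints sshd
    }
}
```

Inside a Mapper, the returned string would be passed to word.set(...)
before output.collect(word, one), with lines that return null skipped.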

Thank you for your help!
- Maryanne DellaSalla
