hadoop-common-user mailing list archives

From Trevor Strohman <stroh...@cs.umass.edu>
Subject Combining MapReduce implementations
Date Wed, 11 Oct 2006 14:00:09 GMT
Hi all,

I just started using the Hadoop DFS last night, and it has already
solved a big performance problem we were having with throughput from
our shared NFS storage.  Thanks to everyone who has contributed to
that project.

I wrote my own MapReduce implementation because I needed two
features that Hadoop didn't have: Grid Engine integration and easy
record I/O (described below).  I'm writing to see whether you're
interested in these ideas for Hadoop, and to see what ideas I might
learn from you.

Grid Engine: All the machines available to me run Sun's Grid Engine
for job submission.  Grid Engine is important for us because it
makes sure that all of the users of a cluster get their fair share of
resources--as far as I can tell, the JobTracker assumes that a single
user owns the machines.  Is this shared-cluster scenario something
you're interested in supporting?  Would you consider supporting job
submission systems like Grid Engine or Condor?

Record I/O:  My implementation is something like the
org.apache.hadoop.record implementation, but with a couple of
twists.  In my implementation, you give the system a simple Java
class, like this:

public class WordCount {
	public String word;
	public long count;
}

and my TypeBuilder class generates code for all possible orderings of  
this class (order by word, order by count, order by word then count,  
order by count then word).  Each ordering has its own hash function  
and comparator.
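
For instance, the generated comparator for the word-then-count
ordering looks essentially like this (a hand-simplified sketch; the
class name here is just illustrative):

	import java.util.Comparator;

	public class WordThenCountOrder implements Comparator<WordCount> {
		public int compare(WordCount a, WordCount b) {
			int c = a.word.compareTo(b.word);
			if (c != 0) return c;
			// fall back to count when the words are equal
			return (a.count < b.count) ? -1 : ((a.count == b.count) ? 0 : 1);
		}
	}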

In addition, each ordering has its own serialization/deserialization  
code.  For example, if we order by count, the serialization code  
stores only differences between adjacent counts to help with  
compression.
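
A simplified sketch of that idea (not the generated code itself; the
names are illustrative):

	import java.io.DataOutput;
	import java.io.IOException;

	// Records arrive sorted by count, so the gaps between adjacent
	// counts are small non-negative numbers that a variable-length
	// encoding can store in very few bytes.
	public class CountDeltaWriter {
		private long previous = 0;

		public void writeCount(DataOutput out, long count) throws IOException {
			writeVLong(out, count - previous);  // store the gap, not the value
			previous = count;
		}

		// minimal base-128 varint for non-negative longs
		private static void writeVLong(DataOutput out, long v) throws IOException {
			while ((v & ~0x7FL) != 0) {
				out.writeByte((int) ((v & 0x7F) | 0x80));
				v >>>= 7;
			}
			out.writeByte((int) v);
		}
	}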

All this code is grouped into an Order object, which is accessed like
this:

	String[] fields = { "word" };
	Order<WordCount> order = (new WordCountType()).getOrder( fields );

This Order object contains a hash function, a comparator, and
serialization logic for ordering WordCount objects by word.
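
In interface terms, an Order looks roughly like this (a sketch; the
method names are illustrative, not final):

	import java.io.DataInput;
	import java.io.DataOutput;
	import java.io.IOException;
	import java.util.Comparator;

	public interface Order<T> extends Comparator<T> {
		// hash consistent with this ordering
		int hash(T object);

		// ordering-specific serialization, e.g. delta-coded counts
		void write(DataOutput out, T object) throws IOException;

		// deserialize the next object from the stream
		T read(DataInput in) throws IOException;
	}

That way a job can ask for whatever ordering it needs and get
hashing, sorting, and I/O that all agree with each other.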

Is this code you'd be interested in?

Thanks,

Trevor

(by the way, Doug, you may remember me from a panel on open source
search at the OSIR workshop this year)

