hadoop-mapreduce-user mailing list archives

From Ahmed Abdeen Hamed <ahmed.elma...@gmail.com>
Subject distributing a time consuming single reduce task
Date Mon, 23 Jan 2012 20:29:51 GMT
Hello friends,

I wrote a reduce() that receives a large dataset as Text values from the
map(). The purpose of the reduce() is to compute the distance between each
pair of items in the values. When I do this, I run out of memory. I tried
increasing the heap size (roughly as shown below), but that didn't scale
either. I am wondering if there is a way to distribute the reduce() so that
it scales. If this is possible, can you kindly share your idea?
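
For reference, this is roughly how I tried raising the heap for the child
JVMs (a minimal sketch; I am assuming the mapred.child.java.opts property
here, and the 2 GB figure is just an example, my actual numbers varied):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    Configuration conf = new Configuration();
    // sketch: give each map/reduce child JVM a 2 GB heap
    conf.set("mapred.child.java.opts", "-Xmx2048m");
    Job job = new Job(conf, "brand clustering");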
Please note that it is crucial for the values to be passed together in the
fashion that I am doing, so that they can be clustered into groups.

Here is what the reduce() looks like:

public static class BrandClusteringReducer extends Reducer<Text, Text,
        Text, Text> {

    private final Text key = new Text("1");
    private final Text group = new Text();

    // Complete-Link clusterer (LingPipe); MAX_DISTANCE and EDIT_DISTANCE
    // are constants defined elsewhere in the job class
    private final HierarchicalClusterer<String> clClusterer =
            new CompleteLinkClusterer<String>(MAX_DISTANCE, EDIT_DISTANCE);

    public void reduce(Text productID, Iterable<Text> brandNames,
            Context context) throws IOException, InterruptedException {

        // buffer every brand name for this product into one set; this is
        // the structure that grows with the value list and is where the
        // memory goes
        Set<String> inputSet = new HashSet<String>();
        for (Text brand : brandNames) {
            inputSet.add(brand.toString());
        }

        // perform complete-link clustering on the whole set at once
        Set<Set<String>> clClustering = clClusterer.cluster(inputSet);

        // write each cluster out as a comma-separated list of brands
        for (Set<String> brandsSet : clClustering) {
            StringBuilder clusterBuilder = new StringBuilder();
            for (String aBrand : brandsSet) {
                clusterBuilder.append(aBrand).append(",");
            }
            group.set(clusterBuilder.toString());
            context.write(key, group);
        }
    }
}
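
The only concrete idea I have come up with so far is to spread the pairwise
distance work across several reducers by replicating each brand to "block
pair" keys, along the lines of the hypothetical mapper below (NUM_BLOCKS,
the key layout, and the value format are all my own invention, and I still
don't see how to merge the per-block clusters back into global ones, which
is why I am asking):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public static class BlockPairMapper extends Mapper<Text, Text, Text, Text> {

        private static final int NUM_BLOCKS = 8; // hypothetical fan-out
        private final Text outKey = new Text();
        private final Text outVal = new Text();

        public void map(Text productID, Text brandName, Context context)
                throws IOException, InterruptedException {
            // assign this brand to one of NUM_BLOCKS buckets
            int b = (brandName.toString().hashCode() & Integer.MAX_VALUE)
                    % NUM_BLOCKS;
            // replicate the brand to every block pair containing its
            // bucket, so any two brands of the same product meet in some
            // reducer; the bucket id travels with the value so a reducer
            // for key (i, j) can compare only block-i brands against
            // block-j brands (or within the block when i == j) and
            // measure no pair twice
            for (int other = 0; other < NUM_BLOCKS; other++) {
                int i = Math.min(b, other);
                int j = Math.max(b, other);
                outKey.set(productID.toString() + ":" + i + ":" + j);
                outVal.set(b + ":" + brandName.toString());
                context.write(outKey, outVal);
            }
        }
    }

Each reducer would then only hold the brands for one block pair, which caps
the memory per reduce task, but the per-pair clusterings would still have
to be stitched together in a second job.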


