hadoop-user mailing list archives

From Chris Nauroth <cnaur...@hortonworks.com>
Subject Re: Improve IdentityMapper code for wordcount
Date Sun, 21 Aug 2016 13:49:17 GMT
Hello,

One quick win could be to change this line:

context.write(word, new IntWritable(val));

Instead of instantiating an IntWritable on each iteration, instantiate it once as a member variable (like what you've already done for word) and call IntWritable#set on each iteration. This might save some object allocation and garbage collection churn.
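To make the reuse pattern concrete in isolation, here is a minimal, self-contained sketch. The `MutableInt` class below is a hypothetical stand-in for Hadoop's `org.apache.hadoop.io.IntWritable` (which has the same `set`/`get` shape); in the real mapper you would reuse an `IntWritable` member directly.

```java
public class ReusePatternDemo {
    // Hypothetical stand-in mimicking IntWritable's mutable set/get API.
    static final class MutableInt {
        private int value;
        void set(int v) { value = v; }
        int get() { return value; }
    }

    // One instance allocated once and reused for every record,
    // instead of `new IntWritable(val)` per call.
    private static final MutableInt count = new MutableInt();

    static int emit(String token) {
        count.set(Integer.parseInt(token)); // mutate the shared instance
        return count.get();
    }

    public static void main(String[] args) {
        System.out.println(emit("7"));  // 7
        System.out.println(emit("42")); // 42, same object reused
    }
}
```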

Beyond that, I’d recommend either profiling or looking at the overall workflow to see if
something can be changed at a higher level.  You mentioned that this is consuming the output
of the wordcount example job.  Perhaps you can change the wordcount job’s code to write
a more efficient representation of the data for your use case.  For example, if the counts
were stored as binary integers directly, then the second job wouldn’t have to pay the cost
of re-parsing them from strings by calling Integer#valueOf.
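To illustrate what "binary integers directly" buys you, here is a small self-contained sketch using plain java.io rather than Hadoop's SequenceFile machinery: counts round-trip as fixed-width 4-byte ints, so the reader never calls Integer#valueOf on a string.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class BinaryCountsDemo {
    // Writes counts as 4-byte binary ints (conceptually what a SequenceFile
    // with IntWritable values stores), then reads them back without any
    // string parsing.
    static int[] roundTrip(int[] counts) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            for (int c : counts) out.writeInt(c); // binary, not toString()
        }
        int[] back = new int[counts.length];
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(buf.toByteArray()))) {
            for (int i = 0; i < back.length; i++) back[i] = in.readInt();
        }
        return back;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(java.util.Arrays.toString(
            roundTrip(new int[]{1, 250, 99999}))); // [1, 250, 99999]
    }
}
```

In the actual jobs, the equivalent change would be to have the wordcount job emit through SequenceFileOutputFormat and have the second job read with SequenceFileInputFormat, so keys and values arrive as already-deserialized Writables.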

I hope this helps.

--Chris Nauroth

From: xeon Mailinglist <xeonmailinglist@gmail.com>
Date: Sunday, August 21, 2016 at 3:56 AM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Improve IdentityMapper code for wordcount

Hi,
I have created a map method that reads the map output of the wordcount example [1]. This example does not use the IdentityMapper class that MapReduce offers, but it is the only way I have found to make a working identity mapper for wordcount. The problem is that this mapper is taking much longer than I expected, and I am starting to think that I may be doing something redundant. Any suggestions to improve my IdentityMapper code?

[1] Identity mapper

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountIdentityMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private Text word = new Text();

    // Each input line is "word<TAB>count", as written by the wordcount map phase [2].
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        word.set(itr.nextToken());
        Integer val = Integer.valueOf(itr.nextToken());
        context.write(word, new IntWritable(val));
    }

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
    }
}


[2] Map class that generated the mapoutput

// The original wordcount map phase: tokenizes each line and emits (word, 1).
public static class MyMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());

        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one); // `one` is reused across all emits
        }
    }

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        try {
            while (context.nextKeyValue()) {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        } finally {
            cleanup(context);
        }
    }
}

Thanks,