hadoop-common-user mailing list archives

From Holger Stenzhorn <holger.stenzh...@deri.org>
Subject Improving performance for large values in reduce
Date Thu, 07 Feb 2008 18:35:25 GMT

I am creating a small MapReduce application that works on large RDF 
dataset files in triple format (i.e. one RDF triple per line, "<subject> 
<predicate> <object>.").

In the mapper class I split each triple into its subject and object and 
then collect the subject and the object each as a key, with the complete 
triple as the value (see [1]).
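
The splitting step described above might look roughly like the following in plain Java. This is only a sketch: the regular expression and the names TripleParseSketch, TRIPLE and subjectAndObject are assumptions made for illustration and are not the pattern or code used in the original application. The sketch keeps group 1 as the subject and group 3 as the object.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TripleParseSketch {
    // Hypothetical pattern (an assumption, not the original PATTERN):
    // group 1 = subject, group 2 = predicate, group 3 = object,
    // followed by the terminating "." of the triple.
    static final Pattern TRIPLE =
        Pattern.compile("^(\\S+)\\s+(\\S+)\\s+(.+?)\\s*\\.\\s*$");

    // Returns {subject, object}, or null if the line is not a triple.
    static String[] subjectAndObject(String line) {
        Matcher m = TRIPLE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(3) };
    }

    public static void main(String[] args) {
        String[] so = subjectAndObject("<s> <p> <o>.");
        System.out.println(so[0] + " / " + so[1]);
    }
}
```

Each of the two returned strings would then serve as a key, with the unmodified line as the value.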

In the reducer class I then gather, for each key (i.e. subject/object), 
all values that were collected for that key (see [2]).
The problem here is that the "concatenateValues(values)" method 
concatenates all values into one single string, which is then collected 
for the given key.
This works fine for a few thousand triples but "gets stuck" in the 
reduce phase if I have, e.g., more than some 300,000 triples to concatenate.

Does anybody have an idea how this could be worked around?
...or can you just tell me if the way I am doing things here is plainly stupid?! ;-)

Thank you all very much in advance!


[1]
  private static Pattern PATTERN =
  private static class TriplesMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
      String line = value.toString();
      Matcher matcher = PATTERN.matcher(line);
      if (matcher.matches()) {
        // Emit the complete triple under both its subject and its object.
        String subject = matcher.group(1);
        String object = matcher.group(3);
        output.collect(new Text(subject), new Text(line));
        output.collect(new Text(object), new Text(line));
      }
    }
  }

[2]
  private static class TriplesFileReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
      // concatenateValues(values) joins all values for this key into one string.
      output.collect(key, new Text(concatenateValues(values)));
    }
  }
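
The body of concatenateValues is not shown in the message, so the cause of the slowdown can only be guessed at. Assuming it builds the result with repeated String concatenation (which copies the growing string on every step), one possible workaround is to accumulate into a single StringBuilder instead. The sketch below is plain Java without Hadoop types; the class name ConcatSketch and the separator parameter are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Iterator;

class ConcatSketch {
    // Joins all values using one StringBuilder, so the total work grows
    // roughly linearly with the combined length of the values.
    // (Assumption: the original method used repeated String concatenation.)
    static String concatenateValues(Iterator<String> values, String separator) {
        StringBuilder sb = new StringBuilder();
        while (values.hasNext()) {
            sb.append(values.next());
            if (values.hasNext()) {
                sb.append(separator);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Iterator<String> values =
            Arrays.asList("<a> <b> <c>.", "<a> <d> <e>.").iterator();
        System.out.println(concatenateValues(values, "\n"));
    }
}
```

If a single concatenated value per key is not strictly required, another option would be to call output.collect(key, value) once per value inside the reducer's loop, so that no large string is ever held in memory. Calling reporter.progress() in a long-running loop tells the framework that the task is still alive.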
