hadoop-common-user mailing list archives

From Nathan Wang <nathanw...@yahoo.com>
Subject Re: Improving performance for large values in reduce
Date Thu, 07 Feb 2008 23:16:55 GMT
It depends on the uniqueness of your input data, and maybe on how you implemented concatenateValues.
You're collecting each line twice, once under its subject and once under its object, so the
original line gets concatenated twice again in the reducer.

If you have many rows sharing the same subjects and objects, you'll end up with a huge string
full of duplicated substrings.
I don't know why you want to do it this way.  Check your logic.

If you use + to concatenate strings in concatenateValues, it'll greatly slow down the operation.
Use StringBuilder.append instead.
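For what it's worth, concatenateValues isn't shown in the original post, so here's only a sketch of what the StringBuilder version could look like (plain Strings stand in for Hadoop's Text, and the space separator is a guess):

```java
import java.util.Arrays;
import java.util.Iterator;

public class ConcatDemo {
    // Hypothetical reconstruction of concatenateValues.
    // StringBuilder.append keeps the work linear in the total output size;
    // repeated String "+" copies the whole accumulated buffer on every
    // append, which is quadratic for hundreds of thousands of values.
    static String concatenateValues(Iterator<String> values) {
        StringBuilder sb = new StringBuilder();
        while (values.hasNext()) {
            sb.append(values.next());
            if (values.hasNext()) {
                sb.append(' ');  // separator chosen for illustration
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Iterator<String> it =
            Arrays.asList("<s> <p> <o1>.", "<s> <p> <o2>.").iterator();
        // prints: <s> <p> <o1>. <s> <p> <o2>.
        System.out.println(concatenateValues(it));
    }
}
```

Even with StringBuilder, though, a single string holding 300,000+ triples is still a lot of memory for one reduce call.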

----- Original Message ----
From: Holger Stenzhorn [holger.stenzhorn@deri.org]

I am creating a small MapReduce application that works on large RDF dataset files in triple
format (i.e. one RDF triple per line, "<subject> <predicate> <object>.").
In the mapper class I split up the triples into subject and object and then collect each subject/object
as key plus the related complete triple as value (see [1]).
In the reducer class I now collect for each key again all collected values for the given key
(i.e. subject/object) (see [2]):
The problem here is that the "concatenateValues(values)" method concatenates all values into
one single string which then is collected for the given key.
This works fine for a few thousand triples but "gets stuck" in the reduce phase if I
have, e.g., more than some 300,000 triples to concatenate.
Does anybody have a solution for how this could be worked around?
...or just tell me if the way I am doing things here is plainly stupid?! ;-)
Thank you all very much in advance!
[1] The mapper:

private static final Pattern PATTERN =
    // the actual pattern was cut off in this message; something like the
    // following would match '<subject> <predicate> <object>.' lines:
    Pattern.compile("(<[^>]+>)\\s+(<[^>]+>)\\s+(<[^>]+>)\\s*\\.");

private static class TriplesMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        Matcher matcher = PATTERN.matcher(line);
        if (matcher.matches()) {
            String subject = matcher.group(1);
            String object = matcher.group(3);
            // emit the full triple once under its subject, once under its object
            output.collect(new Text(subject), new Text(line));
            output.collect(new Text(object), new Text(line));
        }
    }
}

[2] The reducer:

private static class TriplesFileReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        output.collect(key, new Text(concatenateValues(values)));
    }
}