hadoop-mapreduce-user mailing list archives

From Christian Schneider <cschneiderpub...@gmail.com>
Subject OutOfMemory during Plain Java MapReduce
Date Thu, 07 Mar 2013 22:18:56 GMT
During the reduce phase or afterwards (I don't really know how to debug it),
I get a heap OutOfMemory exception.

I guess this is because the value emitted by the reduce task (a custom Writable)
holds a set with a lot of user IDs.
The setup is quite simple. These are the relevant classes I used:

// The Reducer
// It just adds all userIds from the Iterable to the UserSetWritable.
public class UserToAppReducer extends Reducer<Text, Text, Text, UserSetWritable> {

    @Override
    protected void reduce(final Text appId, final Iterable<Text> userIds, final Context context)
            throws IOException, InterruptedException {
        final UserSetWritable userSet = new UserSetWritable();

        final Iterator<Text> iterator = userIds.iterator();
        while (iterator.hasNext()) {
            userSet.add(iterator.next().toString());
        }

        context.write(appId, userSet);
    }
}

// The Custom Writable
// I needed to implement my own toString method to bring the output into the
// right format. Maybe I can also do this with my own OutputFormat class.
public class UserSetWritable implements Writable {
    private final Set<String> userIds = new HashSet<String>();

    public void add(final String userId) {
        this.userIds.add(userId);
    }

    @Override
    public void write(final DataOutput out) throws IOException {
        // readFields() reads the element count first, so it is written first here.
        out.writeInt(this.userIds.size());
        for (final String userId : this.userIds) {
            out.writeUTF(userId);
        }
    }

    @Override
    public void readFields(final DataInput in) throws IOException {
        final int size = in.readInt();
        for (int i = 0; i < size; i++) {
            final String readUTF = in.readUTF();
            this.userIds.add(readUTF);
        }
    }

    @Override
    public String toString() {
        String result = "";
        for (final String userId : this.userIds) {
            result += userId + "\t";
        }

        result += this.userIds.size();
        return result;
    }
}
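
To check the serialization side in isolation, the size-prefixed layout that readFields() expects can be sketched with plain java.io streams and no Hadoop dependency. The class name UserSetCodec and its two methods are hypothetical stand-ins for illustration; only the DataOutput/DataInput calls mirror the Writable above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Hadoop: mirrors the write/readFields
// layout (an int count followed by that many UTF strings).
final class UserSetCodec {

    static byte[] encode(final Set<String> userIds) {
        final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(userIds.size()); // count first, so the reader knows when to stop
            for (final String userId : userIds) {
                out.writeUTF(userId);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static Set<String> decode(final byte[] data) {
        final Set<String> userIds = new HashSet<String>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            final int size = in.readInt();
            for (int i = 0; i < size; i++) {
                userIds.add(in.readUTF());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return userIds;
    }
}
```

Encoding a set and decoding the result should give back an equal set. If it does not, the write and read sides disagree about the layout.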

As the OutputFormat I used the default TextOutputFormat.
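
TextOutputFormat writes a non-Text value by calling its toString(), so the complete line for one appId exists as a single String on the heap. Building that line with += in a loop also copies the partial result on every iteration. A minimal sketch of the same output format using StringBuilder follows. The class name UserSetFormatter is hypothetical, and this only reduces temporary garbage; it does not avoid holding the full set and the full line in memory.

```java
import java.util.Set;

// Hypothetical helper: produces the same tab-separated line as the
// toString() above (each id followed by a tab, then the set size).
final class UserSetFormatter {

    static String format(final Set<String> userIds) {
        final StringBuilder result = new StringBuilder();
        for (final String userId : userIds) {
            result.append(userId).append('\t');
        }
        result.append(userIds.size());
        return result.toString();
    }
}
```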

A potential problem could be that a reduce task is going to write files >600MB
while our mapred.child.java.opts is set to ~380MB.
I dug deeper into TextOutputFormat and saw that
HdfsDataOutputStream does not implement .flush().
And .flush() is also not called in TextOutputFormat. Does this mean that the whole
file is kept in RAM and only persisted at the end of processing?
If so, that would of course lead to the exception.
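
The reasoning above assumes that data leaves the process only when flush() is called. For plain java.io streams that is not the case: a buffered stream hands its buffer to the underlying stream whenever the buffer fills up. The sketch below demonstrates this with BufferedOutputStream and a hypothetical CountingSink. Whether the HDFS output stream behaves the same way is not shown here and would need to be checked against the Hadoop sources.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical sink that only counts the bytes that actually reach it.
final class CountingSink extends OutputStream {
    long received = 0;

    @Override
    public void write(final int b) {
        received++;
    }

    @Override
    public void write(final byte[] b, final int off, final int len) {
        received += len;
    }
}

final class FlushDemo {

    // Writes `total` single bytes through a buffer of `bufferSize` bytes
    // without ever calling flush() or close(), and returns how many bytes
    // reached the sink anyway.
    static long bytesReachingSinkWithoutFlush(final int total, final int bufferSize) {
        final CountingSink sink = new CountingSink();
        // Deliberately not closed: close() would flush and hide the effect.
        final BufferedOutputStream out = new BufferedOutputStream(sink, bufferSize);
        try {
            for (int i = 0; i < total; i++) {
                out.write('x');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.received;
    }
}
```

With 20 bytes written through an 8-byte buffer, 16 bytes arrive at the sink before any flush; only the last partial buffer stays in memory.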

With Pig I am able to query the same data, even with only one reducer.
But I have a bet that I can make it faster with plain MapReduce :)

Could you help me debug this and maybe point me in the right direction?
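
One low-tech way to see where the heap fills up is to print the JVM's own memory counters every so many records. This is a sketch only, assuming the reducer can afford an occasional log line. The class HeapProbe and its method names are hypothetical; the Runtime calls are standard Java.

```java
// Hypothetical helper for ad-hoc debugging; not part of Hadoop.
final class HeapProbe {

    // Heap currently in use by this JVM, in bytes.
    static long usedBytes() {
        final Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // One-line summary, e.g. to print every 100,000 records inside reduce().
    static String summary(final String label) {
        final long mb = 1024L * 1024L;
        final Runtime rt = Runtime.getRuntime();
        return label + ": used=" + (usedBytes() / mb) + "MB, max=" + (rt.maxMemory() / mb) + "MB";
    }
}
```

For example, System.err.println(HeapProbe.summary("after add")) inside the while loop of reduce() would show whether usage climbs while the set is being filled or only once the value is written. Adding the HotSpot option -XX:+HeapDumpOnOutOfMemoryError to mapred.child.java.opts is another possibility, assuming the task JVM is HotSpot and has a writable working directory for the dump.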

Best Regards,
