hadoop-mapreduce-user mailing list archives

From Blanca Hernandez <Blanca.Hernan...@willhaben.at>
Subject AW: AW: Extremely amount of memory and DB connections by MR Job
Date Tue, 30 Sep 2014 09:28:12 GMT
Hi,

given your answer and the questions that I cannot answer, I realize that I lack a lot of Hadoop
understanding. I will continue with the analysis and read the documentation in more depth. Do you
know of a tutorial or similar resource where I can fully understand how Hadoop works and how it
executes an MR job?

Or a company using Hadoop + MongoDB that could take on a consulting role? (Preferably in Austria.)

It would be great to learn this topic properly, and not only by trial and error and by guessing
how other examples work.

Many thanks for all your feedback. Once I understand it better, I will come back.

Best regards,
Blanca

From: java8964 [mailto:java8964@hotmail.com]
Sent: Monday, 29 September 2014 17:01
To: user@hadoop.apache.org
Subject: RE: AW: Extremely amount of memory and DB connections by MR Job

Here are my suggestions, which were originally aimed at improving efficiency:

1) In your case, you could use "StringBuilder". Its append method should be more efficient
for concatenating your string data.
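2) What I mean by reusing the Text object is the following:
     public class AveragePriceMapper extends Mapper<Object, BasicDBObject, Text, BSONWritable> {
          // One Text instance per mapper, reused for every record
          private final Text data = new Text();

          @Override
          public void map(final Object key, final BasicDBObject val, final Context context)
                    throws IOException, InterruptedException {
               // id and bsonWritable are built as in your original mapper.
               // Instead of doing "new Text(id)" for every record,
               // set the new value on the reused object.
               data.set(id);
               context.write(data, bsonWritable);
          }
     }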
As you can see, you avoid creating a large number of Text objects in the map method, which
can be invoked many times. By reusing the same Text object in each mapper, you avoid asking the
GC to clean up all those Text objects. I believe you can do the same for BSONWritable. Check
the javadoc for that class.
3) 9G is a lot of heap for a map task. How many map tasks does your job generate? Is your source
splittable? For one block of data (I assume it is 128M or 256M), I cannot imagine you need a 9G
heap for a mapper. Your OOM may be caused by all the concurrently running mapper tasks together
running out of physical memory.
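
A minimal sketch of suggestion 1, assuming the record can be treated as a map from field name
to value (BasicDBObject is replaced by a plain Map here so the example runs without MongoDB).
GroupKeyBuilder, buildKey, the "_" separator and the field names are hypothetical; the real
value of AveragePriceGlobal.SEPARATOR is not shown in this thread.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop or the MongoDB connector.
final class GroupKeyBuilder {
    // Assumed stand-in for AveragePriceGlobal.SEPARATOR.
    static final String SEPARATOR = "_";

    // Builds the grouping key with a single StringBuilder instead of repeated "+=".
    // Like the original loop, it appends a separator after every value,
    // so the key ends with a trailing separator.
    static String buildKey(Map<String, Object> record, List<String> propertyIds) {
        StringBuilder id = new StringBuilder();
        for (String propertyId : propertyIds) {
            id.append(record.get(propertyId)).append(SEPARATOR);
        }
        return id.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("make", "Audi");   // hypothetical field names
        record.put("model", "A3");
        record.put("year", 2010);
        System.out.println(buildKey(record, Arrays.asList("make", "model", "year")));
    }
}
```

With these made-up fields the key comes out as Audi_A3_2010_, which matches the kind of grouping
described in the thread. In the real mapper the resulting string would be passed to data.set(...)
on the reused Text object.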

1) How many mapper tasks in total are generated by your job?
2) How many data/task nodes do you have in your cluster? On the OOM node, how many mapper tasks
are being kicked off? You can find all this information in the JobTracker in MR1, or the AM in MR2.
3) If each mapper is assigned 9G of memory, and there are multiple mappers running on the OOM
node, how much real physical memory do you have?
4) You can see the input source for each mapper task in the JobTracker or AM. If the failed mapper
is always for the same block, then examine that source data file. You need a really good
reason to allocate a 9G heap to a mapper task. Did you originally start from 1G?
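
The arithmetic behind question 3 can be sketched as follows. This is a rough check only: it
ignores JVM non-heap overhead and whatever the OS and other daemons need, and the class and
method names are hypothetical. The count of 2 concurrent mappers is an assumed value for
illustration; the 9G figures are the ones mentioned in this thread.

```java
// Hypothetical helper for a back-of-the-envelope memory check on one node.
final class MapperMemoryCheck {
    // Heap needed if every concurrently running mapper grows to its maximum heap.
    static long requiredHeapMb(int concurrentMappers, long heapPerMapperMb) {
        return (long) concurrentMappers * heapPerMapperMb;
    }

    // True only if the combined maximum heaps fit into the node's physical memory.
    static boolean fitsInPhysicalMemory(int concurrentMappers, long heapPerMapperMb,
                                        long physicalMb) {
        return requiredHeapMb(concurrentMappers, heapPerMapperMb) <= physicalMb;
    }

    public static void main(String[] args) {
        long heapPerMapperMb = 9L * 1024; // 9G heap per mapper, as in the thread
        long physicalMb = 9L * 1024;      // 9G available on the test server
        int concurrentMappers = 2;        // assumed value for illustration
        System.out.println("required MB: "
                + requiredHeapMb(concurrentMappers, heapPerMapperMb));
        System.out.println("fits: "
                + fitsInPhysicalMemory(concurrentMappers, heapPerMapperMb, physicalMb));
    }
}
```

Under these assumptions two mappers could ask for 18432 MB on a node with 9216 MB, so the node
can run out of physical memory even though no single mapper exceeds its own heap limit.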

Yong

________________________________
From: Blanca.Hernandez@willhaben.at
To: user@hadoop.apache.org
Subject: AW: Extremely amount of memory and DB connections by MR Job
Date: Mon, 29 Sep 2014 14:16:24 +0000
Thanks for your answer.
To your questions:

1. When you claim 96G RAM, I am not sure what you mean.
It is not 96 GB of RAM; it is 9 GB that our test server has available (is that too small?).
2. Your code is not efficient, as it uses "+=" on String.
I need the concatenation of strings for the emitted ID (or at least I don't have a better idea),
since I want to group my objects by, e.g., Audi_A3_2010, another group Audi_A3_2011, and so on.
These values are fields in the objects I get from the DB (BasicDBObject is a MongoDB class).
3. You could have reused the Text object in your mapper.
I am not sure I understand your point. I create a new BSONWritable bsonWritable = new BSONWritable(val);
from my database object, since the one given by MongoDB is not mutable and hence not accepted
by the Hadoop API as an output.

Now your other questions:
1) Are there any successful mappers?
Yes, but after a while the job seems to need more memory; it runs very slowly until it crashes.
2) The OOM mapper, is it always on the same block? If so, you need to dig into the source
data for that block, to work out why it causes the OOM.
I am not sure about this. Is there a hint in the logs to figure it out?
3) Did you give a reasonable heap size to the mapper? What is it?
9 GB (too small??)

Best regards,
Blanca



From: java8964 [mailto:java8964@hotmail.com]
Sent: Monday, 29 September 2014 15:43
To: user@hadoop.apache.org
Subject: RE: Extremely amount of memory and DB connections by MR Job

I don't have any experience with MongoDB, but here are my 2 cents.

Your code is not efficient: it uses "+=" on String, and you could have reused the Text object
in your mapper. Text is a mutable class, meant to be reused, so you can avoid creating it again
and again with "new Text()" in the mapper. My guess is that BSONWritable is a similar
mutable class, if it is meant to be used like the rest of the Hadoop Writable classes.

But even so, that should only make your mapper run slower, because a lot of objects need to be
garbage collected; it should not cause an OOM.

When you mention 96G of RAM, I am not sure what you mean. From what you said, the job failed in
the mapper stage, so let's focus on the mapper. What max heap size did you give to the mapper
task? I don't think 96G is the setting you meant to give to each mapper task. Otherwise, the only
cause I can think of is that millions of Strings are being appended in one record by "+=", and
that causes the OOM.

You need to answer the following questions yourself:

1) Are there any successful mappers?
2) The OOM mapper, is it always on the same block? If so, you need to dig into the source
data for that block, to work out why it causes the OOM.
3) Did you give a reasonable heap size to the mapper? What is it?

Yong

________________________________
From: Blanca.Hernandez@willhaben.at
To: user@hadoop.apache.org
Subject: Extremely amount of memory and DB connections by MR Job
Date: Mon, 29 Sep 2014 12:57:41 +0000
Hi,

I am using a Hadoop MapReduce job + MongoDB.
It runs against a database that is 252 GB in size. During the job the number of connections goes
over 8000, and we have already given it 9 GB of RAM. The job still crashes with an OutOfMemory
error with only 8% of the mapping done.
Are these numbers normal? Or did we miss something in the configuration?
I attach my code, just in case the problem is there.

Mapper:

public class AveragePriceMapper extends Mapper<Object, BasicDBObject, Text, BSONWritable>
{
    @Override
    public void map(final Object key, final BasicDBObject val, final Context context)
            throws IOException, InterruptedException {
        String id = "";
        for(String propertyId : currentId.split(AveragePriceGlobal.SEPARATOR)){
            id += val.get(propertyId) + AveragePriceGlobal.SEPARATOR;
        }
        BSONWritable bsonWritable = new BSONWritable(val);
        context.write(new Text(id), bsonWritable);
    }
}


Reducer:
public class AveragePriceReducer extends Reducer<Text, BSONWritable, Text, Text>  {
    public void reduce(final Text pKey, final Iterable<BSONWritable> pValues, final Context pContext)
            throws IOException, InterruptedException {
        while(pValues.iterator().hasNext() && continueLoop){
            BSONWritable next = pValues.iterator().next();
            //Make some calculations
        }
        pContext.write(new Text(currentId),
                new Text(new MyClass(currentId, AveragePriceGlobal.COMMENT, 0, 0).toString()));

    }
}

The configuration includes a query which filters the objects to analyze (not all of the
252 GB will be analyzed).

Many thanks. Best regards,
Blanca
