hadoop-mapreduce-user mailing list archives

From Varad Meru <varad_m...@persistent.co.in>
Subject RE: Very slow MapReduce Job
Date Tue, 30 Aug 2011 07:18:18 GMT
Hi,

The awk script we wrote for our function took only 40 seconds to process 174 MB of data, whereas
the MR code takes more than 1 hour when run in Eclipse.
We also ran a simple text-processing MR job as a check, and it finished in 4 minutes in Eclipse.
I am running in Eclipse to verify the functionality of our program.
My split size is the default split size.
(I called split.getLength() and it showed 332K bytes ~ 32 MB. I don't know why it wasn't 64
MB.)

I have written a basic parser using if-else statements inside the EMailClass which implements
Writable.


-----Original Message-----
Hi Varad,

What is your split size, and how many nodes are in the cluster you are running?

I think the issue is with generating the right number of splits, which
determines the number of map tasks that will run. Processing all of the data,
or most of it, on only a few mappers will not give you the MapReduce
advantage of parallelism.
        
Regards,
Ravi Teja
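
The relationship between split size and map count described above can be sketched in plain Java. This is a rough estimate only, assuming the default FileInputFormat rule (split size = max(minSize, min(maxSize, blockSize)), with a 10% slack on the last split); the class and method names are hypothetical and nothing here comes from the attached code.

```java
public class SplitEstimate {
    // Assumption: mirrors the slack factor FileInputFormat applies to the final split.
    private static final double SPLIT_SLOP = 1.1;

    // Assumed default rule: clamp the block size between the min and max split size.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Counts the splits (and therefore map tasks) for a single splittable file.
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // the tail (up to 1.1 x splitSize) becomes the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileLength = 174 * mb; // file size quoted in the thread
        for (long blockMb : new long[] {32, 64}) {
            long splitSize = computeSplitSize(blockMb * mb, 1L, Long.MAX_VALUE);
            System.out.println(blockMb + " MB splits -> "
                    + countSplits(fileLength, splitSize) + " map tasks");
        }
    }
}
```

Under these assumptions the 174 MB file yields 6 map tasks with 32 MB splits and 3 with 64 MB splits. A custom InputFormat that overrides getSplits() or isSplitable() may behave differently.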


-----Original Message-----
From: Varad Meru [mailto:varad_m...@persistent.co.in] 
Sent: Monday, August 29, 2011 3:30 PM
To: mapreduce-user@hadoop.apache.org
Cc: varad.m...@gmail.com
Subject: Very slow MapReduce Job

Hi,

I wrote a custom InputFormat for parsing the Enron email corpus; it is in
the attached file named EmailInputFormat.

I have attached the code in a text file, along with a sample input mail as a
text document.

The EmailClass implements Writable and all of the methods it requires, and
also contains an initiate function to initialize the values in that class.

This initiate method is written in EmailClass.java.

It is called by the nextKeyValue method, which is written in
EmailRecordReader.txt.

------------------------------------
Question:
1. Is it feasible to build large custom objects within nextKeyValue() when
running in Hadoop?
2. An MR program that does the simple task of emitting the message-id and the
"from" email-id for the Enron corpus of 6 lakh (600,000) emails merged into
one file (174 MB) takes around 50 minutes on a pseudo-distributed
(single-node) cluster. This is very slow. Please help me with this as well.
3. Would making the value field in EMailRecordReader static help in this
situation?
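
Questions 1 and 3 both concern object allocation per record. A minimal sketch of the usual alternative, reusing one mutable value object across records instead of building a new one each time, might look like the following. EmailRecord, reset, acceptLine and parseAll are hypothetical stand-ins, not the attached EmailClass or EmailRecordReader, and the sample headers are invented. Whether this helps depends on where the time is actually being spent.

```java
import java.util.Arrays;
import java.util.List;

public class ReuseSketch {
    /** Hypothetical stand-in for the thread's EmailClass; only two fields are kept. */
    static final class EmailRecord {
        String messageId = "";
        String from = "";

        /** Clears the fields so the same instance can hold the next email. */
        void reset() {
            messageId = "";
            from = "";
        }

        /** Fills a field from one header line; all other lines are ignored. */
        void acceptLine(String line) {
            if (line.startsWith("Message-ID:")) {
                messageId = line.substring("Message-ID:".length()).trim();
            } else if (line.startsWith("From:")) {
                from = line.substring("From:".length()).trim();
            }
        }
    }

    /** Parses every email with a single reused record; returns the number parsed. */
    static int parseAll(List<List<String>> emails, StringBuilder out) {
        EmailRecord value = new EmailRecord(); // allocated once, as an instance field would be
        int count = 0;
        for (List<String> email : emails) {
            value.reset();
            for (String line : email) {
                value.acceptLine(line);
            }
            out.append(value.messageId).append('\t').append(value.from).append('\n');
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Invented sample data, not taken from the Enron corpus.
        List<List<String>> emails = Arrays.asList(
                Arrays.asList("Message-ID: <id-1@example.com>", "From: a@example.com", "", "body"),
                Arrays.asList("Message-ID: <id-2@example.com>", "From: b@example.com"));
        StringBuilder out = new StringBuilder();
        parseAll(emails, out);
        System.out.print(out);
    }
}
```

In a record reader the reused object would normally be an ordinary instance field. A static field is shared by every reader in the same JVM, so it is not needed for reuse.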


Thanks in advance,
Varad.


------------------------------------
Varad Meru| Software Engineer
varad_m...@persistent.co.in
Persistent Systems and Solution Ltd. | Partners in Innovation |
www.persistentsys.com
DISCLAIMER
==========
This e-mail may contain privileged and confidential information which is the
property of Persistent Systems Ltd. It is intended only for the use of the
individual or entity to which it is addressed. If you are not the intended
recipient, you are not authorized to read, retain, copy, print, distribute
or use this message. If you have received this communication in error,
please notify the sender and delete all copies of this message. Persistent
Systems Ltd. does not accept any liability for virus infected mails.
