hadoop-mapreduce-user mailing list archives

From Ravi Teja <ravit...@huawei.com>
Subject RE: Very slow MapReduce Job
Date Tue, 30 Aug 2011 06:17:28 GMT
Hi Varad,

What is your split size, and how many nodes are in your cluster?

I think the issue is with generating the right number of splits, which
determines the number of map tasks that will run. Processing all of the
data, or most of it, on only a few mappers will not give you MapReduce's
advantage of parallelism.
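
To make the relationship between split size and map count concrete, here
is a minimal sketch in plain Java. It assumes the custom InputFormat
extends FileInputFormat and is splittable, in which case the split size is
max(minSize, min(maxSize, blockSize)) and the last split may be up to 10%
larger than the rest. The 64 MB block size and the 16 MB cap are assumed
values for illustration, not figures from this thread, and SplitEstimate is
a hypothetical helper class, not part of Hadoop.

```java
public class SplitEstimate {
    // Slack factor FileInputFormat applies: the final split may be up to 10% oversized.
    private static final double SPLIT_SLOP = 1.1;

    /** Split size rule used by FileInputFormat: max(minSize, min(maxSize, blockSize)). */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Number of splits (and therefore map tasks) for one splittable file. */
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        // Carve off full-size splits while more than 1.1 split sizes remain.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            remaining -= splitSize;
            splits++;
        }
        // Whatever is left becomes the final split.
        if (remaining > 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileLength = 174 * mb;   // size of the merged corpus file
        long blockSize = 64 * mb;     // assumed HDFS block size

        // No cap on the split size: the block size decides.
        long defaultSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        System.out.println("default splits: " + countSplits(fileLength, defaultSize));

        // Assumed 16 MB cap on the split size: more, smaller splits.
        long cappedSize = computeSplitSize(blockSize, 1L, 16 * mb);
        System.out.println("capped splits:  " + countSplits(fileLength, cappedSize));
    }
}
```

In a real job the cap would be set on the job itself, for example with
FileInputFormat.setMaxInputSplitSize(job, bytes) in the new API. More
splits only help if the cluster has enough map slots to run them at the
same time.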
	
Regards,
Ravi Teja


-----Original Message-----
From: Varad Meru [mailto:varad_meru@persistent.co.in] 
Sent: Monday, August 29, 2011 3:30 PM
To: mapreduce-user@hadoop.apache.org
Cc: varad.meru@gmail.com
Subject: Very slow MapReduce Job

Hi,

I wrote a custom InputFormat for parsing the Enron email corpus; it is
attached in the file named EmailInputFormat.

I have attached the code as a text file, along with a sample input mail as
a text document.

EmailClass implements Writable, including all of the required methods, and
also contains an initiate function that initializes the values in that
class.

This initiate method is written in EmailClass.java.

It is called by the nextKeyValue method, which is written in
EmailRecordReader.txt.

------------------------------------
Question:
1. Is it feasible to build large custom objects within nextKeyValue() when
running in Hadoop?
2. An MR program that does the simple task of emitting the message-id and
the from-field email-id from the Enron corpus of 6 lakh (600,000) emails,
merged into one file (174 MB), takes around 50 minutes on a
pseudo-distributed (single-node) cluster. This is very slow. Please help me
with this too.
3. Can making the value field static in EmailRecordReader help in this
situation?
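
The pattern behind questions 1 and 3 can be sketched as follows: keep one
value object as an instance field of the record reader and refill it on
every call, rather than allocating a new object per record. Hadoop's own
LineRecordReader reuses its Text value in this way. The sketch below is
plain Java with hypothetical stand-ins (Email, EmailReader) because the
attached classes are not reproduced here; it is not the actual
EmailRecordReader code.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class ReuseSketch {
    /** Hypothetical stand-in for EmailClass: a mutable value refilled per record. */
    static final class Email {
        String messageId;
        String from;

        void initiate(String messageId, String from) {
            this.messageId = messageId;
            this.from = from;
        }
    }

    /** Hypothetical reader: the Email instance is created once and then reused. */
    static final class EmailReader {
        private final Iterator<String[]> source;
        // Instance field, not static: each reader owns exactly one value object.
        private final Email value = new Email();

        EmailReader(List<String[]> records) {
            this.source = records.iterator();
        }

        boolean nextKeyValue() {
            if (!source.hasNext()) {
                return false;
            }
            String[] record = source.next();
            value.initiate(record[0], record[1]); // overwrite fields, no allocation
            return true;
        }

        Email getCurrentValue() {
            return value;
        }
    }

    /** Counts how many distinct Email objects the reader hands out. */
    static int countDistinctInstances(List<String[]> records) {
        EmailReader reader = new EmailReader(records);
        Set<Email> seen =
            Collections.newSetFromMap(new IdentityHashMap<Email, Boolean>());
        while (reader.nextKeyValue()) {
            seen.add(reader.getCurrentValue());
        }
        return seen.size();
    }
}
```

With this layout an instance field already avoids per-record allocation, so
a static field adds nothing for that purpose. Whether allocation is the
actual bottleneck in the 50-minute run is not established by this sketch.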


Thanks in advance,
Varad.


------------------------------------
Varad Meru| Software Engineer
varad_meru@persistent.co.in
Persistent Systems and Solution Ltd. | Partners in Innovation |
www.persistentsys.com


