hadoop-mapreduce-user mailing list archives

From Sanjay Subramanian <Sanjay.Subraman...@wizecommerce.com>
Subject Re: Splitting input file - increasing number of mappers
Date Sat, 06 Jul 2013 15:18:02 GMT
More mappers should make it faster.
     You can try this parameter:
      mapreduce.input.fileinputformat.split.maxsize=<sizeinbytes>
     It caps the input split size, so a smaller value forces more mappers to run.
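
To get a feel for what a given value does, here is a rough sketch of the split arithmetic as I understand it from FileInputFormat: the split size is max(minSize, min(maxSize, blockSize)), and the last split is allowed to run up to 10% over. The class name, the 60 MB file and the 16 MB cap are only assumptions for illustration, and this applies to splittable input such as plain text. In a driver the same cap can be set with FileInputFormat.setMaxInputSplitSize(job, size).

```java
// Rough sketch of the split-size arithmetic; sizes below are illustrative assumptions.
final class SplitEstimate {
    // The last split may be up to 10% larger than the split size.
    static final double SPLIT_SLOP = 1.1;

    // Split size = max(minSize, min(maxSize, blockSize)).
    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Count the splits (and so the map tasks) for one splittable file.
    static int estimateSplits(long fileSize, long splitSize) {
        int splits = 0;
        long remaining = fileSize;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            remaining -= splitSize;
            splits++;
        }
        if (remaining > 0) {
            splits++; // whatever is left becomes the final split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 60 * mb;  // assumed: a file just under one block
        long blockSize = 64 * mb;

        long defaultSplit = splitSize(1, Long.MAX_VALUE, blockSize);
        long cappedSplit = splitSize(1, 16 * mb, blockSize); // split.maxsize = 16 MB

        System.out.println(estimateSplits(fileSize, defaultSplit)); // 1
        System.out.println(estimateSplits(fileSize, cappedSplit));  // 4
    }
}
```

So with these assumed numbers, a 16 MB cap turns one mapper into four. Whether that is actually faster depends on how much work each map call does compared with the task start-up cost.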


Your use case also seems a good candidate for defining a Combiner, because you are grouping
keys based on a criterion.
The only gotcha is that Combiners are not guaranteed to run.
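
Because the Combiner may or may not run, the job has to produce the same answer either way. A minimal plain-Java sketch of that property follows (no Hadoop classes; the class name, the per-key counting and the sample data are assumptions for illustration, not your actual job):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simulates a job whose combiner may or may not run; hypothetical data and names.
final class CombinerSketch {
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Sums values per key. Used here as both the "combiner" and the "reducer".
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            out.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return out;
    }

    // Each inner list is the output of one mapper.
    static Map<String, Integer> run(List<List<Map.Entry<String, Integer>>> mapOutputs,
                                    boolean combine) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> oneMapper : mapOutputs) {
            if (combine) {
                // Pre-aggregate on the map side: fewer records cross the network.
                shuffled.addAll(sumByKey(oneMapper).entrySet());
            } else {
                shuffled.addAll(oneMapper);
            }
        }
        return sumByKey(shuffled); // the reduce step
    }

    static List<List<Map.Entry<String, Integer>>> sample() {
        List<List<Map.Entry<String, Integer>>> outputs = new ArrayList<>();
        outputs.add(Arrays.asList(pair("a", 1), pair("b", 1), pair("a", 1)));
        outputs.add(Arrays.asList(pair("a", 1), pair("c", 1)));
        return outputs;
    }

    public static void main(String[] args) {
        System.out.println(run(sample(), false)); // {a=3, b=1, c=1}
        System.out.println(run(sample(), true));  // {a=3, b=1, c=1}
    }
}
```

Summing works because it is associative and commutative. Any Combiner logic should be checked the same way: the final output must not change whether it is applied zero, one or several times.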

Give these a shot.

Good luck

sanjay



From: parnab kumar <parnab.2007@gmail.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Saturday, July 6, 2013 12:50 AM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Splitting input file - increasing number of mappers

Hi,

        I have an input file where each line is of the form:

           <URL> <A NUMBER>

      URLs whose numbers are within a threshold of each other are considered similar. My task
is to group together all similar URLs. For this I wrote a custom Writable in which I implemented
the threshold check in the compareTo method. Therefore, when Hadoop sorts, the similar URLs are
grouped together. This seems to work fine.
      I have the following queries:
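
The approach described above can be sketched roughly as follows. This is only an illustration in plain Java: the class name, the field names and the threshold value of 2 are assumptions, and a real key would implement Hadoop's WritableComparable with its write and readFields methods, which are omitted here.

```java
// Hypothetical key: two keys compare as equal when their numbers are within THRESHOLD.
final class ThresholdKey implements Comparable<ThresholdKey> {
    static final long THRESHOLD = 2; // assumed value, for illustration only

    final String url;
    final long number;

    ThresholdKey(String url, long number) {
        this.url = url;
        this.number = number;
    }

    @Override
    public int compareTo(ThresholdKey other) {
        // Threshold check: "similar" keys are reported as equal to the sorter.
        if (Math.abs(number - other.number) <= THRESHOLD) {
            return 0;
        }
        return Long.compare(number, other.number);
    }

    public static void main(String[] args) {
        ThresholdKey a = new ThresholdKey("http://a.example", 1);
        ThresholdKey b = new ThresholdKey("http://b.example", 3);
        ThresholdKey c = new ThresholdKey("http://c.example", 5);
        System.out.println(a.compareTo(b)); // 0
        System.out.println(b.compareTo(c)); // 0
        System.out.println(a.compareTo(c)); // -1
    }
}
```

One thing the sketch makes visible: a is "equal" to b and b is "equal" to c, yet a is not equal to c. A comparison like this is not transitive, so which URLs land in the same group can depend on the order in which the keys happen to be compared.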

   1>   Since I am relying mostly on the sort feature provided by Hadoop, am I decreasing
efficiency in any way? Or, since sorting is what Hadoop does best, am I actually doing the
right thing? If this is the right approach, then my job seems to rely mostly on the map task.
Therefore, will increasing the number of mappers increase efficiency?

     2> My file is no larger than 64 MB, i.e. one Hadoop block, which means no more than
one mapper will be invoked. Will splitting the file into smaller pieces increase efficiency
by invoking more mappers?

Can someone kindly provide some insight or advice regarding the above?

Thanks,
Parnab
MS student, IIT Kharagpur
