hadoop-common-user mailing list archives

From parnab kumar <parnab.2...@gmail.com>
Subject Splitting input file - increasing number of mappers
Date Sat, 06 Jul 2013 07:50:57 GMT
Hi,

        I have an input file where each line is of the form :

           <URL> <A NUMBER>

      URLs whose numbers are within a threshold of each other are considered
similar. My task is to group all similar URLs together. For this I wrote a
*custom writable* where I implemented the threshold check in the *compareTo*
method, so that when Hadoop sorts the keys, similar URLs end up next to each
other. This seems to work fine.
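To make the idea concrete, here is a minimal, Hadoop-free sketch of the threshold-based comparison. In a real job this logic would sit in the compareTo of a custom WritableComparable key class; the class name, field names, and threshold value below are hypothetical. One caveat worth noting: a threshold-based compareTo is not a true total order (it is not transitive), so "similar" groups can chain together during the sort.

```java
import java.util.Arrays;

// Hypothetical key type: a URL paired with its number. In a Hadoop job
// this would implement WritableComparable<UrlScore> instead of Comparable.
public class UrlScore implements Comparable<UrlScore> {
    static final long THRESHOLD = 10;  // assumed similarity threshold

    final String url;
    final long score;

    UrlScore(String url, long score) {
        this.url = url;
        this.score = score;
    }

    @Override
    public int compareTo(UrlScore other) {
        // Treat scores within THRESHOLD of each other as equal so that
        // the sort places similar URLs next to one another.
        long diff = this.score - other.score;
        if (Math.abs(diff) <= THRESHOLD) {
            return 0;
        }
        return diff < 0 ? -1 : 1;
    }

    public static void main(String[] args) {
        UrlScore[] records = {
            new UrlScore("http://a.example", 100),
            new UrlScore("http://b.example", 5),
            new UrlScore("http://c.example", 105),
            new UrlScore("http://d.example", 8),
        };
        // Arrays.sort on objects is stable, so records that compare
        // equal (i.e. are "similar") keep their input order.
        Arrays.sort(records);
        for (UrlScore r : records) {
            System.out.println(r.url + " " + r.score);
        }
    }
}
```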
      I have the following queries:

   1>  Since I am relying mostly on the sort feature provided by Hadoop, am
I decreasing efficiency in any way, or, since sorting is something Hadoop does
well, am I actually doing the right thing? If this is the right approach, then
my job relies mostly on the map task. Will increasing the number of mappers
therefore improve efficiency?

     2>  My file is no larger than 64 MB, i.e. one Hadoop block, which means
no more than one mapper will be invoked. Will splitting the file into smaller
pieces increase efficiency by invoking more mappers?
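For what it is worth, one way to get more mappers without physically splitting the file is to cap the input split size. This is a hedged sketch, not a definitive recipe: the property name below assumes the newer MapReduce API (on Hadoop 1.x the equivalent property is mapred.max.split.size), the jar, driver class, and paths are placeholders, and passing -D this way requires the driver to use GenericOptionsParser/Tool.

```shell
# Cap splits at 8 MB so a single 64 MB file yields roughly 8 map tasks.
# Jar name, driver class, and input/output paths are hypothetical.
hadoop jar myjob.jar MyDriver \
  -D mapreduce.input.fileinputformat.split.maxsize=8388608 \
  input/urls.txt output/
```

Another common option for line-oriented input is NLineInputFormat, which gives each mapper a fixed number of input lines.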

Can someone kindly provide some insight or advice regarding the above?

Thanks,
MS student, IIT kharagpur
