hadoop-mapreduce-user mailing list archives

From Shumin Guo <gsmst...@gmail.com>
Subject Re: Splitting input file - increasing number of mappers
Date Sat, 06 Jul 2013 15:56:25 GMT
You also need to pay attention to the split boundary, because you don't
want one line split across different mappers. Maybe you can think about a
multi-line input format.
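One option along these lines is Hadoop's built-in NLineInputFormat, which always splits on line boundaries and hands each mapper a fixed number of lines. A minimal driver-side sketch (the job name and lines-per-split value here are arbitrary examples, not from the original thread):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

// Inside the job driver: NLineInputFormat never breaks a record across
// splits, and the lines-per-split knob indirectly controls how many
// mappers run for a given input file.
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "group similar urls");
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 100000); // arbitrary example value
```

This is a configuration fragment only; the rest of the driver (mapper, reducer, input/output paths) is omitted.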

On Jul 6, 2013 10:18 AM, "Sanjay Subramanian" <
Sanjay.Subramanian@wizecommerce.com> wrote:

>  More mappers will make it faster.
>  You can try this parameter:
>       mapreduce.input.fileinputformat.split.maxsize=<sizeinbytes>
>  This will control the input split size and force more mappers to run.
>  Also, your use case seems like a good candidate for defining a Combiner,
> because you are grouping keys based on a criterion. The only gotcha is that
> Combiners are not guaranteed to be called.
>  Give these a shot.
>  Good luck
>  sanjay
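The two suggestions above (capping the split size and registering a Combiner) can be sketched as driver-side calls. This is a hedged sketch assuming the org.apache.hadoop.mapreduce API; MyMapper, MyCombiner, and MyReducer are hypothetical class names standing in for the poster's actual classes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SimilarUrlDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cap split size at 16 MB so a ~64 MB input yields roughly 4 splits,
        // and therefore roughly 4 mappers instead of 1.
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize",
                     16L * 1024 * 1024);

        Job job = Job.getInstance(conf, "group similar urls");
        job.setJarByClass(SimilarUrlDriver.class);
        job.setMapperClass(MyMapper.class);      // hypothetical
        job.setCombinerClass(MyCombiner.class);  // may run zero or more times
        job.setReducerClass(MyReducer.class);    // hypothetical
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Because Combiners may be invoked zero, one, or many times, the Combiner class must be safe to apply repeatedly (typically the same class as the Reducer when the operation is associative and commutative).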
>   From: parnab kumar <parnab.2007@gmail.com>
> Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Date: Saturday, July 6, 2013 12:50 AM
> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Subject: Splitting input file - increasing number of mappers
>  Hi,
>          I have an input file where each line is of the form:
>             <URL> <A NUMBER>
>        URLs whose numbers are within a threshold are considered similar. My
> task is to group together all similar URLs. For this I wrote a *custom
> writable* where I implemented the threshold check in the *compareTo* method.
> Therefore, when Hadoop sorts, the similar URLs are grouped together. This
> seems to work fine.
>       I have the following queries:
>    1> Since I am relying heavily on the sort feature provided by Hadoop,
> am I decreasing efficiency in any way, or, given that sorting is something
> Hadoop does well, am I actually doing the right thing? If this is the right
> approach, then my job mostly relies on the map task. Therefore, will
> increasing the number of mappers increase efficiency?
>       2> My file size is not more than 64 MB, i.e. one Hadoop block, which
> means no more than one mapper will be invoked. Will splitting the file into
> smaller pieces increase efficiency by invoking more mappers?
>  Can someone kindly provide some insight or advice regarding the above.
>  Thanks,
> Parnab,
> MS student, IIT Kharagpur
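The threshold-based compareTo described above can be sketched in plain Java (this is an illustrative stand-in, not the actual Hadoop WritableComparable from the job; the class name, fields, and threshold value are all hypothetical). One caveat worth checking: "within a threshold" is not transitive, so such a compareTo does not define a strict total order, and the resulting grouping can depend on input order.

```java
// Plain-Java sketch of the threshold comparison idea. In the real job this
// logic would live in a WritableComparable's compareTo.
public class UrlRecord implements Comparable<UrlRecord> {
    static final long THRESHOLD = 10; // hypothetical similarity threshold

    final String url;
    final long number;

    UrlRecord(String url, long number) {
        this.url = url;
        this.number = number;
    }

    @Override
    public int compareTo(UrlRecord other) {
        // Treat two records as equal when their numbers are within
        // THRESHOLD, so sorting places "similar" records next to each other.
        long diff = this.number - other.number;
        if (Math.abs(diff) <= THRESHOLD) return 0;
        return diff < 0 ? -1 : 1;
    }

    public static void main(String[] args) {
        UrlRecord a = new UrlRecord("a.com", 100);
        UrlRecord b = new UrlRecord("b.com", 105);
        UrlRecord c = new UrlRecord("c.com", 300);
        System.out.println(a.compareTo(b)); // 0: within threshold, "similar"
        System.out.println(a.compareTo(c)); // -1: not similar, a sorts first
    }
}
```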
