Date: Fri, 29 Mar 2013 09:17:13 +0530
Subject: Re: Find reducer for a key
From: Hemanth Yamijala
To: user@hadoop.apache.org

Hi,

The way I understand your requirement: you have a file that contains a set of keys. You want to read this file on every reducer and keep only those entries of the set whose keys correspond to the current reducer.

If the above summary is correct, can I assume that you are potentially reading the entire intermediate output key space on every reducer? Would that even work (considering memory constraints, etc.)?

It seemed to me that your solution is implementing what the framework can already do for you. That was the rationale behind my suggestion. Maybe you should try implementing both approaches to see which one works better for you.

Thanks
hemanth

On Thu, Mar 28, 2013 at 6:37 PM, Alberto Cordioli <cordioli.alberto@gmail.com> wrote:
> Yes, that is a possible solution.
> But since the MR job has another scope, the mappers already read other
> files (very large) and output tuples.
> You cannot control the number of mappers, and hence the risk is that a
> lot of mappers will be created, each of them also reading the other
> file, instead of a small number of reducers.
>
> Do you think that the solution I proposed is not so elegant or efficient?
>
> Alberto
>
> On 28 March 2013 13:12, Hemanth Yamijala <yhemanth@thoughtworks.com> wrote:
> > Hmm. That feels like a join. Can't you read the input file on the map side
> > and output those keys along with the original map output keys? That way the
> > reducer would automatically get both together.
> >
> >
> > On Thu, Mar 28, 2013 at 5:20 PM, Alberto Cordioli
> > <cordioli.alberto@gmail.com> wrote:
> >>
> >> Hi Hemanth,
> >>
> >> thanks for your reply.
> >> Yes, this partially answered my question. I know how the hash
> >> partitioner works and I guessed something similar.
> >> The piece that I missed was that mapred.task.partition returns the
> >> partition number of the reducer.
> >> So, putting all the pieces together, I understand that for each key in
> >> the file I have to call the HashPartitioner.
> >> Then I have to compare the returned index with the one retrieved by
> >> Configuration.getInt("mapred.task.partition").
> >> If they are equal, then such a key will be served by that reducer. Is this
> >> correct?
> >>
> >>
> >> To answer your question:
> >> On the reduce side of an MR job, I want to load some data from a file into
> >> an in-memory structure. Actually, I don't need to store the whole file
> >> for each reducer, but only the lines that are related to the keys a
> >> particular reducer will receive.
> >> So, my intention is to know the keys in the setup method, to store only
> >> the needed lines.
> >>
> >> Thanks,
> >> Alberto
> >>
> >>
> >> On 28 March 2013 11:01, Hemanth Yamijala <yhemanth@thoughtworks.com>
> >> wrote:
> >> > Hi,
> >> >
> >> > Not sure if I am answering your question, but this is the background.
> >> > Every MapReduce job has a partitioner associated with it. The default
> >> > partitioner is a HashPartitioner. You can, as a user, write your own
> >> > partitioner as well and plug it into the job. The partitioner is
> >> > responsible for splitting the map output key space among the reducers.
> >> >
> >> > So, the reducer a key will go to is basically the value returned by the
> >> > partitioner's getPartition method. For example, this is the code in the
> >> > HashPartitioner:
> >> >
> >> >   public int getPartition(K2 key, V2 value,
> >> >                           int numReduceTasks) {
> >> >     return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
> >> >   }
> >> >
> >> > mapred.task.partition is the key that defines the partition number of
> >> > this reducer.
> >> >
> >> > I guess you can piece together these bits into what you'd want.
> >> > However, I am interested in understanding why you want to know this.
> >> > Can you share some info?
> >> >
> >> > Thanks
> >> > Hemanth
> >> >
> >> >
> >> > On Thu, Mar 28, 2013 at 2:17 PM, Alberto Cordioli
> >> > <cordioli.alberto@gmail.com> wrote:
> >> >>
> >> >> Hi everyone,
> >> >>
> >> >> How can I know the keys that are associated with a particular reducer
> >> >> in the setup method?
> >> >> Let's assume that in the setup method I read from a file where each
> >> >> line is a string that will become a key emitted from the mappers.
> >> >> For each of these lines I would like to know whether the string will
> >> >> be a key associated with the current reducer or not.
> >> >>
> >> >> I read something about mapred.task.partition and mapred.task.id, but I
> >> >> didn't understand the usage.
> >> >>
> >> >>
> >> >> Thanks,
> >> >> Alberto
> >> >>
> >> >>
> >> >> --
> >> >> Alberto Cordioli
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Alberto Cordioli
> >
> >
>
>
>
> --
> Alberto Cordioli
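[Editor's note: the filtering scheme discussed in this thread — compute each side-file key's partition with the HashPartitioner formula and keep only those matching mapred.task.partition — can be reproduced outside Hadoop, since the partition depends only on the key's hashCode() and the reducer count. Below is a minimal, dependency-free sketch in plain Java. The class and method names, the sample keys, and the reducer count are all made up for illustration; in a real Reducer.setup() the partition number would come from Configuration.getInt("mapred.task.partition", -1) and the keys would be read from the side file.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PartitionFilterSketch {

    // Same arithmetic as Hadoop's HashPartitioner.getPartition:
    // mask off the sign bit, then take the remainder modulo the reducer count.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Keep only the keys whose computed partition matches this reducer's
    // partition number (the value mapred.task.partition would hold).
    static List<String> keysForThisReducer(List<String> keys,
                                           int thisPartition,
                                           int numReduceTasks) {
        List<String> mine = new ArrayList<>();
        for (String k : keys) {
            if (partitionFor(k, numReduceTasks) == thisPartition) {
                mine.add(k);
            }
        }
        return mine;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("apple", "banana", "cherry", "date");
        int numReduceTasks = 3;

        // Each key lands on exactly one partition, so filtering per partition
        // and summing the sizes must account for every key exactly once.
        int total = 0;
        for (int p = 0; p < numReduceTasks; p++) {
            total += keysForThisReducer(keys, p, numReduceTasks).size();
        }
        System.out.println(total == keys.size());
    }
}
```

One caveat worth noting: this only agrees with the job's actual routing if the job really uses the default HashPartitioner; with a custom partitioner plugged in, that partitioner's getPartition must be called instead.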