Date: Wed, 10 Apr 2013 14:06:05 -0500
Subject: Re: MapReduce: Reducers partitions.
From: Graeme Wallace <graeme.wallace@farecompare.com>
To: "user@hbase.apache.org" <user@hbase.apache.org>
Content-Type: multipart/alternative; boundary=20cf301cbed69ffa2004da065c8b

--20cf301cbed69ffa2004da065c8b
Content-Type: text/plain; charset=ISO-8859-1

Ok. Thanks.


On Wed, Apr 10, 2013 at 2:01 PM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi Graeme,
>
> No. The reducer simply writes to the table the same way a regular Put does.
> If a region grows large enough to require a split, it will be split, but
> a region split will not necessarily happen.
>
> In the use case described below, all 600 lines will "simply" go into the
> only region in the table and no split will occur.
>
> The goal is to partition the data for the reducers only, not in the table.
>
> JM
>
> 2013/4/10 Graeme Wallace <graeme.wallace@farecompare.com>
>
> > What's the behavior then if you return hash % num_reducers and you have no
> > splits defined? When the reducer writes to the table, does the region
> > server local to the reducer create a new region?
> >
> > Graeme
> >
> >
> > On Wed, Apr 10, 2013 at 1:26 PM, Jean-Marc Spaggiari <
> > jean-marc@spaggiari.org> wrote:
> >
> > > So.
> > >
> > > I looked at the code, and I have one comment/suggestion here.
> > >
> > > If the table we are outputting to has regions, then partitions are built
> > > around them, and that's fine. But if the table is totally empty with a
> > > single region, even if we call setNumReduceTasks with 2 or more, all the
> > > keys will go to the same first reducer because of this:
> > >     if (this.startKeys.length == 1){
> > >       return 0;
> > >     }
> > > I think it would be better to return something like
> > > keycrc % numPartitions instead. That would still allow the application
> > > to spread the reduce work over multiple nodes (racks) even if there is
> > > only one region in the table.
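
A minimal sketch of the fallback suggested above, in plain Java so it runs
without Hadoop or HBase on the classpath. The class and method names are
hypothetical, regionCount stands in for startKeys.length from the quoted
snippet, and this is not the actual HRegionPartitioner source. It only shows
how a CRC32 of the row key could spread keys over reducers when the table has
a single region.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical illustration; not the actual HRegionPartitioner code.
final class SingleRegionFallback {

    // Map a row key to a reducer using the CRC32 of its bytes.
    static int crcPartition(byte[] rowKey, int numPartitions) {
        CRC32 crc = new CRC32();
        crc.update(rowKey, 0, rowKey.length);
        // getValue() is a non-negative long, so the remainder is non-negative.
        return (int) (crc.getValue() % numPartitions);
    }

    // regionCount is an assumed stand-in for startKeys.length.
    static int getPartition(byte[] rowKey, int regionCount, int numPartitions) {
        if (regionCount == 1) {
            // Proposed behaviour: hash the key instead of always returning 0.
            return crcPartition(rowKey, numPartitions);
        }
        // Region-based partitioning is deliberately left out of this sketch.
        throw new UnsupportedOperationException("multi-region case not sketched");
    }

    public static void main(String[] args) {
        for (String k : new String[] {"row-001", "row-002", "row-003", "row-004"}) {
            byte[] key = k.getBytes(StandardCharsets.UTF_8);
            System.out.println(k + " -> reducer " + getPartition(key, 1, 4));
        }
    }
}
```

Because the result depends only on the key bytes, every record with the same
key still lands on the same reducer, which is the property a partitioner has
to preserve.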
> > >
> > > In my use case, I have millions of lines producing some statistics. At
> > > the end, I will have only about 600 lines, but it will take a lot of map
> > > and reduce time to go from millions to 600, which is why I want more
> > > than one reducer. However, with only 600 lines, it's very difficult to
> > > pre-split the table. The keys are all very close.
> > >
> > > Does anyone see anything wrong with changing this default behaviour when
> > > startKeys.length == 1? If not, I will open a JIRA and upload a patch.
> > >
> > > JM
> > >
> > > 2013/4/10 Jean-Marc Spaggiari <jean-marc@spaggiari.org>
> > >
> > > > Thanks Ted.
> > > >
> > > > That's exactly where I was looking just now. I was close. I will take
> > > > a deeper look.
> > > >
> > > > Thanks Nitin for the link. I will read that too.
> > > >
> > > > JM
> > > >
> > > > 2013/4/10 Nitin Pawar <nitinpawar432@gmail.com>
> > > >
> > > >> To add to what Ted said,
> > > >>
> > > >> the same discussion happened on the question Jean asked
> > > >>
> > > >> https://issues.apache.org/jira/browse/HBASE-1287
> > > >>
> > > >>
> > > >> On Wed, Apr 10, 2013 at 7:28 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > >>
> > > >> > Jean-Marc:
> > > >> > Take a look at HRegionPartitioner, which is in both the mapred and
> > > >> > mapreduce packages:
> > > >> >
> > > >> >  * This is used to partition the output keys into groups of keys.
> > > >> >  * Keys are grouped according to the regions that currently exist
> > > >> >  * so that each reducer fills a single region so load is distributed.
> > > >> > Cheers
> > > >> >
> > > >> > On Wed, Apr 10, 2013 at 6:54 AM, Jean-Marc Spaggiari <
> > > >> > jean-marc@spaggiari.org> wrote:
> > > >> >
> > > >> > > Hi Nitin,
> > > >> > >
> > > >> > > You got my question correctly.
> > > >> > >
> > > >> > > However, I'm wondering how this works when it's done in HBase. Do
> > > >> > > we have default partitioners that give the same guarantee that
> > > >> > > records mapping to one key go to the same reducer? Or do we have
> > > >> > > to implement this on our own?
> > > >> > >
> > > >> > > JM
> > > >> > >
> > > >> > > 2013/4/10 Nitin Pawar <nitinpawar432@gmail.com>:
> > > >> > > > I hope I understood what you are asking. If not, then pardon me :)
> > > >> > > > Here are a few lines from the Hadoop developer handbook:
> > > >> > > >
> > > >> > > > The *Partitioner* class determines which partition a given
> > > >> > > > (key, value) pair will go to. The default partitioner computes a
> > > >> > > > hash value for the key and assigns the partition based on this
> > > >> > > > result. It guarantees that all the records mapping to one key go
> > > >> > > > to the same reducer.
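
As a small illustration of the default behaviour described above, the
following plain-Java sketch uses the same arithmetic as Hadoop's
HashPartitioner: mask off the sign bit of hashCode() and take the remainder by
the number of reduce tasks. The class name and the sample keys are made up for
the example, and it does not depend on the Hadoop libraries.

```java
// Hypothetical stand-alone illustration of hash partitioning.
final class HashPartitionSketch {

    // Mask the sign bit so negative hash codes still give a valid index,
    // then reduce modulo the number of reduce tasks.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;
        // The repeated key is sent to the same reducer both times.
        for (String key : new String[] {"abc", "stats-2013", "abc"}) {
            System.out.println(key + " -> reducer " + getPartition(key, reducers));
        }
    }
}
```

The partition is a pure function of the key, so equal keys always reach the
same reducer. How evenly the keys spread depends on the quality of hashCode().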
> > > >> > > >
> > > >> > > > You can write your own custom partitioner as well.
> > > >> > > > Here is the link:
> > > >> > > >
> > > >> > > > http://developer.yahoo.com/hadoop/tutorial/module5.html#partitioning
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > > On Wed, Apr 10, 2013 at 6:19 PM, Jean-Marc Spaggiari <
> > > >> > > > jean-marc@spaggiari.org> wrote:
> > > >> > > >
> > > >> > > >> Hi,
> > > >> > > >>
> > > >> > > >> Quick question. How is the data from the map tasks partitioned
> > > >> > > >> for the reducers?
> > > >> > > >>
> > > >> > > >> If there is 1 reducer, it's easy, but if there are more, are all
> > > >> > > >> the same keys guaranteed to end up on the same reducer? Or not
> > > >> > > >> necessarily? If they are not, how can we provide a partitioning
> > > >> > > >> function?
> > > >> > > >>
> > > >> > > >> Thanks,
> > > >> > > >>
> > > >> > > >> JM
> > > >> > > >>
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > > --
> > > >> > > > Nitin Pawar
> > > >> > >
> > > >> >
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Nitin Pawar
> > > >>
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Graeme Wallace
> > CTO
> > FareCompare.com
> > O: 972 588 1414
> > M: 214 681 9018
> >
>


-- 
Graeme Wallace
CTO
FareCompare.com
O: 972 588 1414
M: 214 681 9018

--20cf301cbed69ffa2004da065c8b--