hbase-user mailing list archives

From Amandeep Khurana <ama...@gmail.com>
Subject Re: getSplits() in TableInputFormatBase
Date Sun, 11 Apr 2010 08:20:43 GMT
If you set the number of map tasks higher than the number of regions
(I generally set it to 100000 or something like that), the number of
splits equals the number of regions. If you set it lower, several regions
are combined into a single split.
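
To make the rule concrete, here is a minimal Java sketch of the arithmetic described in the quoted documentation: take the smaller of numSplits and the region count, then spread the regions across the splits as evenly as possible, with the larger splits first. This is not the HBase source. The class SplitSketch and the helper regionsPerSplit are hypothetical names used only for illustration, and the sketch models regions as a plain count rather than real start keys.

```java
import java.util.Arrays;

public class SplitSketch {

    // Hypothetical helper (not an HBase API): returns how many regions
    // each split would cover, given a region count and a requested
    // number of splits.
    static int[] regionsPerSplit(int numRegions, int numSplits) {
        if (numRegions <= 0 || numSplits <= 0) {
            throw new IllegalArgumentException("both arguments must be positive");
        }
        // Never create more splits than there are regions.
        int realNumSplits = Math.min(numSplits, numRegions);
        int base = numRegions / realNumSplits;      // regions every split gets
        int remainder = numRegions % realNumSplits; // leftover regions to hand out

        int[] sizes = new int[realNumSplits];
        for (int i = 0; i < realNumSplits; i++) {
            // The first `remainder` splits take one extra region,
            // so the bigger splits come first in the array.
            sizes[i] = (i < remainder) ? base + 1 : base;
        }
        return sizes;
    }

    public static void main(String[] args) {
        // numSplits far above the region count: one region per split.
        System.out.println(Arrays.toString(regionsPerSplit(10, 100000)));
        // [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

        // numSplits below the region count: regions are grouped.
        System.out.println(Arrays.toString(regionsPerSplit(10, 4)));
        // [3, 3, 2, 2]
    }
}
```

With 10 regions, a very large numSplits gives one region per split, while numSplits = 4 gives splits covering 3, 3, 2 and 2 regions.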


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Sun, Apr 11, 2010 at 1:15 AM, john smith <js1987.smith@gmail.com> wrote:

> Amandeep,
>
> I don't think that is true. See the explanation in the docs:
>
>
> "Splits are created in number equal to the smallest between numSplits and
> the number of HRegions in the table. If the number of splits is smaller
> than the number of HRegions then splits are spanned across multiple
> HRegions and are grouped the most evenly possible. In the case splits are
> uneven the bigger splits are placed first in the InputSplit array."
>
> (HRegion:
> http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/regionserver/HRegion.html)
>
>
> Depending on whether numSplits is smaller or larger than the number of
> regions, it chooses the real number of splits. The code does the same:
>
> // Code
>  int realNumSplits = numSplits > startKeys.length ? startKeys.length : numSplits;
>
> Here startKeys.length is the number of regions.
>
> Am I right?
>
> Thanks
> j.S
>
>
>
> On Sun, Apr 11, 2010 at 1:33 PM, Amandeep Khurana <amansk@gmail.com>
> wrote:
>
> > The number of splits is equal to the number of regions...
> >
> >
> >
> > On Sun, Apr 11, 2010 at 12:54 AM, john smith <js1987.smith@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > In the method
> > >
> > >   public org.apache.hadoop.mapred.InputSplit[] getSplits(
> > >       org.apache.hadoop.mapred.JobConf job, int numSplits)
> > >
> > > how is "numSplits" decided? I've seen different values of
> > > numSplits for different MR jobs. Is there a reason for this?
> > >
> > > Also, what if I ignore numSplits and always split at region
> > > boundaries? I would guess that splitting at region boundaries makes
> > > more sense and somewhat improves data locality.
> > >
> > > Any comments on the above statement?
> > >
> > > Thanks
> > >
> > > j.S
> > >
> >
>
