hbase-user mailing list archives

From Amandeep Khurana <ama...@gmail.com>
Subject Re: getSplits() in TableInputFormatBase
Date Sun, 11 Apr 2010 09:27:18 GMT
You have 1 region per table, and that's why you are getting 1 split when you
scan any of those tables...

Moreover, the number of map tasks configuration is ignored when you are
running in pseudo-distributed mode, since the job tracker is local.



Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Sun, Apr 11, 2010 at 2:23 AM, john smith <js1987.smith@gmail.com> wrote:

> Amandeep,
>
> No. I have 3 tables: A, B, C. Does the number of regions (5) include 1 region
> from each of META and ROOT also?
>
> I should get numSplits = 3 (total number of user regions), but I am
> getting 1.
>
> Thanks
>
> On Sun, Apr 11, 2010 at 2:40 PM, Amandeep Khurana <amansk@gmail.com>
> wrote:
>
> > 3 tables? Are you counting ROOT and META also?
> >
> >
> > Amandeep Khurana
> > Computer Science Graduate Student
> > University of California, Santa Cruz
> >
> >
> > On Sun, Apr 11, 2010 at 1:57 AM, john smith <js1987.smith@gmail.com>
> > wrote:
> >
> > > From the web interface...
> > >
> > >
> > > number of regions = 5
> > > number of tables = 3
> > >
> > > Thanks
> > >
> > >
> > > On Sun, Apr 11, 2010 at 2:23 PM, Amandeep Khurana <amansk@gmail.com>
> > > wrote:
> > >
> > > > How many regions do you have?
> > > >
> > > >
> > > > Amandeep Khurana
> > > > Computer Science Graduate Student
> > > > University of California, Santa Cruz
> > > >
> > > >
> > > > On Sun, Apr 11, 2010 at 1:39 AM, john smith <js1987.smith@gmail.com>
> > > > wrote:
> > > >
> > > > > Amandeep,
> > > > >
> > > > > Thanks for the explanation. What is the default value for the number
> > > > > of maps? Is it not equal to the number of regions?
> > > > >
> > > > > Right now I am running HBase in pseudo-distributed mode. If I set the
> > > > > number of map tasks to 100000 (some big number),
> > > > >
> > > > > I get numSplits = 1.
> > > > >
> > > > > If I don't set anything, numSplits = 2.
> > > > >
> > > > >
> > > > > Can you explain this?
> > > > >
> > > > > Thanks
> > > > > j.S
> > > > >
> > > > > On Sun, Apr 11, 2010 at 1:50 PM, Amandeep Khurana <amansk@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > If you set the number of map tasks as a higher number than the
> > > > > > number of regions (I generally set it to 100000 or something like
> > > > > > that), the number of splits = number of regions. If you keep it
> > > > > > lower, then it combines regions in a single split.
> > > > > >
> > > > > >
> > > > > > Amandeep Khurana
> > > > > > Computer Science Graduate Student
> > > > > > University of California, Santa Cruz
> > > > > >
> > > > > >
> > > > > > On Sun, Apr 11, 2010 at 1:15 AM, john smith <js1987.smith@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Amandeep,
> > > > > > >
> > > > > > > I guess that is not true. See the explanation in the docs:
> > > > > > >
> > > > > > >
> > > > > > > "Splits are created in number equal to the smallest between
> > > > > > > numSplits and the number of HRegions in the table. If the number
> > > > > > > of splits is smaller than the number of HRegions then splits are
> > > > > > > spanned across multiple HRegions and are grouped the most evenly
> > > > > > > possible. In the case splits are uneven the bigger splits are
> > > > > > > placed first in the InputSplit array."
> > > > > > >
> > > > > > > HRegion:
> > > > > > > http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/regionserver/HRegion.html
> > > > > > >
> > > > > > >
> > > > > > > Depending on whether numSplits < (or >) the number of regions, it
> > > > > > > chooses the real number of splits, and the same is done in the code:
> > > > > > >
> > > > > > > // Code
> > > > > > > int realNumSplits = numSplits > startKeys.length ? startKeys.length : numSplits;
> > > > > > >
> > > > > > > Here startKeys.length is the number of regions...
> > > > > > >
> > > > > > > Am I right?
> > > > > > >
> > > > > > > Thanks
> > > > > > > j.S
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Sun, Apr 11, 2010 at 1:33 PM, Amandeep Khurana <amansk@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > The number of splits is equal to the number of regions...
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Sun, Apr 11, 2010 at 12:54 AM, john smith <js1987.smith@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > In the method
> > > > > > > > >
> > > > > > > > > public org.apache.hadoop.mapred.InputSplit[] getSplits(
> > > > > > > > >     org.apache.hadoop.mapred.JobConf job, int numSplits)
> > > > > > > > >
> > > > > > > > > how is "numSplits" decided? I've seen different values of
> > > > > > > > > numSplits for different MR jobs. Any reason for this?
> > > > > > > > >
> > > > > > > > > Also, what if I ignore numSplits and always split at region
> > > > > > > > > boundaries? I guess that splitting at region boundaries makes
> > > > > > > > > more sense and somewhat improves data locality.
> > > > > > > > >
> > > > > > > > > Any comments on the above statement?
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > >
> > > > > > > > > j.S
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
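
For readers following the thread, the rule quoted from the docs (take the smaller of numSplits and the number of regions, group regions as evenly as possible, bigger splits first) can be sketched in plain Java. This is only an illustration under stated assumptions, not the HBase source: the class SplitSketch and the method splitSizes are hypothetical names, numRegions stands in for startKeys.length, and the handling of zero regions or a hint below 1 is a guess made for the sketch.

```java
import java.util.Arrays;

/** Illustrative sketch of the split-count rule discussed in the thread (not HBase code). */
final class SplitSketch {

    /**
     * Returns how many regions each split would cover.
     * numSplits is the hint passed to getSplits(); numRegions plays the role
     * of startKeys.length in the line quoted in the thread.
     */
    static int[] splitSizes(int numSplits, int numRegions) {
        if (numRegions <= 0) {
            return new int[0]; // assumption: no regions means no splits
        }
        int requested = Math.max(1, numSplits); // assumption: treat hints below 1 as 1

        // Same comparison as the quoted line: use the smaller of the two values.
        int realNumSplits = requested > numRegions ? numRegions : requested;

        // Spread regions as evenly as possible; the first `extra` splits get one more.
        int base = numRegions / realNumSplits;
        int extra = numRegions % realNumSplits;
        int[] sizes = new int[realNumSplits];
        for (int i = 0; i < realNumSplits; i++) {
            sizes[i] = base + (i < extra ? 1 : 0);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // One region per table, large hint: a single split covering that region.
        System.out.println(Arrays.toString(splitSizes(100000, 1))); // prints [1]
        // Five regions but only two splits requested: the bigger split comes first.
        System.out.println(Arrays.toString(splitSizes(2, 5)));      // prints [3, 2]
    }
}
```

With a table that has a single region, any hint of 1 or more yields exactly one split, which matches the numSplits = 1 result reported above.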
