hbase-user mailing list archives

From Ryan Rawson <ryano...@gmail.com>
Subject Re: RE: getSplits question
Date Thu, 10 Feb 2011 16:32:09 GMT
Yep, you're right on there.
On Feb 10, 2011 8:15 AM, "Michael Segel" <michael_segel@hotmail.com> wrote:
>
> Ryan,
>
> Just to point out the obvious...
>
> On smaller tables where you don't get enough parallelism, you can manually force the table's regions to be split.
> My understanding is that if/when the table grows, it will then go back to splitting normally.
>
> This way, if you have a 'small' lookup table that is relatively static, you can manually split it to the 'right' size for your cloud.
> If you are seeding a system, you can do the splits to get good parallelism and avoid overloading a single region with inserts, then let the table go back to its normal growth pattern and splits.
>
> This would solve the OP's issue and, as you point out, avoid any worry about getSplits().
>
> Does this make sense, or am I missing something?
>
> -Mike
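
[Editor's note: manually pre-splitting, as Mike describes, needs a list of split keys. A minimal plain-Java sketch of computing evenly spaced split keys is below — `evenSplitKeys` is a hypothetical helper, not part of the HBase API, and real row keys would need a wider key space than the single byte used here:]

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: compute evenly spaced split keys so a small table can be
// pre-split into `numRegions` regions. Keys here are single bytes over
// [0x00, 0xFF); real row keys would need a wider key space.
public class SplitKeys {
    static List<byte[]> evenSplitKeys(int numRegions) {
        List<byte[]> keys = new ArrayList<byte[]>();
        // numRegions regions need numRegions - 1 boundary keys
        for (int i = 1; i < numRegions; i++) {
            int boundary = (256 * i) / numRegions;
            keys.add(new byte[] { (byte) boundary });
        }
        return keys;
    }

    public static void main(String[] args) {
        for (byte[] k : evenSplitKeys(4)) {
            System.out.println(k[0] & 0xFF); // 64, 128, 192
        }
    }
}
```

[The boundary keys would then be fed to whatever pre-split mechanism your HBase version exposes.]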
>
>> Date: Wed, 9 Feb 2011 23:54:19 -0800
>> Subject: Re: getSplits question
>> From: ryanobjc@gmail.com
>> To: user@hbase.apache.org
>> CC: hbase-user@hadoop.apache.org
>>
>> By default each map gets the contents of 1 region. A region is by
>> default a maximum of 256MB. There is no trivial way to generally
>> bisect a region in half, in terms of row count, given only what
>> we know (the start and end keys).
>>
>> For very large tables that have > 100 regions, this algorithm works
>> really well and you get some good parallelism. If you want to see a
>> lot of parallelism out of 1 region, you might have to work a lot
>> harder. Or reduce your region size and have more regions. Be warned,
>> though, that having more regions carries performance hits in other areas
>> (specifically server startup/shutdown/assignment times). So you
>> probably don't want 50,000 32MB regions.
>>
>> -ryan
>>
>> > On Wed, Feb 9, 2011 at 11:46 PM, Geoff Hendrey <ghendrey@decarta.com> wrote:
>> > Oh, I definitely don't *need* my own to run mapreduce. However, if I
>> > want to control the number of records handled by each mapper (the
>> > splitsize) and the startrow and endrow, then I thought I had to write
>> > my own getSplits(). Is there another way to accomplish this? I do need
>> > the combination of a controlled splitsize and start/end rows.
>> >
>> > -geoff
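
[Editor's note: one way to get both a controlled split size and start/end rows without dropping the last row of each batch is to make each split's exclusive end row the *first row of the next batch*, so the half-open [start, end) intervals tile the range with no gaps. A plain-Java sketch, with strings standing in for row keys — `computeSplits` is an illustrative helper, not the HBase API:]

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: chunk a sorted list of row keys into splits of `splitSize` rows.
// Each split is a half-open [start, end) interval whose end row is the
// start row of the next split, so no row falls between splits; the final
// split gets an empty end row, meaning "scan to the end of the range".
public class RowSplits {
    static List<String[]> computeSplits(List<String> sortedRows, int splitSize) {
        List<String[]> splits = new ArrayList<String[]>();
        for (int i = 0; i < sortedRows.size(); i += splitSize) {
            String start = sortedRows.get(i);
            int next = i + splitSize;
            // end is exclusive: the next batch's first row, or "" for open-ended
            String end = next < sortedRows.size() ? sortedRows.get(next) : "";
            splits.add(new String[] { start, end });
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<String>();
        for (char c = 'a'; c <= 'g'; c++) rows.add(String.valueOf(c));
        for (String[] s : computeSplits(rows, 3)) {
            System.out.println("[" + s[0] + ", " + s[1] + ")");
        }
        // prints [a, d), [d, g), [g, )
    }
}
```

[The same idea maps onto Geoff's scanner loop below: remember the first row of each `next(splitSize)` batch and use it as the previous split's end row.]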
>> >
>> > -----Original Message-----
>> > From: Ryan Rawson [mailto:ryanobjc@gmail.com]
>> > Sent: Wednesday, February 09, 2011 11:43 PM
>> > To: user@hbase.apache.org
>> > Cc: hbase-user@hadoop.apache.org
>> > Subject: Re: getSplits question
>> >
>> > You shouldn't need to write your own getSplits() method to run a map
>> > reduce, I never did at least...
>> >
>> > -ryan
>> >
>> > On Wed, Feb 9, 2011 at 11:36 PM, Geoff Hendrey <ghendrey@decarta.com> wrote:
>> >> Are endrows inclusive or exclusive? The docs say exclusive, but then
>> >> the question arises as to how to form the last split for getSplits().
>> >> The code below runs fine, but I believe it is omitting some rows,
>> >> perhaps b/c of the exclusive end row. For the final split, should the
>> >> endrow be null? I tried that, and got what appeared to be a final
>> >> split without an endrow at all. I would appreciate a pointer to the
>> >> correct implementation of getSplits() in which I can provide a
>> >> startrow, endrow, and splitsize. Apparently this isn't it :)
>> >>
>> >>
>> >>
>> >> int splitSize = context.getConfiguration().getInt("splitsize", 1000);
>> >> byte[] splitStop = null;
>> >> String hostname = null;
>> >>
>> >> while ((results = resultScanner.next(splitSize)).length > 0) {
>> >>     // System.out.println("results: -------------------------- " + results);
>> >>     byte[] splitStart = results[0].getRow();
>> >>     splitStop = results[results.length - 1].getRow();
>> >>     // I think this is a problem... we don't actually include this row
>> >>     // in the split since it's exclusive.. revisit this and correct
>> >>     HRegionLocation location = table.getRegionLocation(splitStart);
>> >>     hostname = location.getServerAddress().getHostname();
>> >>     InputSplit split = new TableSplit(table.getTableName(), splitStart,
>> >>         splitStop, hostname);
>> >>     splits.add(split);
>> >>     System.out.println("initializing splits: " + split.toString());
>> >> }
>> >>
>> >> resultScanner.close();
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> -g
>> >>
>> >>
>> >
>
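
[Editor's note: since end rows are exclusive, a common way to keep the last scanned row inside its split is to append a single zero byte to it, producing the smallest key that sorts strictly after it. A plain-Java sketch of that trick — the byte-wise comparison mirrors HBase's lexicographic row ordering; the helper names are illustrative:]

```java
import java.util.Arrays;

// Sketch: build an exclusive stop row that still includes `row` by
// appending one 0x00 byte -- the smallest key strictly greater than
// `row` in lexicographic byte order.
public class InclusiveStop {
    static byte[] exclusiveStopFor(byte[] row) {
        byte[] stop = Arrays.copyOf(row, row.length + 1);
        stop[row.length] = 0x00; // row < stop, and no key sorts between them
        return stop;
    }

    // Unsigned lexicographic compare, as HBase orders row keys
    static int compareRows(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] row = "row-0042".getBytes();
        byte[] stop = exclusiveStopFor(row);
        System.out.println(compareRows(row, stop) < 0); // true: row falls before stop
    }
}
```

[Applied to the loop above, `splitStop = exclusiveStopFor(lastRow)` would keep the batch's last row inside the split even though TableSplit's end row is exclusive.]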
