hbase-dev mailing list archives

From Weishung Chung <weish...@gmail.com>
Subject Re: [ANN]: HBaseWD: Distribute Sequential Writes in HBase
Date Fri, 13 May 2011 07:45:32 GMT
What's the status of this package? Is it mature enough?
I am using it in my project: I tried out the write method yesterday and am
going to incorporate it into the read method tomorrow.

On Wed, May 11, 2011 at 3:41 PM, Alex Baranau <alex.baranov.v@gmail.com> wrote:

> > The start/end rows may be written twice.
>
> Yeah, I know. I meant that the size of the startRow+stopRow data is
> "bearable" in the attribute value no matter how long they (the keys) are,
> since we are already OK with transferring them initially (i.e. we should
> be OK with transferring 2x as much).
>
> So, what about the suggestion of the sourceScan attribute value I
> mentioned? If you can tell me why it isn't sufficient in your case, I'd
> have more info to think about a better suggestion ;)
>
> > It is Okay to keep all versions of your patch in the JIRA.
> > Maybe the second should be named HBASE-3811-v2.patch
> > <https://issues.apache.org/jira/secure/attachment/12478694/HBASE-3811.patch>?
>
> np. Can do that. I just thought the patches can be sorted by date to find
> the final one (aka "convention over naming-rules").
>
> Alex.
>
> On Wed, May 11, 2011 at 11:13 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>
> > >> Though it might be ok, since we anyways "transfer" start/stop rows
> > >> with Scan object.
> > In the write() method, we now have:
> >     Bytes.writeByteArray(out, this.startRow);
> >     Bytes.writeByteArray(out, this.stopRow);
> > ...
> >       for (Map.Entry<String, byte[]> attr : this.attributes.entrySet()) {
> >         WritableUtils.writeString(out, attr.getKey());
> >         Bytes.writeByteArray(out, attr.getValue());
> >       }
> > The start/end rows may be written twice.
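The duplication Ted points out is easy to quantify: with a length-prefixed encoding, shipping the start/stop rows again inside an attribute value roughly doubles the bytes for the row bounds. A rough sketch, using a plain DataOutputStream as a stand-in for Hadoop's Bytes.writeByteArray (the real framing in HBase differs in detail):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class DoubleWriteSize {

    // Stand-in for Bytes.writeByteArray: a 4-byte length prefix, then the payload.
    static void writeByteArray(DataOutputStream out, byte[] b) throws IOException {
        out.writeInt(b.length);
        out.write(b);
    }

    // Serialized size of the row-bounds portion: once as the startRow/stopRow
    // fields, and optionally again inside an attribute value carrying both rows.
    public static int size(byte[] startRow, byte[] stopRow, boolean alsoAsAttribute) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            writeByteArray(out, startRow);
            writeByteArray(out, stopRow);
            if (alsoAsAttribute) {
                byte[] attrValue = new byte[startRow.length + stopRow.length];
                System.arraycopy(startRow, 0, attrValue, 0, startRow.length);
                System.arraycopy(stopRow, 0, attrValue, startRow.length, stopRow.length);
                writeByteArray(out, attrValue);
            }
            return buf.size();
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen with an in-memory stream
        }
    }
}
```

For two 10-byte rows this gives 28 bytes without the attribute and 52 with it: the payload bytes double, plus one extra length prefix.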
> >
> > Of course, you have full control over how to generate the unique ID for
> > the "sourceScan" attribute.
> >
> > It is Okay to keep all versions of your patch in the JIRA. Maybe the
> > second should be named HBASE-3811-v2.patch
> > <https://issues.apache.org/jira/secure/attachment/12478694/HBASE-3811.patch>?
> >
> > Thanks
> >
> >
> > On Wed, May 11, 2011 at 1:01 PM, Alex Baranau <alex.baranov.v@gmail.com> wrote:
> >
> >> > Can you remove the first version ?
> >> Isn't it OK to keep it in the JIRA issue?
> >>
> >>
> >> > In HBaseWD, can you use reflection to detect whether Scan supports
> >> > setAttribute() ?
> >> > If it does, can you encode start row and end row as "sourceScan"
> >> > attribute ?
> >>
> >> Yeah, something like this is going to be implemented. Though I'd still
> >> want to hear from the devs the story about the Scan version.
> >>
> >>
> >> > One consideration is that start row or end row may be quite long.
> >>
> >> Yeah, that was my thought too at first. Though it might be OK, since we
> >> "transfer" start/stop rows with the Scan object anyway.
> >>
> >> > What do you think ?
> >>
> >> I'd love to hear from you whether the variant I mentioned is what we
> >> are looking at here:
> >>
> >>
> >> > From what I understand, you want to distinguish scans fired by the
> >> > same distributed scan. I.e. group scans which were fired by a single
> >> > distributed scan. If that's what you want, the distributed scan can
> >> > generate a unique ID and set, say, a "sourceScan" attribute to its
> >> > value. This way we'll have <# of distinct "sourceScan" attribute
> >> > values> = <number of distributed scans invoked by client side>, and
> >> > two scans on the server side will have the same "sourceScan"
> >> > attribute iff they "belong" to the same distributed scan.
> >>
> >>
> >> Alex Baranau
> >> ----
> >> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
> >>
> >> On Wed, May 11, 2011 at 5:15 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>
> >>> Alex:
> >>> Your second patch looks good.
> >>> Can you remove the first version ?
> >>>
> >>> In HBaseWD, can you use reflection to detect whether Scan supports
> >>> setAttribute() ?
> >>> If it does, can you encode start row and end row as "sourceScan"
> >>> attribute ?
> >>>
> >>> One consideration is that start row or end row may be quite long.
> >>> Ideally we should store the hash code of the source Scan object as the
> >>> "sourceScan" attribute. But Scan doesn't implement hashCode(). We can
> >>> add it, but that would require running all Scan-related tests.
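One way to get a compact value without touching Scan at all is to derive a fixed-size ID from the scan's start/stop rows. The helper below is hypothetical (not part of HBase or HBaseWD); it only illustrates the "hash instead of full rows" idea:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class SourceScanId {

    // Hypothetical helper: since Scan itself doesn't implement hashCode(),
    // derive a compact, deterministic 8-byte "sourceScan" value from the
    // original scan's start/stop rows instead of shipping both rows verbatim.
    public static byte[] compactId(byte[] startRow, byte[] stopRow) {
        long h = 31L * Arrays.hashCode(startRow) + Arrays.hashCode(stopRow);
        return ByteBuffer.allocate(8).putLong(h).array();
    }
}
```

The value is always 8 bytes regardless of key length, which addresses the "rows may be quite long" concern, at the cost of possible (rare) collisions between distinct scans.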
> >>>
> >>> What do you think ?
> >>>
> >>> Thanks
> >>>
> >>>
> >>> On Tue, May 10, 2011 at 5:46 AM, Alex Baranau <alex.baranov.v@gmail.com> wrote:
> >>>
> >>>> Sorry for the delay in response (public holidays here).
> >>>>
> >>>> This depends on what info you are looking for on server side.
> >>>>
> >>>> From what I understand, you want to distinguish scans fired by the
> >>>> same distributed scan. I.e. group scans which were fired by a single
> >>>> distributed scan. If that's what you want, the distributed scan can
> >>>> generate a unique ID and set, say, a "sourceScan" attribute to its
> >>>> value. This way we'll have <# of distinct "sourceScan" attribute
> >>>> values> = <number of distributed scans invoked by client side>, and
> >>>> two scans on the server side will have the same "sourceScan"
> >>>> attribute iff they "belong" to the same distributed scan.
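The tagging scheme described above can be sketched without any HBase classes. FakeScan below is only a stand-in for a Scan-like object with an attribute map (the real API under discussion is Scan.setAttribute(String, byte[]) from the HBASE-3811 patch):

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class SourceScanTagging {

    // Stand-in for a Scan-like object carrying named byte[] attributes.
    public static class FakeScan {
        private final Map<String, byte[]> attributes = new HashMap<>();
        public void setAttribute(String name, byte[] value) { attributes.put(name, value); }
        public byte[] getAttribute(String name) { return attributes.get(name); }
    }

    // Tag every per-bucket scan of one distributed scan with the same random
    // 16-byte ID, so the server side can group scans by the "sourceScan" value.
    public static FakeScan[] tag(int bucketsCount) {
        UUID id = UUID.randomUUID();
        byte[] idBytes = ByteBuffer.allocate(16)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
        FakeScan[] scans = new FakeScan[bucketsCount];
        for (int i = 0; i < bucketsCount; i++) {
            scans[i] = new FakeScan();
            scans[i].setAttribute("sourceScan", idBytes);
        }
        return scans;
    }
}
```

Two scans produced by one tag() call compare equal on the attribute; scans from different calls (i.e. different distributed scans) do not.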
> >>>>
> >>>> Is this what you are looking for?
> >>>>
> >>>> Alex Baranau
> >>>>
> >>>> P.S. attached patch for HBASE-3811
> >>>> <https://issues.apache.org/jira/browse/HBASE-3811>.
> >>>> P.S-2. should this conversation be moved to dev list?
> >>>>
> >>>> ----
> >>>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
> >>>>
> >>>> On Fri, May 6, 2011 at 12:06 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>>>
> >>>>> Alex:
> >>>>> What type of identification should we put in the map of the Scan
> >>>>> object ?
> >>>>> I am thinking of using the Id of the RowKeyDistributor. But the user
> >>>>> can use the same distributor on multiple scans.
> >>>>>
> >>>>> Please share your thoughts.
> >>>>>
> >>>>>
> >>>>> On Thu, Apr 21, 2011 at 8:32 AM, Alex Baranau <alex.baranov.v@gmail.com> wrote:
> >>>>>
> >>>>>> https://issues.apache.org/jira/browse/HBASE-3811
> >>>>>>
> >>>>>> Alex Baranau
> >>>>>> ----
> >>>>>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
> >>>>>>
> >>>>>> On Thu, Apr 21, 2011 at 5:57 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>>>>>
> >>>>>> > My plan was to make regions that have active scanners more
> >>>>>> > stable - trying not to move them when balancing.
> >>>>>> > I prefer the second approach - adding custom attribute(s) to Scan
> >>>>>> > so that the Scans created by the method below can be 'grouped'.
> >>>>>> >
> >>>>>> > If you can file a JIRA, that would be great.
> >>>>>> >
> >>>>>> > On Thu, Apr 21, 2011 at 7:23 AM, Alex Baranau <alex.baranov.v@gmail.com> wrote:
> >>>>>> >
> >>>>>> > > Aha, so you want to "count" it as a single scan (or just
> >>>>>> > > differently) when determining the load?
> >>>>>> > >
> >>>>>> > > The current code looks like this:
> >>>>>> > >
> >>>>>> > > class DistributedScanner:
> >>>>>> > >   public static DistributedScanner create(HTable hTable,
> >>>>>> > >       Scan original, AbstractRowKeyDistributor keyDistributor)
> >>>>>> > >       throws IOException {
> >>>>>> > >     byte[][] startKeys =
> >>>>>> > >         keyDistributor.getAllDistributedKeys(original.getStartRow());
> >>>>>> > >     byte[][] stopKeys =
> >>>>>> > >         keyDistributor.getAllDistributedKeys(original.getStopRow());
> >>>>>> > >     Scan[] scans = new Scan[startKeys.length];
> >>>>>> > >     for (byte i = 0; i < startKeys.length; i++) {
> >>>>>> > >       scans[i] = new Scan(original);
> >>>>>> > >       scans[i].setStartRow(startKeys[i]);
> >>>>>> > >       scans[i].setStopRow(stopKeys[i]);
> >>>>>> > >     }
> >>>>>> > >
> >>>>>> > >     ResultScanner[] rss = new ResultScanner[startKeys.length];
> >>>>>> > >     for (byte i = 0; i < scans.length; i++) {
> >>>>>> > >       rss[i] = hTable.getScanner(scans[i]);
> >>>>>> > >     }
> >>>>>> > >
> >>>>>> > >     return new DistributedScanner(rss);
> >>>>>> > >   }
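The method above fans one Scan out into N per-bucket ResultScanners wrapped by a single DistributedScanner. How those N streams are presented back as one is not shown in the thread; as an illustration only (not necessarily DistributedScanner's actual merge policy), a simple round-robin drain over plain iterators looks like this:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class RoundRobinMerge {

    // Present several per-bucket result streams as one stream by draining
    // them round-robin. Illustration only: HBaseWD's DistributedScanner
    // wraps real ResultScanners and may merge differently (e.g. by key order).
    public static <T> List<T> merge(List<Iterator<T>> scanners) {
        List<T> out = new ArrayList<>();
        boolean progressed = true;
        while (progressed) {
            progressed = false;
            for (Iterator<T> it : scanners) {
                if (it.hasNext()) {
                    out.add(it.next());
                    progressed = true;
                }
            }
        }
        return out;
    }
}
```

Note that any such interleaving gives up global key order across buckets; within a bucket, results still arrive in key order.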
> >>>>>> > >
> >>>>>> > > This is client code. To make these scans "identifiable" we need
> >>>>>> > > to either use some different class (derived from Scan) or add
> >>>>>> > > some attribute to them. There's no API for doing the latter. We
> >>>>>> > > can do the former, but I don't really like the idea of creating
> >>>>>> > > an extra class (with no extra functionality) just to distinguish
> >>>>>> > > it from the base one.
> >>>>>> > >
> >>>>>> > > If you can share why/how you want to treat them differently on
> >>>>>> > > the server side, that would be helpful.
> >>>>>> > >
> >>>>>> > > Alex Baranau
> >>>>>> > > ----
> >>>>>> > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
> >>>>>> > >
> >>>>>> > > On Thu, Apr 21, 2011 at 4:58 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>>>>> > >
> >>>>>> > > > My request would be to make the distributed scan identifiable
> >>>>>> > > > from the server side.
> >>>>>> > > > :-)
> >>>>>> > > >
> >>>>>> > > > On Thu, Apr 21, 2011 at 5:45 AM, Alex Baranau <alex.baranov.v@gmail.com> wrote:
> >>>>>> > > >
> >>>>>> > > > > > Basically bucketsCount may not equal the number of
> >>>>>> > > > > > regions for the underlying table.
> >>>>>> > > > >
> >>>>>> > > > > True: e.g. when there's only one region that holds the data
> >>>>>> > > > > for the whole table (not many records in the table yet), a
> >>>>>> > > > > distributed scan will fire N scans against the same region.
> >>>>>> > > > > On the other hand, in case there is a huge number of regions
> >>>>>> > > > > for a single table, each scan can span multiple regions.
> >>>>>> > > > >
> >>>>>> > > > > > I need to deal with normal scan and "distributed scan" at
> >>>>>> > > > > > server side.
> >>>>>> > > > >
> >>>>>> > > > > With the current implementation a "distributed" scan won't
> >>>>>> > > > > be recognized as something special on the server side. It
> >>>>>> > > > > will be an ordinary scan. Though the number of scans will
> >>>>>> > > > > increase, given that the typical situation is "many regions
> >>>>>> > > > > for a single table", the scans of the same "distributed
> >>>>>> > > > > scan" are likely not to hit the same region.
> >>>>>> > > > >
> >>>>>> > > > > Not sure if I answered your questions here. Feel free to
> >>>>>> > > > > ask more ;)
> >>>>>> > > > >
> >>>>>> > > > > Alex Baranau
> >>>>>> > > > > ----
> >>>>>> > > > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
> >>>>>> > > > >
> >>>>>> > > > > On Wed, Apr 20, 2011 at 2:10 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>>>>> > > > >
> >>>>>> > > > > > Alex:
> >>>>>> > > > > > If you read this, you would know why I asked:
> >>>>>> > > > > > https://issues.apache.org/jira/browse/HBASE-3679
> >>>>>> > > > > >
> >>>>>> > > > > > I need to deal with normal scan and "distributed scan" at
> >>>>>> > > > > > server side.
> >>>>>> > > > > > Basically bucketsCount may not equal the number of regions
> >>>>>> > > > > > for the underlying table.
> >>>>>> > > > > >
> >>>>>> > > > > > Cheers
> >>>>>> > > > > >
> >>>>>> > > > > > On Tue, Apr 19, 2011 at 11:11 PM, Alex Baranau <alex.baranov.v@gmail.com> wrote:
> >>>>>> > > > > >
> >>>>>> > > > > > > Hi Ted,
> >>>>>> > > > > > >
> >>>>>> > > > > > > We currently use this tool in the scenario where data
> >>>>>> > > > > > > is consumed by MapReduce jobs, so we haven't tested the
> >>>>>> > > > > > > performance of a pure "distributed scan" (i.e. N scans
> >>>>>> > > > > > > instead of 1) a lot. I expect it to be close to simple
> >>>>>> > > > > > > scan performance, or maybe sometimes even faster,
> >>>>>> > > > > > > depending on your data access patterns. E.g. in case
> >>>>>> > > > > > > you write time-series (sequential) data, which goes
> >>>>>> > > > > > > into a single region at a time, then if you access the
> >>>>>> > > > > > > delta for further processing/analysis (esp. from more
> >>>>>> > > > > > > than one client), these scans are likely to hit the
> >>>>>> > > > > > > same region or a couple of regions at a time, which may
> >>>>>> > > > > > > perform worse compared to many scans hitting data that
> >>>>>> > > > > > > is much better spread over region servers.
> >>>>>> > > > > > >
> >>>>>> > > > > > > As for a MapReduce job, the approach should not affect
> >>>>>> > > > > > > reading performance at all: it's just that there are
> >>>>>> > > > > > > bucketsCount times more splits and hence bucketsCount
> >>>>>> > > > > > > times more Map tasks. In many cases this even improves
> >>>>>> > > > > > > overall performance of the MR job since work is better
> >>>>>> > > > > > > distributed over the cluster (esp. when the aim is to
> >>>>>> > > > > > > constantly process the incoming delta, which usually
> >>>>>> > > > > > > resides in one or just a couple of regions depending on
> >>>>>> > > > > > > processing frequency).
> >>>>>> > > > > > >
> >>>>>> > > > > > > If you can share details on your case, that will help
> >>>>>> > > > > > > to understand what effect(s) to expect from using this
> >>>>>> > > > > > > approach.
> >>>>>> > > > > > >
> >>>>>> > > > > > > Alex Baranau
> >>>>>> > > > > > > ----
> >>>>>> > > > > > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
> >>>>>> > > > > > >
> >>>>>> > > > > > > On Wed, Apr 20, 2011 at 8:17 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>>>>> > > > > > >
> >>>>>> > > > > > > > Interesting project, Alex.
> >>>>>> > > > > > > > Since there are bucketsCount scanners compared to
> >>>>>> > > > > > > > one scanner originally, have you performed load
> >>>>>> > > > > > > > testing to see the impact ?
> >>>>>> > > > > > > >
> >>>>>> > > > > > > > Thanks
> >>>>>> > > > > > > >
> >>>>>> > > > > > > > On Tue, Apr 19, 2011 at 10:25 AM, Alex Baranau <alex.baranov.v@gmail.com> wrote:
> >>>>>> > > > > > > >
> >>>>>> > > > > > > > > Hello guys,
> >>>>>> > > > > > > > >
> >>>>>> > > > > > > > > I'd like to introduce a new small Java project/lib
> >>>>>> > > > > > > > > around HBase: HBaseWD. It is aimed to help with
> >>>>>> > > > > > > > > distribution of the load (across regionservers)
> >>>>>> > > > > > > > > when writing sequential (because of the row key
> >>>>>> > > > > > > > > nature) records. It implements the solution which
> >>>>>> > > > > > > > > was discussed several times on this mailing list
> >>>>>> > > > > > > > > (e.g. here: http://search-hadoop.com/m/gNRA82No5Wk).
> >>>>>> > > > > > > > >
> >>>>>> > > > > > > > > Please find the sources at
> >>>>>> > > > > > > > > https://github.com/sematext/HBaseWD (there's also a
> >>>>>> > > > > > > > > jar of the current version for convenience). It is
> >>>>>> > > > > > > > > very easy to make use of it: e.g. I added it to one
> >>>>>> > > > > > > > > existing project with 1+2 lines of code (one where
> >>>>>> > > > > > > > > I write to HBase and 2 for configuring the
> >>>>>> > > > > > > > > MapReduce job).
> >>>>>> > > > > > > > >
> >>>>>> > > > > > > > > Any feedback is highly appreciated!
> >>>>>> > > > > > > > >
> >>>>>> > > > > > > > > Please find below the short intro to the lib [1].
> >>>>>> > > > > > > > >
> >>>>>> > > > > > > > > Alex Baranau
> >>>>>> > > > > > > > > ----
> >>>>>> > > > > > > > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
> >>>>>> > > > > > > > >
> >>>>>> > > > > > > > > [1]
> >>>>>> > > > > > > > >
> >>>>>> > > > > > > > > Description:
> >>>>>> > > > > > > > > ------------
> >>>>>> > > > > > > > > HBaseWD stands for Distributing (sequential)
> >>>>>> > > > > > > > > Writes. It was inspired by discussions on the HBase
> >>>>>> > > > > > > > > mailing lists around the problem of choosing
> >>>>>> > > > > > > > > between:
> >>>>>> > > > > > > > > * writing records with sequential row keys (e.g.
> >>>>>> > > > > > > > >   time-series data with the row key built based on ts)
> >>>>>> > > > > > > > > * using random unique IDs for records
> >>>>>> > > > > > > > >
> >>>>>> > > > > > > > > The first approach makes it possible to perform
> >>>>>> > > > > > > > > fast range scans with the help of setting
> >>>>>> > > > > > > > > start/stop keys on the Scanner, but creates a
> >>>>>> > > > > > > > > single region server hot-spotting problem upon
> >>>>>> > > > > > > > > writing data (as row keys go in sequence, all
> >>>>>> > > > > > > > > records end up written into a single region at a
> >>>>>> > > > > > > > > time).
> >>>>>> > > > > > > > >
> >>>>>> > > > > > > > > The second approach aims for the fastest writing
> >>>>>> > > > > > > > > performance by distributing new records over random
> >>>>>> > > > > > > > > regions, but makes fast range scans against the
> >>>>>> > > > > > > > > written data impossible.
> >>>>>> > > > > > > > >
> >>>>>> > > > > > > > > The suggested approach sits in the middle of the
> >>>>>> > > > > > > > > two above and has proved to perform well by
> >>>>>> > > > > > > > > distributing records over the cluster during writes
> >>>>>> > > > > > > > > while still allowing range scans over the data.
> >>>>>> > > > > > > > > HBaseWD provides a very simple API to work with,
> >>>>>> > > > > > > > > which makes it easy to use with existing code.
> >>>>>> > > > > > > > >
> >>>>>> > > > > > > > > Please refer to the unit tests for lib usage info,
> >>>>>> > > > > > > > > as they are meant to act as examples.
> >>>>>> > > > > > > > >
> >>>>>> > > > > > > > > Brief Usage Info (Examples):
> >>>>>> > > > > > > > > ----------------------------
> >>>>>> > > > > > > > >
> >>>>>> > > > > > > > > Distributing records with sequential keys which
> >>>>>> > > > > > > > > are being written in up to Byte.MAX_VALUE buckets:
> >>>>>> > > > > > > > >
> >>>>>> > > > > > > > >   byte bucketsCount = (byte) 32; // distributing into 32 buckets
> >>>>>> > > > > > > > >   RowKeyDistributor keyDistributor =
> >>>>>> > > > > > > > >       new RowKeyDistributorByOneBytePrefix(bucketsCount);
> >>>>>> > > > > > > > >   for (int i = 0; i < 100; i++) {
> >>>>>> > > > > > > > >     Put put = new Put(keyDistributor.getDistributedKey(originalKey));
> >>>>>> > > > > > > > >     ... // add values
> >>>>>> > > > > > > > >     hTable.put(put);
> >>>>>> > > > > > > > >   }
> >>>>>> > > > > > > > >
> >>>>>> > > > > > > > >
> >>>>>> > > > > > > > > Performing a range scan over written data
> >>>>>> > > > > > > > > (internally <bucketsCount> scanners are executed):
> >>>>>> > > > > > > > >
> >>>>>> > > > > > > > >   Scan scan = new Scan(startKey, stopKey);
> >>>>>> > > > > > > > >   ResultScanner rs = DistributedScanner.create(hTable, scan,
> >>>>>> > > > > > > > >       keyDistributor);
> >>>>>> > > > > > > > >   for (Result current : rs) {
> >>>>>> > > > > > > > >     ...
> >>>>>> > > > > > > > >   }
> >>>>>> > > > > > > > >
> >>>>>> > > > > > > > > Performing a mapreduce job over the written data
> >>>>>> > > > > > > > > chunk specified by a Scan:
> >>>>>> > > > > > > > >
> >>>>>> > > > > > > > >   Configuration conf = HBaseConfiguration.create();
> >>>>>> > > > > > > > >   Job job = new Job(conf, "testMapreduceJob");
> >>>>>> > > > > > > > >
> >>>>>> > > > > > > > >   Scan scan = new Scan(startKey, stopKey);
> >>>>>> > > > > > > > >
> >>>>>> > > > > > > > >   TableMapReduceUtil.initTableMapperJob("table", scan,
> >>>>>> > > > > > > > >       RowCounterMapper.class, ImmutableBytesWritable.class,
> >>>>>> > > > > > > > >       Result.class, job);
> >>>>>> > > > > > > > >
> >>>>>> > > > > > > > >   // Substituting standard TableInputFormat which was set in
> >>>>>> > > > > > > > >   // TableMapReduceUtil.initTableMapperJob(...)
> >>>>>> > > > > > > > >   job.setInputFormatClass(WdTableInputFormat.class);
> >>>>>> > > > > > > > >   keyDistributor.addInfo(job.getConfiguration());
> >>>>>> > > > > > > > >
> >>>>>> > > > > > > > > Extending Row Keys Distributing Patterns:
> >>>>>> > > > > > > > > -----------------------------------------
> >>>>>> > > > > > > > >
> >>>>>> > > > > > > > > HBaseWD is designed to be flexible and to support
> >>>>>> > > > > > > > > custom row key distribution approaches. To define
> >>>>>> > > > > > > > > custom row key distributing logic just implement
> >>>>>> > > > > > > > > the AbstractRowKeyDistributor abstract class, which
> >>>>>> > > > > > > > > is really very simple:
> >>>>>> > > > > > > > >
> >>>>>> > > > > > > > >   public abstract class AbstractRowKeyDistributor
> >>>>>> > > > > > > > >       implements Parametrizable {
> >>>>>> > > > > > > > >     public abstract byte[] getDistributedKey(byte[] originalKey);
> >>>>>> > > > > > > > >     public abstract byte[] getOriginalKey(byte[] adjustedKey);
> >>>>>> > > > > > > > >     public abstract byte[][] getAllDistributedKeys(byte[] originalKey);
> >>>>>> > > > > > > > >     ... // some utility methods
> >>>>>> > > > > > > > >   }
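As an illustration of implementing that contract, here is a standalone re-creation of the one-byte-prefix idea. This is hypothetical code, not HBaseWD's actual RowKeyDistributorByOneBytePrefix (which also implements Parametrizable and may pick the bucket differently); the hash-based bucket choice is an assumption:

```java
import java.util.Arrays;

public class OneBytePrefixDistributor {

    private final byte bucketsCount;

    public OneBytePrefixDistributor(byte bucketsCount) {
        this.bucketsCount = bucketsCount;
    }

    // Prepend a bucket byte derived from the original key, so sequential
    // keys spread over <bucketsCount> key ranges. (Bucket choice here is an
    // assumption: a hash of the key; the real lib may use a different rule.)
    public byte[] getDistributedKey(byte[] originalKey) {
        byte bucket = (byte) Math.floorMod(Arrays.hashCode(originalKey), bucketsCount);
        return withPrefix(bucket, originalKey);
    }

    // Strip the bucket byte to recover the key the client originally wrote.
    public byte[] getOriginalKey(byte[] adjustedKey) {
        return Arrays.copyOfRange(adjustedKey, 1, adjustedKey.length);
    }

    // All possible distributed forms of a key: one per bucket. This is what
    // lets a distributed scan fan a single range out into N per-bucket scans.
    public byte[][] getAllDistributedKeys(byte[] originalKey) {
        byte[][] keys = new byte[bucketsCount][];
        for (byte i = 0; i < bucketsCount; i++) {
            keys[i] = withPrefix(i, originalKey);
        }
        return keys;
    }

    private static byte[] withPrefix(byte bucket, byte[] key) {
        byte[] result = new byte[key.length + 1];
        result[0] = bucket;
        System.arraycopy(key, 0, result, 1, key.length);
        return result;
    }
}
```

The key property to preserve in any custom distributor is the round-trip: getOriginalKey(getDistributedKey(k)) must return k, and getAllDistributedKeys(k) must cover every form getDistributedKey could produce.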
> >>>>>> > > > > > > > >
