hbase-dev mailing list archives

From Weishung Chung <weish...@gmail.com>
Subject Re: [ANN]: HBaseWD: Distribute Sequential Writes in HBase
Date Thu, 19 May 2011 13:45:37 GMT
Awesome, I'm going to check it out and use it today. Thank you :)

On Thu, May 19, 2011 at 8:14 AM, Alex Baranau <alex.baranov.v@gmail.com> wrote:

> Implemented RowKeyDistributorByHashPrefix. From the README:
>
> Another useful RowKeyDistributor is RowKeyDistributorByHashPrefix. Please
> see the example below. It creates the "distributed key" based on the original
> key value, so that later, when you have the original key and want to update
> the record, you can calculate the distributed key without a roundtrip to HBase.
>
> AbstractRowKeyDistributor keyDistributor =
>         new RowKeyDistributorByHashPrefix(
>                   new RowKeyDistributorByHashPrefix.OneByteSimpleHash(15));
>
> You can use your own hashing logic here by implementing a simple interface:
>
> public static interface Hasher extends Parametrizable {
>   byte[] getHashPrefix(byte[] originalKey);
>   byte[][] getAllPossiblePrefixes();
> }
>
>
> OneByteSimpleHash implements a very simple hash algorithm: the sum of all
> bytes in the row key, modulo maxBuckets. In the example above, 15 is the
> maxBuckets count. You can use a buckets count of up to 255. Please use it
> wisely: as with the byOneByte prefix, DistributedScanner will instantiate
> this number of scans under the hood.
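>
> For illustration only, a standalone sketch of such a byte-sum hasher mirroring
> that contract (self-contained here; the lib's Parametrizable plumbing is omitted):
>
>   public static class ByteSumHash {
>     private final int maxBuckets;
>
>     public ByteSumHash(int maxBuckets) {
>       this.maxBuckets = maxBuckets; // e.g. 15, as in the example above
>     }
>
>     // one-byte prefix: sum of all key bytes, modulo maxBuckets
>     public byte[] getHashPrefix(byte[] originalKey) {
>       int sum = 0;
>       for (byte b : originalKey) {
>         sum += b;
>       }
>       return new byte[] { (byte) ((sum & 0x7fffffff) % maxBuckets) };
>     }
>
>     // all prefixes a distributed scanner would need to cover
>     public byte[][] getAllPossiblePrefixes() {
>       byte[][] prefixes = new byte[maxBuckets][];
>       for (int i = 0; i < maxBuckets; i++) {
>         prefixes[i] = new byte[] { (byte) i };
>       }
>       return prefixes;
>     }
>   }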
>
> With this row key hash-based distributor, you can find out the distributed
> key (and use it to update the record) without a roundtrip to HBase. From a
> unit test:
>
>     // Testing simple get
>     byte[] originalKey = new byte[] {123, 124, 122};
>
>     Put put = new Put(keyDistributor.getDistributedKey(originalKey));
>     put.add(CF, QUAL, Bytes.toBytes("some"));
>     hTable.put(put);
>
>     byte[] distributedKey = keyDistributor.getDistributedKey(originalKey);
>     Result result = hTable.get(new Get(distributedKey));
>     Assert.assertArrayEquals(originalKey,
>         keyDistributor.getOriginalKey(result.getRow()));
>     Assert.assertArrayEquals(Bytes.toBytes("some"),
>         result.getValue(CF, QUAL));
>
>
> NOTE: This feature is included in hbasewd-0.1.0-SNAPSHOT-2011.05.19.jar
> (downloadable from https://github.com/sematext/HBaseWD)
>
> Alex Baranau
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
>
> P.S.
> > Can you summarize HBaseWD in your blog
> That is on my todo list! You pushed it higher, into the top-priority items ;)
>
>
> On Thu, May 19, 2011 at 6:50 AM, Weishung Chung <weishung@gmail.com> wrote:
>
>> I have another question about option 2. It seems like I need to handle the
>> distributed scan differently to read from start row to end row, assuming a
>> 1-byte hash of the original key is used as the prefix, since the order of the
>> original key range is different from the resulting distributed key range.
>>
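>> I.e. presumably something like this merge step is needed (a sketch, assuming
>> the per-bucket ResultScanners and the keyDistributor from the examples above):
>>
>>   // Pick the next Result in original-key order across N bucket scanners.
>>   // nextOf[i] buffers the head of scanner i; null means "needs refill".
>>   static Result nextInOriginalOrder(ResultScanner[] rss, Result[] nextOf,
>>       AbstractRowKeyDistributor keyDistributor) throws IOException {
>>     int best = -1;
>>     for (int i = 0; i < rss.length; i++) {
>>       if (nextOf[i] == null) {
>>         nextOf[i] = rss[i].next(); // refill; stays null if exhausted
>>       }
>>       if (nextOf[i] != null && (best < 0
>>           || Bytes.compareTo(
>>               keyDistributor.getOriginalKey(nextOf[i].getRow()),
>>               keyDistributor.getOriginalKey(nextOf[best].getRow())) < 0)) {
>>         best = i;
>>       }
>>     }
>>     if (best < 0) {
>>       return null; // all scanners exhausted
>>     }
>>     Result result = nextOf[best];
>>     nextOf[best] = null;
>>     return result;
>>   }
>>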
>> On Wed, May 18, 2011 at 6:18 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>>
>> > Alex:
>> > Can you summarize HBaseWD in your blog, including points 1 and 2 below ?
>> >
>> > Thanks
>> >
>> > On Wed, May 18, 2011 at 8:03 AM, Alex Baranau <alex.baranov.v@gmail.com> wrote:
>> >
>> > > There are several options here. E.g.:
>> > >
>> > > 1) Given that you have the "original key" of the record, you can fetch the
>> > > stored record key from HBase and use it to create a Put with updated (or
>> > > new) cells.
>> > >
>> > > Currently you'll need to use a distributed scan for that; there's no
>> > > analogue for the Get operation yet (see
>> > > https://github.com/sematext/HBaseWD/issues/1).
>> > >
>> > > Note: you need to first find out the real key of the stored record by
>> > > fetching data from HBase in case you use the RowKeyDistributorByOneBytePrefix
>> > > included in the current lib. Alternatively, see the next option:
>> > >
>> > > 2) You can create your own RowKeyDistributor implementation which will
>> > > create the "distributed key" based on the original key value, so that later,
>> > > when you have the original key and want to update the record, you can
>> > > calculate the distributed key without a roundtrip to HBase.
>> > >
>> > > E.g. your RowKeyDistributor implementation can calculate a 1-byte hash of
>> > > the original key (https://github.com/sematext/HBaseWD/issues/2).
>> > >
>> > > Either way, you don't need to delete a record to update some of its cells
>> > > or add new cells.
>> > >
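>> > > For option 1, the update flow might look like this (a sketch only; CF/QUAL
>> > > and keyDistributor as in the unit test above):
>> > >
>> > >   // Find the stored (distributed) row by scanning the single-key range
>> > >   // [originalKey, originalKey + 0x00), then Put with the real stored key.
>> > >   Scan scan = new Scan(originalKey, Bytes.add(originalKey, new byte[] {0}));
>> > >   ResultScanner rs = DistributedScanner.create(hTable, scan, keyDistributor);
>> > >   for (Result r : rs) {
>> > >     Put put = new Put(r.getRow());           // the real stored key
>> > >     put.add(CF, QUAL, Bytes.toBytes("new")); // update/add cells
>> > >     hTable.put(put);
>> > >   }
>> > >   rs.close();
>> > >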
>> > > Please let me know if you have more Qs!
>> > >
>> > > Alex Baranau
>> > > ----
>> > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
>> > >
>> > > On Wed, May 18, 2011 at 1:19 AM, Weishung Chung <weishung@gmail.com> wrote:
>> > >
>> > > > I have another question. For overwriting, do I need to delete the
>> > > > existing one before re-writing it?
>> > > >
>> > > > On Sat, May 14, 2011 at 10:17 AM, Weishung Chung <weishung@gmail.com> wrote:
>> > > >
>> > > > > Yes, it's simple yet useful. I am integrating it. Thanks a lot :)
>> > > > >
>> > > > >
>> > > > > On Fri, May 13, 2011 at 3:12 PM, Alex Baranau <alex.baranov.v@gmail.com> wrote:
>> > > > >
>> > > > >> Thanks for the interest!
>> > > > >>
>> > > > >> We are using it in production. It is simple and hence quite stable.
>> > > > >> Though some minor pieces are missing (like
>> > > > >> https://github.com/sematext/HBaseWD/issues/1), this doesn't affect
>> > > > >> stability and/or major functionality.
>> > > > >>
>> > > > >> Alex Baranau
>> > > > >> ----
>> > > > >> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
>> > > > >>
>> > > > >> On Fri, May 13, 2011 at 10:45 AM, Weishung Chung <weishung@gmail.com> wrote:
>> > > > >>
>> > > > >> > What's the status on this package? Is it mature enough?
>> > > > >> > I am using it in my project: I tried out the write method yesterday
>> > > > >> > and am going to incorporate it into the read method tomorrow.
>> > > > >> >
>> > > > >> > On Wed, May 11, 2011 at 3:41 PM, Alex Baranau <alex.baranov.v@gmail.com> wrote:
>> > > > >> >
>> > > > >> > > > The start/end rows may be written twice.
>> > > > >> > >
>> > > > >> > > Yeah, I know. I meant that the size of the startRow+stopRow data is
>> > > > >> > > "bearable" in an attribute value no matter how long they (the keys)
>> > > > >> > > are, since we are already OK with transferring them initially (i.e.
>> > > > >> > > we should be OK with transferring 2x as much).
>> > > > >> > >
>> > > > >> > > So, what about the suggestion of the sourceScan attribute value I
>> > > > >> > > mentioned? If you can tell why it isn't sufficient in your case, I'd
>> > > > >> > > have more info to think about a better suggestion ;)
>> > > > >> > >
>> > > > >> > > > It is Okay to keep all versions of your patch in the JIRA.
>> > > > >> > > > Maybe the second should be named HBASE-3811-v2.patch
>> > > > >> > > > <https://issues.apache.org/jira/secure/attachment/12478694/HBASE-3811.patch>?
>> > > > >> > >
>> > > > >> > > np. Can do that. Just thought that they (the patches) can be sorted
>> > > > >> > > by date to find out the final one (aka "convention over naming-rules").
>> > > > >> > >
>> > > > >> > > Alex.
>> > > > >> > >
>> > > > >> > > On Wed, May 11, 2011 at 11:13 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>> > > > >> > >
>> > > > >> > > > >> Though it might be ok, since we anyways "transfer" start/stop
>> > > > >> > > > >> rows with Scan object.
>> > > > >> > > > In the write() method, we now have:
>> > > > >> > > >     Bytes.writeByteArray(out, this.startRow);
>> > > > >> > > >     Bytes.writeByteArray(out, this.stopRow);
>> > > > >> > > > ...
>> > > > >> > > >       for (Map.Entry<String, byte[]> attr : this.attributes.entrySet()) {
>> > > > >> > > >         WritableUtils.writeString(out, attr.getKey());
>> > > > >> > > >         Bytes.writeByteArray(out, attr.getValue());
>> > > > >> > > >       }
>> > > > >> > > > The start/end rows may be written twice.
>> > > > >> > > >
>> > > > >> > > > Of course, you have full control over how to generate the unique ID
>> > > > >> > > > for the "sourceScan" attribute.
>> > > > >> > > >
>> > > > >> > > > It is Okay to keep all versions of your patch in the JIRA. Maybe
>> > > > >> > > > the second should be named HBASE-3811-v2.patch
>> > > > >> > > > <https://issues.apache.org/jira/secure/attachment/12478694/HBASE-3811.patch>?
>> > > > >> > > >
>> > > > >> > > > Thanks
>> > > > >> > > >
>> > > > >> > > > On Wed, May 11, 2011 at 1:01 PM, Alex Baranau <alex.baranov.v@gmail.com> wrote:
>> > > > >> > > >
>> > > > >> > > >> > Can you remove the first version ?
>> > > > >> > > >> Isn't it ok to keep it in the JIRA issue?
>> > > > >> > > >>
>> > > > >> > > >> > In HBaseWD, can you use reflection to detect whether Scan supports
>> > > > >> > > >> > setAttribute() ?
>> > > > >> > > >> > If it does, can you encode start row and end row as "sourceScan"
>> > > > >> > > >> > attribute ?
>> > > > >> > > >>
>> > > > >> > > >> Yeah, something like this is going to be implemented. Though I'd
>> > > > >> > > >> still want to hear from the devs the story about the Scan version.
>> > > > >> > > >>
>> > > > >> > > >> > One consideration is that start row or end row may be quite long.
>> > > > >> > > >>
>> > > > >> > > >> Yeah, that was my thought too at first. Though it might be ok, since
>> > > > >> > > >> we anyways "transfer" start/stop rows with the Scan object.
>> > > > >> > > >>
>> > > > >> > > >> > What do you think ?
>> > > > >> > > >>
>> > > > >> > > >> I'd love to hear from you whether the variant I mentioned is what we
>> > > > >> > > >> are looking at here:
>> > > > >> > > >>
>> > > > >> > > >> > From what I understand, you want to distinguish scans fired by the
>> > > > >> > > >> > same distributed scan. I.e. group scans which were fired by a single
>> > > > >> > > >> > distributed scan. If that's what you want, the distributed scan can
>> > > > >> > > >> > generate a unique ID and set, say, a "sourceScan" attribute to its
>> > > > >> > > >> > value. This way we'll have <# of distinct "sourceScan" attribute
>> > > > >> > > >> > values> = <number of distributed scans invoked by client side>, and
>> > > > >> > > >> > two scans on server side will have the same "sourceScan" attribute
>> > > > >> > > >> > iff they "belong" to the same distributed scan.
>> > > > >> > > >>
>> > > > >> > > >> Alex Baranau
>> > > > >> > > >> ----
>> > > > >> > > >> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
>> > > > >> > > >>
>> > > > >> > > >> On Wed, May 11, 2011 at 5:15 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>> > > > >> > > >>
>> > > > >> > > >>> Alex:
>> > > > >> > > >>> Your second patch looks good.
>> > > > >> > > >>> Can you remove the first version ?
>> > > > >> > > >>>
>> > > > >> > > >>> In HBaseWD, can you use reflection to detect whether Scan supports
>> > > > >> > > >>> setAttribute() ?
>> > > > >> > > >>> If it does, can you encode start row and end row as a "sourceScan"
>> > > > >> > > >>> attribute ?
>> > > > >> > > >>>
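>> > > > >> > > >>> For illustration, such a check might look like this (a sketch
>> > > > >> > > >>> probing for the setAttribute(String, byte[]) signature discussed here):
>> > > > >> > > >>>
>> > > > >> > > >>>   import java.lang.reflect.Method;
>> > > > >> > > >>>   import org.apache.hadoop.hbase.client.Scan;
>> > > > >> > > >>>
>> > > > >> > > >>>   // True if this HBase version's Scan exposes
>> > > > >> > > >>>   // setAttribute(String, byte[]); false on older versions.
>> > > > >> > > >>>   public static boolean scanSupportsAttributes() {
>> > > > >> > > >>>     try {
>> > > > >> > > >>>       Method m = Scan.class.getMethod("setAttribute",
>> > > > >> > > >>>           String.class, byte[].class);
>> > > > >> > > >>>       return m != null;
>> > > > >> > > >>>     } catch (NoSuchMethodException e) {
>> > > > >> > > >>>       return false;
>> > > > >> > > >>>     }
>> > > > >> > > >>>   }
>> > > > >> > > >>>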
>> > > > >> > > >>> One consideration is that the start row or end row may be quite
>> > > > >> > > >>> long. Ideally we should store the hash code of the source Scan
>> > > > >> > > >>> object as the "sourceScan" attribute. But Scan doesn't implement
>> > > > >> > > >>> hashCode(). We can add it; that would require running all
>> > > > >> > > >>> Scan-related tests.
>> > > > >> > > >>>
>> > > > >> > > >>> What do you think ?
>> > > > >> > > >>>
>> > > > >> > > >>> Thanks
>> > > > >> > > >>>
>> > > > >> > > >>> On Tue, May 10, 2011 at 5:46 AM, Alex Baranau <alex.baranov.v@gmail.com> wrote:
>> > > > >> > > >>>
>> > > > >> > > >>>> Sorry for the delay in response (public holidays here).
>> > > > >> > > >>>>
>> > > > >> > > >>>> This depends on what info you are looking for on the server side.
>> > > > >> > > >>>>
>> > > > >> > > >>>> From what I understand, you want to distinguish scans fired by the
>> > > > >> > > >>>> same distributed scan, i.e. group scans which were fired by a single
>> > > > >> > > >>>> distributed scan. If that's what you want, the distributed scan can
>> > > > >> > > >>>> generate a unique ID and set, say, a "sourceScan" attribute to its
>> > > > >> > > >>>> value. This way we'll have <# of distinct "sourceScan" attribute
>> > > > >> > > >>>> values> = <number of distributed scans invoked by the client side>,
>> > > > >> > > >>>> and two scans on the server side will have the same "sourceScan"
>> > > > >> > > >>>> attribute iff they "belong" to the same distributed scan.
>> > > > >> > > >>>>
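>> > > > >> > > >>>> The tagging step might look like this (a sketch only; "sourceScan"
>> > > > >> > > >>>> is just the proposed attribute name, and setAttribute assumes the
>> > > > >> > > >>>> HBASE-3811 patch):
>> > > > >> > > >>>>
>> > > > >> > > >>>>   // One shared ID per distributed scan; all its bucket scans carry it.
>> > > > >> > > >>>>   String sourceScanId = java.util.UUID.randomUUID().toString();
>> > > > >> > > >>>>   for (Scan s : scans) {
>> > > > >> > > >>>>     s.setAttribute("sourceScan", Bytes.toBytes(sourceScanId));
>> > > > >> > > >>>>   }
>> > > > >> > > >>>>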
>> > > > >> > > >>>> Is this what you are looking for?
>> > > > >> > > >>>>
>> > > > >> > > >>>> Alex Baranau
>> > > > >> > > >>>>
>> > > > >> > > >>>> P.S. attached patch for HBASE-3811
>> > > > >> > > >>>> <https://issues.apache.org/jira/browse/HBASE-3811>.
>> > > > >> > > >>>> P.S-2. should this conversation be moved to the dev list?
>> > > > >> > > >>>>
>> > > > >> > > >>>> ----
>> > > > >> > > >>>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
>> > > > >> > > >>>>
>> > > > >> > > >>>> On Fri, May 6, 2011 at 12:06 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>> > > > >> > > >>>>
>> > > > >> > > >>>>> Alex:
>> > > > >> > > >>>>> What type of identification should we put in the map of the Scan
>> > > > >> > > >>>>> object ?
>> > > > >> > > >>>>> I am thinking of using the Id of the RowKeyDistributor. But the
>> > > > >> > > >>>>> user can use the same distributor on multiple scans.
>> > > > >> > > >>>>>
>> > > > >> > > >>>>> Please share your thoughts.
>> > > > >> > > >>>>>
>> > > > >> > > >>>>> On Thu, Apr 21, 2011 at 8:32 AM, Alex Baranau <alex.baranov.v@gmail.com> wrote:
>> > > > >> > > >>>>>
>> > > > >> > > >>>>>> https://issues.apache.org/jira/browse/HBASE-3811
>> > > > >> > > >>>>>>
>> > > > >> > > >>>>>> Alex Baranau
>> > > > >> > > >>>>>> ----
>> > > > >> > > >>>>>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
>> > > > >> > > >>>>>>
>> > > > >> > > >>>>>> On Thu, Apr 21, 2011 at 5:57 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>> > > > >> > > >>>>>>
>> > > > >> > > >>>>>> > My plan was to make regions that have active scanners more
>> > > > >> > > >>>>>> > stable - trying not to move them when balancing.
>> > > > >> > > >>>>>> > I prefer the second approach - adding custom attribute(s) to
>> > > > >> > > >>>>>> > Scan so that the Scans created by the method below can be
>> > > > >> > > >>>>>> > 'grouped'.
>> > > > >> > > >>>>>> >
>> > > > >> > > >>>>>> > If you can file a JIRA, that would be great.
>> > > > >> > > >>>>>> >
>> > > > >> > > >>>>>> > On Thu, Apr 21, 2011 at 7:23 AM, Alex Baranau <alex.baranov.v@gmail.com> wrote:
>> > > > >> > > >>>>>> >
>> > > > >> > > >>>>>> > > Aha, so you want to "count" it as a single scan (or just
>> > > > >> > > >>>>>> > > differently) when determining the load?
>> > > > >> > > >>>>>> > >
>> > > > >> > > >>>>>> > > The current code looks like this:
>> > > > >> > > >>>>>> > >
>> > > > >> > > >>>>>> > > class DistributedScanner:
>> > > > >> > > >>>>>> > >   public static DistributedScanner create(HTable hTable, Scan original,
>> > > > >> > > >>>>>> > >       AbstractRowKeyDistributor keyDistributor) throws IOException {
>> > > > >> > > >>>>>> > >     byte[][] startKeys =
>> > > > >> > > >>>>>> > >         keyDistributor.getAllDistributedKeys(original.getStartRow());
>> > > > >> > > >>>>>> > >     byte[][] stopKeys =
>> > > > >> > > >>>>>> > >         keyDistributor.getAllDistributedKeys(original.getStopRow());
>> > > > >> > > >>>>>> > >     Scan[] scans = new Scan[startKeys.length];
>> > > > >> > > >>>>>> > >     for (byte i = 0; i < startKeys.length; i++) {
>> > > > >> > > >>>>>> > >       scans[i] = new Scan(original);
>> > > > >> > > >>>>>> > >       scans[i].setStartRow(startKeys[i]);
>> > > > >> > > >>>>>> > >       scans[i].setStopRow(stopKeys[i]);
>> > > > >> > > >>>>>> > >     }
>> > > > >> > > >>>>>> > >
>> > > > >> > > >>>>>> > >     ResultScanner[] rss = new ResultScanner[startKeys.length];
>> > > > >> > > >>>>>> > >     for (byte i = 0; i < scans.length; i++) {
>> > > > >> > > >>>>>> > >       rss[i] = hTable.getScanner(scans[i]);
>> > > > >> > > >>>>>> > >     }
>> > > > >> > > >>>>>> > >
>> > > > >> > > >>>>>> > >     return new DistributedScanner(rss);
>> > > > >> > > >>>>>> > >   }
>> > > > >> > > >>>>>> > >
>> > > > >> > > >>>>>> > > This is client code. To make these scans "identifiable" we
>> > > > >> > > >>>>>> > > need to either use some different (derived from Scan) class or
>> > > > >> > > >>>>>> > > add some attribute to them. There's no API for doing the
>> > > > >> > > >>>>>> > > latter. We can do the former, but I don't really like the idea
>> > > > >> > > >>>>>> > > of creating an extra class (with no extra functionality) just
>> > > > >> > > >>>>>> > > to distinguish it from the base one.
>> > > > >> > > >>>>>> > >
>> > > > >> > > >>>>>> > > If you can share why/how you want to treat them differently on
>> > > > >> > > >>>>>> > > the server side, that would be helpful.
>> > > > >> > > >>>>>> > >
>> > > > >> > > >>>>>> > > Alex Baranau
>> > > > >> > > >>>>>> > > ----
>> > > > >> > > >>>>>> > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
>> > > > >> > > >>>>>> > >
>> > > > >> > > >>>>>> > > On Thu, Apr 21, 2011 at 4:58 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>> > > > >> > > >>>>>> > >
>> > > > >> > > >>>>>> > > > My request would be to make the distributed scan
>> > > > >> > > >>>>>> > > > identifiable from the server side.
>> > > > >> > > >>>>>> > > > :-)
>> > > > >> > > >>>>>> > > >
>> > > > >> > > >>>>>> > > > On Thu, Apr 21, 2011 at 5:45 AM, Alex Baranau <alex.baranov.v@gmail.com> wrote:
>> > > > >> > > >>>>>> > > >
>> > > > >> > > >>>>>> > > > > > Basically bucketsCount may not equal number of regions
>> > > > >> > > >>>>>> > > > > > for the underlying table.
>> > > > >> > > >>>>>> > > > >
>> > > > >> > > >>>>>> > > > > True: e.g. when there's only one region that holds data
>> > > > >> > > >>>>>> > > > > for the whole table (not many records in the table yet), a
>> > > > >> > > >>>>>> > > > > distributed scan will fire N scans against the same
>> > > > >> > > >>>>>> > > > > region. On the other hand, in case there is a huge number
>> > > > >> > > >>>>>> > > > > of regions for a single table, each scan can span multiple
>> > > > >> > > >>>>>> > > > > regions.
>> > > > >> > > >>>>>> > > > >
>> > > > >> > > >>>>>> > > > > > I need to deal with normal scan and "distributed scan"
>> > > > >> > > >>>>>> > > > > > at server side.
>> > > > >> > > >>>>>> > > > >
>> > > > >> > > >>>>>> > > > > With the current implementation a "distributed" scan won't
>> > > > >> > > >>>>>> > > > > be recognized as something special on the server side. It
>> > > > >> > > >>>>>> > > > > will be an ordinary scan. Though the number of scans will
>> > > > >> > > >>>>>> > > > > increase, given that the typical situation is "many regions
>> > > > >> > > >>>>>> > > > > for a single table", the scans of the same "distributed
>> > > > >> > > >>>>>> > > > > scan" are likely not to hit the same region.
>> > > > >> > > >>>>>> > > > >
>> > > > >> > > >>>>>> > > > > Not sure if I answered your questions here. Feel free to
>> > > > >> > > >>>>>> > > > > ask more ;)
>> > > > >> > > >>>>>> > > > >
>> > > > >> > > >>>>>> > > > > Alex Baranau
>> > > > >> > > >>>>>> > > > > ----
>> > > > >> > > >>>>>> > > > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
>> > > > >> > > >>>>>> > > > >
>> > > > >> > > >>>>>> > > > > On Wed, Apr 20, 2011 at 2:10 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>> > > > >> > > >>>>>> > > > >
>> > > > >> > > >>>>>> > > > > > Alex:
>> > > > >> > > >>>>>> > > > > > If you read this, you would know why I asked:
>> > > > >> > > >>>>>> > > > > > https://issues.apache.org/jira/browse/HBASE-3679
>> > > > >> > > >>>>>> > > > > >
>> > > > >> > > >>>>>> > > > > > I need to deal with normal scan and "distributed scan"
>> > > > >> > > >>>>>> > > > > > on the server side. Basically bucketsCount may not equal
>> > > > >> > > >>>>>> > > > > > the number of regions for the underlying table.
>> > > > >> > > >>>>>> > > > > >
>> > > > >> > > >>>>>> > > > > > Cheers
>> > > > >> > > >>>>>> > > > > >
>> > > > >> > > >>>>>> > > > > > On Tue, Apr 19, 2011 at 11:11 PM, Alex Baranau <alex.baranov.v@gmail.com> wrote:
>> > > > >> > > >>>>>> > > > > >
>> > > > >> > > >>>>>> > > > > > > Hi Ted,
>> > > > >> > > >>>>>> > > > > > >
>> > > > >> > > >>>>>> > > > > > > We currently use this tool in the scenario where data
>> > > > >> > > >>>>>> > > > > > > is consumed by MapReduce jobs, so we haven't tested the
>> > > > >> > > >>>>>> > > > > > > performance of a pure "distributed scan" (i.e. N scans
>> > > > >> > > >>>>>> > > > > > > instead of 1) a lot. I expect it to be close to simple
>> > > > >> > > >>>>>> > > > > > > scan performance, or maybe sometimes even faster
>> > > > >> > > >>>>>> > > > > > > depending on your data access patterns. E.g. in case
>> > > > >> > > >>>>>> > > > > > > you write timeseries (sequential) data which is written
>> > > > >> > > >>>>>> > > > > > > into a single region at a time, then if you access the
>> > > > >> > > >>>>>> > > > > > > delta for further processing/analysis (esp. from more
>> > > > >> > > >>>>>> > > > > > > than a single client), these scans are likely to hit
>> > > > >> > > >>>>>> > > > > > > the same region or a couple of regions at a time, which
>> > > > >> > > >>>>>> > > > > > > may perform worse compared to many scans hitting data
>> > > > >> > > >>>>>> > > > > > > that is much better spread over region servers.
>> > > > >> > > >>>>>> > > > > > >
>> > > > >> > > >>>>>> > > > > > > As for a map-reduce job, the approach should not affect
>> > > > >> > > >>>>>> > > > > > > reading performance at all: it's just that there are
>> > > > >> > > >>>>>> > > > > > > bucketsCount times more splits and hence bucketsCount
>> > > > >> > > >>>>>> > > > > > > times more Map tasks. In many cases this even improves
>> > > > >> > > >>>>>> > > > > > > overall performance of the MR job since work is better
>> > > > >> > > >>>>>> > > > > > > distributed over the cluster (esp. in the situation
>> > > > >> > > >>>>>> > > > > > > when the aim is to constantly process the incoming
>> > > > >> > > >>>>>> > > > > > > delta, which usually resides in one or just a couple of
>> > > > >> > > >>>>>> > > > > > > regions depending on processing frequency).
>> > > > >> > > >>>>>> > > > > > >
>> > > > >> > > >>>>>> > > > > > > If you can share details on your case, that will help
>> > > > >> > > >>>>>> > > > > > > to understand what effect(s) to expect from using this
>> > > > >> > > >>>>>> > > > > > > approach.
>> > > > >> > > >>>>>> > > > > > >
>> > > > >> > > >>>>>> > > > > > > Alex Baranau
>> > > > >> > > >>>>>> > > > > > > ----
>> > > > >> > > >>>>>> > > > > > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
>> > > > >> > > >>>>>> > > > > > >
>> > > > >> > > >>>>>> > > > > > > On Wed, Apr 20, 2011 at 8:17 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>> > > > >> > > >>>>>> > > > > > >
>> > > > >> > > >>>>>> > > > > > > > Interesting project, Alex.
>> > > > >> > > >>>>>> > > > > > > > Since there are bucketsCount scanners compared to one
>> > > > >> > > >>>>>> > > > > > > > scanner originally, have you performed load testing
>> > > > >> > > >>>>>> > > > > > > > to see the impact ?
>> > > > >> > > >>>>>> > > > > > > >
>> > > > >> > > >>>>>> > > > > > > > Thanks
>> > > > >> > > >>>>>> > > > > > > >
>> > > > >> > > >>>>>> > > > > > > > On Tue, Apr 19, 2011 at 10:25 AM, Alex Baranau <alex.baranov.v@gmail.com> wrote:
>> > > > >> > > >>>>>> > > > > > > >
>> > > > >> > > >>>>>> > > > > > > > > Hello guys,
>> > > > >> > > >>>>>> > > > > > > > >
>> > > > >> > > >>>>>> > > > > > > > > I'd like to introduce a new small java project/lib
>> > > > >> > > >>>>>> > > > > > > > > around HBase: HBaseWD. It is aimed to help with
>> > > > >> > > >>>>>> > > > > > > > > distribution of the load (across regionservers)
>> > > > >> > > >>>>>> > > > > > > > > when writing sequential (because of the row key
>> > > > >> > > >>>>>> > > > > > > > > nature) records. It implements the solution which
>> > > > >> > > >>>>>> > > > > > > > > was discussed several times on this mailing list
>> > > > >> > > >>>>>> > > > > > > > > (e.g. here: http://search-hadoop.com/m/gNRA82No5Wk).
>> > > > >> > > >>>>>> > > > > > > > >
>> > > > >> > > >>>>>> > > > > > > > > Please find the sources at
>> > > > >> > > >>>>>> > > > > > > > > https://github.com/sematext/HBaseWD (there's also a
>> > > > >> > > >>>>>> > > > > > > > > jar of the current version for convenience). It is
>> > > > >> > > >>>>>> > > > > > > > > very easy to make use of it: e.g. I added it to one
>> > > > >> > > >>>>>> > > > > > > > > existing project with 1+2 lines of code (one where I
>> > > > >> > > >>>>>> > > > > > > > > write to HBase and 2 for configuring the MapReduce
>> > > > >> > > >>>>>> > > > > > > > > job).
>> > > > >> > > >>>>>> > > > > > > > >
>> > > > >> > > >>>>>> > > > > > > > > Any feedback is highly appreciated!
>> > > > >> > > >>>>>> > > > > > > > >
>> > > > >> > > >>>>>> > > > > > > > > Please find below the short intro to the lib [1].
>> > > > >> > > >>>>>> > > > > > > > >
>> > > > >> > > >>>>>> > > > > > > > > Alex Baranau
>> > > > >> > > >>>>>> > > > > > > > > ----
>> > > > >> > > >>>>>> > > > > > > > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
>> > > > >> > > >>>>>> > > > > > > > >
>> > > > >> > > >>>>>> > > > > > > > > [1]
>> > > > >> > > >>>>>> > > > > > > > >
>> > > > >> > > >>>>>> > > > > > > > > Description:
>> > > > >> > > >>>>>> > > > > > > > > ------------
>> > > > >> > > >>>>>> > > > > > > > > HBaseWD stands for Distributing (sequential)
>> > > > >> > > >>>>>> > > > > > > > > Writes. It was inspired by discussions on HBase
>> > > > >> > > >>>>>> > > > > > > > > mailing lists around the problem of choosing
>> > > > >> > > >>>>>> > > > > > > > > between:
>> > > > >> > > >>>>>> > > > > > > > > * writing records with sequential row keys (e.g.
>> > > > >> > > >>>>>> > > > > > > > >   time-series data with row key built based on ts)
>> > > > >> > > >>>>>> > > > > > > > > * using random unique IDs for records
>> > > > >> > > >>>>>> > > > > > > > > The first approach makes it possible to perform
>> > > > >> > > >>>>>> > > > > > > > > fast range scans by setting start/stop keys on the
>> > > > >> > > >>>>>> > > > > > > > > Scanner, but creates a single region server
>> > > > >> > > >>>>>> > > > > > > > > hot-spotting problem upon writing data (as row keys
>> > > > >> > > >>>>>> > > > > > > > > go in sequence, all records end up written into a
>> > > > >> > > >>>>>> > > > > > > > > single region at a time).
>> > > > >> > > >>>>>> > > > > > > > >
>> > > > >> > > >>>>>> > > > > > > > > The second approach aims for the fastest writing
>> > > > >> > > >>>>>> > > > > > > > > performance by distributing new records over random
>> > > > >> > > >>>>>> > > > > > > > > regions, but makes it impossible to do fast range
>> > > > >> > > >>>>>> > > > > > > > > scans against the written data.
>> > > > >> > > >>>>>> > > > > > > > >
>> > > > >> > > >>>>>> > > > > > > > > The suggested approach stays in the middle of the
>> > > > >> > > >>>>>> > > > > > > > > two above and has proved to perform well by
>> > > > >> > > >>>>>> > > > > > > > > distributing records over the cluster during data
>> > > > >> > > >>>>>> > > > > > > > > writing while still allowing range scans over them.
>> > > > >> > > >>>>>> > > > > > > > > HBaseWD provides a very simple API to work with,
>> > > > >> > > >>>>>> > > > > > > > > which makes it perfect to use with existing code.
>> > > > >> > > >>>>>> > > > > > > > >
>> > > > >> > > >>>>>> > > > > > > > > Please refer to the unit tests for lib usage info,
>> > > > >> > > >>>>>> > > > > > > > > as they are aimed to act as examples.
>> > > > >> > > >>>>>> > > > > > > > >
>> > > > >> > > >>>>>> > > > > > > > > Brief Usage Info (Examples):
>> > > > >> > > >>>>>> > > > > > > > > ----------------------------
>> > > > >> > > >>>>>> > > > > > > > >
>> > > > >> > > >>>>>> > > > > > > > > Distributing records with sequential keys which are
>> > > > >> > > >>>>>> > > > > > > > > being written into up to Byte.MAX_VALUE buckets:
>> > > > >> > > >>>>>> > > > > > > > >
>> > > > >> > > >>>>>> > > > > > > > >    byte bucketsCount = (byte) 32; // distributing into 32 buckets
>> > > > >> > > >>>>>> > > > > > > > >    RowKeyDistributor keyDistributor =
>> > > > >> > > >>>>>> > > > > > > > >        new RowKeyDistributorByOneBytePrefix(bucketsCount);
>> > > > >> > > >>>>>> > > > > > > > >    for (int i = 0; i < 100; i++) {
>> > > > >> > > >>>>>> > > > > > > > >      Put put = new Put(keyDistributor.getDistributedKey(originalKey));
>> > > > >> > > >>>>>> > > > > > > > >      ... // add values
>> > > > >> > > >>>>>> > > > > > > > >      hTable.put(put);
>> > > > >> > > >>>>>> > > > > > > > >    }
>> > > > >> > > >>>>>> > > > > > > > >
>> > > > >> > > >>>>>> > > > > > > > > Performing a range scan over written data
>> > > > >> > > >>>>>> > > > > > > > > (internally <bucketsCount> scanners are executed):
>> > > > >> > > >>>>>> > > > > > > > >
>> > > > >> > > >>>>>> > > > > > > > >    Scan scan = new Scan(startKey, stopKey);
>> > > > >> > > >>>>>> > > > > > > > >    ResultScanner rs = DistributedScanner.create(hTable, scan,
>> > > > >> > > >>>>>> > > > > > > > >        keyDistributor);
>> > > > >> > > >>>>>> > > > > > > > >    for (Result current : rs) {
>> > > > >> > > >>>>>> > > > > > > > >      ...
>> > > > >> > > >>>>>> > > > > > > > >    }
>> > > > >> > > >>>>>> > > > > > > > > Performing a mapreduce job over a written data
>> > > > >> > > >>>>>> > > > > > > > > chunk specified by a Scan:
>> > > > >> > > >>>>>> > > > > > > > >
>> > > > >> > > >>>>>> > > > > > > > >    Configuration conf = HBaseConfiguration.create();
>> > > > >> > > >>>>>> > > > > > > > >    Job job = new Job(conf, "testMapreduceJob");
>> > > > >> > > >>>>>> > > > > > > > >
>> > > > >> > > >>>>>> > > > > > > > >    Scan scan = new Scan(startKey, stopKey);
>> > > > >> > > >>>>>> > > > > > > > >
>> > > > >> > > >>>>>> > > > > > > > >    TableMapReduceUtil.initTableMapperJob("table", scan,
>> > > > >> > > >>>>>> > > > > > > > >        RowCounterMapper.class, ImmutableBytesWritable.class,
>> > > > >> > > >>>>>> > > > > > > > >        Result.class, job);
>> > > > >> > > >>>>>> > > > > > > > >
>> > > > >> > > >>>>>> > > > > > > > >    // Substituting standard TableInputFormat which was set in
>> > > > >> > > >>>>>> > > > > > > > >    // TableMapReduceUtil.initTableMapperJob(...)
>> > > > >> > > >>>>>> > > > > > > > >    job.setInputFormatClass(WdTableInputFormat.class);
>> > > > >> > > >>>>>> > > > > > > > >    keyDistributor.addInfo(job.getConfiguration());
>> > > > >> > > >>>>>> > > > > > > > >
>> > > > >> > > >>>>>> > > > > > > > > Extending Row Keys Distributing Patterns:
>> > > > >> > > >>>>>> > > > > > > > > -----------------------------------------
>> > > > >> > > >>>>>> > > > > > > > >
>> > > > >> > > >>>>>> > > > > > > > > HBaseWD is designed to be flexible and to support
>> > > > >> > > >>>>>> > > > > > > > > custom row key distribution approaches. To define
>> > > > >> > > >>>>>> > > > > > > > > custom row key distributing logic, just implement
>> > > > >> > > >>>>>> > > > > > > > > the AbstractRowKeyDistributor abstract class, which
>> > > > >> > > >>>>>> > > > > > > > > is really very simple:
>> > > > >> > > >>>>>> > > > > > > > >
>> > > > >> > > >>>>>> > > > > > > > >    public abstract class AbstractRowKeyDistributor implements
>> > > > >> > > >>>>>> > > > > > > > >        Parametrizable {
>> > > > >> > > >>>>>> > > > > > > > >      public abstract byte[] getDistributedKey(byte[] originalKey);
>> > > > >> > > >>>>>> > > > > > > > >      public abstract byte[] getOriginalKey(byte[] adjustedKey);
>> > > > >> > > >>>>>> > > > > > > > >      public abstract byte[][] getAllDistributedKeys(byte[] originalKey);
>> > > > >> > > >>>>>> > > > > > > > >      ... // some utility methods
>> > > > >> > > >>>>>> > > > > > > > >    }
