hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: [ANN]: HBaseWD: Distribute Sequential Writes in HBase
Date Wed, 20 Apr 2011 05:17:22 GMT
Interesting project, Alex.
Since there are now bucketsCount scanners where there was originally one,
have you done any load testing to measure the impact?

Thanks

On Tue, Apr 19, 2011 at 10:25 AM, Alex Baranau <alex.baranov.v@gmail.com> wrote:

> Hello guys,
>
> I'd like to introduce a new small Java project/lib around HBase: HBaseWD.
> It aims to help distribute the load (across region servers) when writing
> records that are sequential (because of the nature of their row keys). It
> implements the solution which was discussed several times on this mailing
> list (e.g. here: http://search-hadoop.com/m/gNRA82No5Wk).
>
> Please find the sources at https://github.com/sematext/HBaseWD (there's
> also a jar of the current version for convenience). It is very easy to
> use: e.g. I added it to one existing project with 1+2 lines of code (one
> where I write to HBase and two for configuring the MapReduce job).
>
> Any feedback is highly appreciated!
>
> Please find below the short intro to the lib [1].
>
> Alex Baranau
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
>
> [1]
>
> Description:
> ------------
> HBaseWD stands for Distributing (sequential) Writes. It was inspired by
> discussions on HBase mailing lists around the problem of choosing between:
> * writing records with sequential row keys (e.g. time-series data with a
>   row key built from the timestamp)
> * using random unique IDs for records
>
> The first approach makes it possible to perform fast range scans by
> setting start/stop keys on the Scanner, but it creates a hot spot on a
> single region server when writing data (because the row keys arrive in
> sequence, all records end up written into a single region at a time).
>
> The second approach aims for the fastest write performance by distributing
> new records over random regions, but it makes fast range scans over the
> written data impossible.
>
> The suggested approach sits between the two above and has proved to
> perform well: it distributes records over the cluster during writes
> while still allowing range scans over them. HBaseWD provides a very simple
> API, which makes it well suited to use with existing code.
>
> Please refer to the unit tests for library usage info, as they are meant
> to serve as examples.
>
> Brief Usage Info (Examples):
> ----------------------------
>
> Distributing records with sequential keys across up to Byte.MAX_VALUE
> buckets as they are written:
>
>    byte bucketsCount = (byte) 32; // distributing into 32 buckets
>    RowKeyDistributor keyDistributor =
>        new RowKeyDistributorByOneBytePrefix(bucketsCount);
>    for (int i = 0; i < 100; i++) {
>      Put put = new Put(keyDistributor.getDistributedKey(originalKey));
>      ... // add values
>      hTable.put(put);
>    }
>
>
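To make the write path above concrete, here is a minimal sketch of how a one-byte-prefix distributor could work. It assumes the bucket is chosen round-robin, which the announcement does not state, and the class name OneBytePrefixSketch is hypothetical rather than part of HBaseWD. It uses no HBase classes, only byte arrays.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical sketch; NOT HBaseWD's actual source. */
final class OneBytePrefixSketch {
    private final byte bucketsCount;
    private byte nextPrefix = 0;

    OneBytePrefixSketch(byte bucketsCount) {
        if (bucketsCount <= 0) {
            throw new IllegalArgumentException("bucketsCount must be in 1..Byte.MAX_VALUE");
        }
        this.bucketsCount = bucketsCount;
    }

    /** Prepends the next bucket id (assumed round-robin) to the original key. */
    byte[] getDistributedKey(byte[] originalKey) {
        byte[] key = new byte[originalKey.length + 1];
        key[0] = nextPrefix;
        System.arraycopy(originalKey, 0, key, 1, originalKey.length);
        nextPrefix = (byte) ((nextPrefix + 1) % bucketsCount);
        return key;
    }

    /** Drops the one-byte prefix to recover the original key. */
    static byte[] getOriginalKey(byte[] distributedKey) {
        return Arrays.copyOfRange(distributedKey, 1, distributedKey.length);
    }

    public static void main(String[] args) {
        OneBytePrefixSketch distributor = new OneBytePrefixSketch((byte) 4);
        // Sequential "timestamp" keys end up spread over buckets 0..3.
        for (long ts = 1000; ts < 1006; ts++) {
            byte[] original = Long.toString(ts).getBytes(StandardCharsets.UTF_8);
            byte[] distributed = distributor.getDistributedKey(original);
            System.out.println("bucket " + distributed[0] + " <- " + ts);
        }
    }
}
```

Because the prefix is the leading byte of the row key, consecutive records sort into different key ranges, and so can land in different regions once the table is split along the bucket boundaries.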
> Performing a range scan over written data (internally, <bucketsCount>
> scanners are executed):
>
>    Scan scan = new Scan(startKey, stopKey);
>    ResultScanner rs =
>        DistributedScanner.create(hTable, scan, keyDistributor);
>    for (Result current : rs) {
>      ...
>    }
>
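The fan-out behind such a range scan can be sketched without HBase at all. The example below is only an illustration under stated assumptions: a TreeMap stands in for the table, keys are strings with a two-digit bucket prefix instead of a one-byte prefix, and BucketScanSketch is a hypothetical name. It is not how DistributedScanner is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory model of a bucketed table; not HBaseWD code. */
final class BucketScanSketch {
    // Sorted by distributed key, the way HBase keeps rows sorted by row key.
    private final TreeMap<String, String> table = new TreeMap<>();
    private final int bucketsCount;
    private int nextBucket = 0;

    BucketScanSketch(int bucketsCount) {
        this.bucketsCount = bucketsCount;
    }

    private static String prefix(int bucket) {
        return String.format("%02d|", bucket); // fixed width keeps ordering per bucket
    }

    void put(String originalKey, String value) {
        table.put(prefix(nextBucket) + originalKey, value);
        nextBucket = (nextBucket + 1) % bucketsCount;
    }

    /** One sub-scan per bucket over [startKey, stopKey), merged by original key. */
    List<String> scan(String startKey, String stopKey) {
        TreeMap<String, String> merged = new TreeMap<>();
        for (int b = 0; b < bucketsCount; b++) {
            String p = prefix(b);
            Map<String, String> range = table.subMap(p + startKey, p + stopKey);
            for (Map.Entry<String, String> e : range.entrySet()) {
                merged.put(e.getKey().substring(p.length()), e.getValue());
            }
        }
        return new ArrayList<>(merged.keySet());
    }

    public static void main(String[] args) {
        BucketScanSketch t = new BucketScanSketch(3);
        for (int i = 1; i <= 10; i++) {
            t.put(String.format("ts-%03d", i), "v" + i);
        }
        System.out.println(t.scan("ts-003", "ts-007"));
    }
}
```

Each logical scan turns into bucketsCount sub-scans plus a merge step, which is the extra cost the load-testing question is about.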
> Performing a MapReduce job over the chunk of written data specified by a
> Scan:
>
>    Configuration conf = HBaseConfiguration.create();
>    Job job = new Job(conf, "testMapreduceJob");
>
>    Scan scan = new Scan(startKey, stopKey);
>
>    TableMapReduceUtil.initTableMapperJob("table", scan,
>      RowCounterMapper.class, ImmutableBytesWritable.class, Result.class,
>      job);
>
>    // Substituting standard TableInputFormat which was set in
>    // TableMapReduceUtil.initTableMapperJob(...)
>    job.setInputFormatClass(WdTableInputFormat.class);
>    keyDistributor.addInfo(job.getConfiguration());
>
>
> Extending Row Keys Distributing Patterns:
> -----------------------------------------
>
> HBaseWD is designed to be flexible and to support custom row key
> distribution approaches. To define custom row key distribution logic, just
> extend the AbstractRowKeyDistributor abstract class, which is really very
> simple:
>
>    public abstract class AbstractRowKeyDistributor
>        implements Parametrizable {
>      public abstract byte[] getDistributedKey(byte[] originalKey);
>      public abstract byte[] getOriginalKey(byte[] adjustedKey);
>      public abstract byte[][] getAllDistributedKeys(byte[] originalKey);
>      ... // some utility methods
>    }
>
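As an illustration of a custom distribution pattern, here is a sketch of a hash-based prefix. Everything in it is an assumption made for the example: DistributorSketch is a local stand-in that only mirrors the three abstract methods quoted above (it is not the library's class and omits Parametrizable), and HashPrefixDistributorSketch is a hypothetical name.

```java
import java.util.Arrays;

/** Local stand-in mirroring the three quoted abstract methods; not the HBaseWD class. */
abstract class DistributorSketch {
    abstract byte[] getDistributedKey(byte[] originalKey);
    abstract byte[] getOriginalKey(byte[] adjustedKey);
    abstract byte[][] getAllDistributedKeys(byte[] originalKey);
}

/** Hypothetical distributor: the bucket is derived from a hash of the original key. */
final class HashPrefixDistributorSketch extends DistributorSketch {
    private final int bucketsCount;

    HashPrefixDistributorSketch(int bucketsCount) {
        if (bucketsCount < 1 || bucketsCount > Byte.MAX_VALUE) {
            throw new IllegalArgumentException("bucketsCount must be in 1..Byte.MAX_VALUE");
        }
        this.bucketsCount = bucketsCount;
    }

    private static byte[] withPrefix(byte prefix, byte[] key) {
        byte[] result = new byte[key.length + 1];
        result[0] = prefix;
        System.arraycopy(key, 0, result, 1, key.length);
        return result;
    }

    @Override
    byte[] getDistributedKey(byte[] originalKey) {
        // floorMod keeps the bucket non-negative even for negative hash codes.
        int bucket = Math.floorMod(Arrays.hashCode(originalKey), bucketsCount);
        return withPrefix((byte) bucket, originalKey);
    }

    @Override
    byte[] getOriginalKey(byte[] adjustedKey) {
        return Arrays.copyOfRange(adjustedKey, 1, adjustedKey.length);
    }

    @Override
    byte[][] getAllDistributedKeys(byte[] originalKey) {
        // One candidate key per bucket, e.g. to turn a scan boundary into per-bucket boundaries.
        byte[][] all = new byte[bucketsCount][];
        for (int b = 0; b < bucketsCount; b++) {
            all[b] = withPrefix((byte) b, originalKey);
        }
        return all;
    }
}
```

With a deterministic prefix like this, the same original key always maps to the same distributed key, so a single-row lookup would not need to try every bucket; range scans still need one boundary per bucket.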
