hbase-user mailing list archives

From Adrien Mogenet <adrien.moge...@gmail.com>
Subject Re: Questions on FuzzyRowFilter
Date Sat, 03 May 2014 08:10:22 GMT
With 4 random bytes you get 2^32 possible prefixes, so your data will be
spread well across all the regions. But the rows for a given day can then
sit under any of those prefixes, so you won't easily be able to use
distributed scans to gather what you want.

Let's say you want to spread (time+login) keys with a salt, and you expect
to retrieve events from 20140429 quickly. Then I would split the input data
among 10 "spans", spread over 10 regions and 10 region servers (i.e. a
prefix of `$random % 10`). To retrieve ordered data, I would run parallel
Scans over the 10 span groups (<00>-20140429, <01>-20140429...) and
merge-sort the results until I've got everything I expect.
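
To make the idea concrete, here is a rough sketch in plain Python. It does
not use the HBase client at all: a sorted list stands in for the table, and
the names salted_key, scan_bucket and scan_day are made up for this example.
It also assumes a hash of the login in place of `$random % 10`, only so that
the result is reproducible; the read path is the same either way.

```python
import bisect
import heapq
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_BUCKETS = 10  # the 10 "spans"


def salted_key(day, login):
    """Build a row key shaped like 'NN-<day>-<login>'."""
    # Assumption: a stable hash stands in for `$random % 10`.
    bucket = zlib.crc32(login.encode("utf-8")) % NUM_BUCKETS
    return "%02d-%s-%s" % (bucket, day, login)


def scan_bucket(rows, bucket, day):
    """Range-scan one bucket of a key-sorted row list for one day."""
    prefix = "%02d-%s-" % (bucket, day)
    keys = [key for key, _ in rows]
    start = bisect.bisect_left(keys, prefix)
    found = []
    for key, value in rows[start:]:
        if not key.startswith(prefix):
            break  # left the bucket/day range
        found.append((key[3:], value))  # strip the 'NN-' salt
    return found


def scan_day(rows, day):
    """Scan every bucket in parallel, then merge-sort the results."""
    with ThreadPoolExecutor(max_workers=NUM_BUCKETS) as pool:
        parts = list(pool.map(
            lambda bucket: scan_bucket(rows, bucket, day),
            range(NUM_BUCKETS)))
    # Each part is already sorted, so a k-way merge is enough.
    return list(heapq.merge(*parts, key=lambda kv: kv[0]))


events = [("20140429", "alice"), ("20140429", "bob"),
          ("20140430", "carol"), ("20140429", "dave")]
rows = sorted((salted_key(day, login), login) for day, login in events)
result = scan_day(rows, "20140429")
```

Here `result` holds only the 20140429 rows, ordered by their unsalted key.
The cost is 10 short range scans per query instead of one scan that has to
skip over 2^32 possible prefixes.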

So in terms of performance this looks "a little bit" faster than your 2^32

On Fri, May 2, 2014 at 10:09 PM, Software Dev <static.void.dev@gmail.com> wrote:

> I'm planning to work with FuzzyRowFilter to avoid hot spotting of our
> time series data (20140501, 20140502...).  We can prefix all of the
> keys with 4 random bytes and then just skip these during scanning. Is
> that correct? This *seems* like it will work, but I'm questioning the
> performance of this even if it does work.
> Also, is this available via the rest client, shell and/or thrift client?
> Also, is there a FuzzyColumn equivalent of this feature?

Adrien Mogenet
