hbase-user mailing list archives

From Bill Graham <billgra...@gmail.com>
Subject Re: Tips on pre-splitting
Date Tue, 29 Mar 2011 22:11:02 GMT
> last key.  This last can probably best be done as a conventional script
> after you have knocked down the data to small size.

Yeah, the split command comes to mind.

> Note that most of your joins can just go away since all you want are the
> keys.

Sure, but you need to make sure that you're still producing a RowKey
for each key/value pair, as opposed to a distinct set of RowKeys, right?
That way the sample still accounts for an uneven distribution of
cells as well as of RowKeys.
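(Ted's down-sample-then-pick approach could be sketched roughly like this; the function name and the 10-region split count are illustrative, and the input stands in for an already down-sampled key set:)

```python
def choose_split_keys(keys, num_regions):
    """Pick num_regions - 1 split keys from a sampled key set:
    sort, then take every step-th key, omitting the first and last."""
    keys = sorted(keys)
    step = len(keys) // num_regions
    # start at index `step` so the very first key is skipped
    return [keys[i] for i in range(step, len(keys), step)][:num_regions - 1]

# stand-in for ~1000 down-sampled RowKeys
sample = ["k%04d" % i for i in range(1000)]
print(choose_split_keys(sample, 10))  # 9 split keys -> 10 regions
```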


On Tue, Mar 29, 2011 at 2:56 PM, Ted Dunning <tdunning@maprtech.com> wrote:
> It should be pretty easy to down-sample the data to have no more than
> 1000-10,000 keys.  Sort those and take every n-th key omitting the first and
> last key.  This last can probably best be done as a conventional script
> after you have knocked down the data to small size.
> Note that most of your joins can just go away since all you want are the
> keys.
>
> On Tue, Mar 29, 2011 at 2:15 PM, Bill Graham <billgraham@gmail.com> wrote:
>>
>> 1. use Pig to read in our datasets, join/filter/transform/etc before
>> writing the output back to HDFS with N reducers ordered by key, where
>> N is the number of splits we'll create.
>> 2. Manually plucking out the first key of each reducer output file to
>> make a list of split keys.
>
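(Step 2 above, plucking the first key of each ordered reducer output file, might look like the following sketch; it assumes tab-delimited text part files with the RowKey in the first column, and the glob pattern is illustrative:)

```python
import glob

def first_keys(part_file_glob):
    """Collect the first key from each sorted reducer output file,
    dropping the first file's key so N files yield N - 1 split points."""
    split_keys = []
    for path in sorted(glob.glob(part_file_glob)):
        with open(path) as f:
            line = f.readline()
            if line:
                split_keys.append(line.split("\t", 1)[0])
    # region boundaries fall *between* files, so drop the very first key
    return split_keys[1:]
```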
