accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: How to choose BinId for Document partitioned index
Date Sat, 06 Feb 2016 19:20:15 GMT
If you have lots of ingesters and lots of servers, you can get *really* 
fancy: include some attribute in the data you're hashing to control how 
many servers a given client needs to write to for a batch of documents. 
This is probably overkill for most setups, though.

Guava provides a decent murmur3 implementation which will be much faster 
than your run-of-the-mill MD5 for generating the hash (which you'll mod 
by the max number of bins).
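
A minimal sketch of the hash-mod-bins step, to make the arithmetic 
concrete. It is not taken from the thread: the class and method names 
(BinIds, binFor, rowFor) and the four-digit zero padding are 
hypothetical, and it uses the JDK's CRC32 as a stand-in so that it runs 
without extra dependencies. With Guava on the classpath, one of its 
murmur3 hash functions (for example Hashing.murmur3_128()) would take 
the place of CRC32.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

class BinIds {
    // Hypothetical helper: maps a document's unique ID to a bin in [0, numBins).
    // CRC32 is a stand-in hash here, not the murmur3 suggested above.
    static int binFor(String docId, int numBins) {
        CRC32 crc = new CRC32();
        crc.update(docId.getBytes(StandardCharsets.UTF_8));
        // getValue() returns a non-negative long, so the remainder is non-negative.
        return (int) (crc.getValue() % numBins);
    }

    // Zero-pads the bin so row keys sort lexicographically.
    // The padding width of 4 is an assumption for illustration.
    static String rowFor(String docId, int numBins) {
        return String.format("%04d", binFor(docId, numBins));
    }

    public static void main(String[] args) {
        int numBins = 16;
        for (String id : new String[] {"doc-1", "doc-2", "doc-3"}) {
            System.out.println(id + " -> bin " + rowFor(id, numBins));
        }
    }
}
```

The same document ID always lands in the same bin, so changing the 
number of bins later moves documents between bins.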

William Slacum wrote:
> Often it'll be a hash of the document mod the number of bins you're
> using. The hash should be "good" in the sense that it uniquely
> identifies the document. It can be as simple as some unique field in the
> document or just a hash (like murmur) of the whole document.
>
> On Saturday, February 6, 2016, Jamie Johnson <jej2003@gmail.com
> <mailto:jej2003@gmail.com>> wrote:
>
>     Just found this excellent write up that explains a bit.
>
>     https://www.slideshare.net/mobile/acordova00/text-indexing-in-accumulo
>
>     On Feb 6, 2016 8:52 AM, "Jamie Johnson" <jej2003@gmail.com> wrote:
>
>         Reading the examples for table design I've come across a
>         question associated with the document partitioned index,
>         specifically what is typically chosen as the BinId or maybe more
>         appropriately what factors should influence what is chosen as
>         the BinId and what impact do they have?
>
