hbase-user mailing list archives

From Michael Segel <msegel_had...@hotmail.com>
Subject Re: Is there a problem with having 4000 tables in a cluster?
Date Wed, 25 Sep 2013 00:57:07 GMT
Since different people use different terms... Salting is BAD. (You need to understand what
is implied by the term salt.)

What you really want to do is take the hash of the key, and then truncate the hash. Use that
instead of a salt.

Much better than a salt.
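
A minimal sketch of the idea above, with several assumptions that are not from this thread: MD5 as the hash, a 2-byte prefix, and the helper names hashed_row_key and original_key are all hypothetical choices for illustration. The point it tries to show is that a prefix derived from the key can be recomputed by any reader that knows the key, which is not true of a random salt.

```python
import hashlib

# Assumption: keep the first 2 bytes of the digest. The thread does not say
# how far to truncate; tune this to the number of regions you pre-split into.
PREFIX_BYTES = 2


def hashed_row_key(key: bytes, prefix_bytes: int = PREFIX_BYTES) -> bytes:
    """Return truncated-hash(key) + key.

    MD5 is an illustrative choice; any stable, well-distributed hash works.
    The prefix depends only on the key, so the same key always maps to the
    same row key.
    """
    digest = hashlib.md5(key).digest()
    return digest[:prefix_bytes] + key


def original_key(row_key: bytes, prefix_bytes: int = PREFIX_BYTES) -> bytes:
    """Strip the fixed-width hash prefix to recover the original key."""
    return row_key[prefix_bytes:]


if __name__ == "__main__":
    # Keys shaped like the [POSITION]_[WORD] layout from the original question.
    for position in (1, 2, 3):
        key = f"{position}_someword".encode("utf-8")
        row_key = hashed_row_key(key)
        print(row_key[:PREFIX_BYTES].hex(), original_key(row_key).decode("utf-8"))
```

Because the prefix is a function of the key, a point lookup for a known key just calls hashed_row_key again. The trade-off is that rows no longer sort by the original key, so a range scan over the original key order has to be handled some other way.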

Sent from a remote device. Please excuse any typos...

Mike Segel

> On Sep 24, 2013, at 5:17 PM, "Varun Sharma" <varun@pinterest.com> wrote:
> It's better to do some "salting" in your keys for the reduce phase.
> Basically, make your key be something like "KeyHash + Key" and then decode it
> in your reducer and write to HBase. This way you avoid the hotspotting
> problem on HBase due to MapReduce sorting.
> On Tue, Sep 24, 2013 at 2:50 PM, Jean-Marc Spaggiari <jean-marc@spaggiari.org> wrote:
>> Hi Jeremy,
>> I don't see any issue for HBase to handle 4000 tables. However, I don't
>> think it's the best solution for your use case.
>> JM
>> 2013/9/24 jeremy p <athomewithagroovebox@gmail.com>
>>> Short description : I'd like to have 4000 tables in my HBase cluster. Will
>>> this be a problem?  In general, what problems do you run into when you try
>>> to host thousands of tables in a cluster?
>>> Long description : I'd like the performance advantage of pre-split tables,
>>> and I'd also like to do filtered range scans.  Imagine a keyspace where the
>>> key consists of : [POSITION]_[WORD] , where POSITION is a number from 1 to
>>> 4000, and WORD is a string consisting of 96 characters.  The value in the
>>> cell would be a single integer.  My app will examine a 'document', where
>>> each 'line' consists of 4000 WORDs.  For each WORD, it'll do a filtered
>>> regex lookup.  Only problem?  Say I have 200 mappers and they all start at
>>> POSITION 1, my region servers would get hotspotted like crazy. So my idea
>>> is to break it into 4000 tables (one for each POSITION), and then pre-split
>>> the tables such that each region gets an equal amount of the traffic.  In
>>> this scenario, the key would just be WORD.  Dunno if this is a bad idea, would
>>> be open to suggestions
>>> Thanks!
>>> --J
