Mailing-List: contact user-help@accumulo.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@accumulo.apache.org
Message-ID: <55BA5DDB.1000300@gmail.com>
Date: Thu, 30 Jul 2015 13:24:43 -0400
From: Josh Elser <josh.elser@gmail.com>
User-Agent: Postbox 3.0.11 (Macintosh/20140602)
MIME-Version: 1.0
To: user@accumulo.apache.org
Subject: Re: Entry-based TableBalancer
References: <4dngtp202cqgunwk6tkjj1cd.1438218372168@email.android.com>
 <CAJjD1ePnNs8LZjU4zYmz_jY415yshL7ZwKtu4cjEpWx4p4Rqwg@mail.gmail.com>
 <55B9A59D.7040209@orkash.com>
 <CAJjD1eNFhEUf=XPJZDQfkPJKSKfF7Dk5O2i_yqvkJ8sVjyoLcg@mail.gmail.com>
In-Reply-To: 
 <CAJjD1eNFhEUf=XPJZDQfkPJKSKfF7Dk5O2i_yqvkJ8sVjyoLcg@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit


Konstantin Pelykh wrote:
> Thanks for the suggestion; below are some details explaining the reason
> for such a balancer:
> I'm basing my application on the accumulo-wikipedia example, so there can
> be multiple partitions per tablet. Some partitions are larger, others are
> smaller.

Are you talking about the "sharded" table or the "inverted index" table? 
Assuming you mean the "sharded" table (given your mention of 
partitions), a skew here implies a poor choice of a partitioning 
algorithm. How are you choosing the partitions at ingest time? 
Hash-based? Something else?

A good hash used to generate your partitions at ingest time should 
prevent such skew at query time.
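
As a rough sketch of what I mean by hash-based partitioning (this is not 
code from the wikipedia example; the function names, the "doc-N" IDs, 
the partition count of 8 and the choice of SHA-256 are all assumptions 
for illustration), you could derive the partition from a stable hash of 
the document ID and check how even the result is:

```python
import hashlib
from collections import Counter


def partition_for(doc_id: str, num_partitions: int) -> int:
    """Map a document ID to a partition using a stable, well-mixed hash.

    hashlib is used instead of the built-in hash() so the result is the
    same across processes and runs.
    """
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def partition_counts(doc_ids, num_partitions):
    """Count how many document IDs land in each partition."""
    counts = Counter(partition_for(d, num_partitions) for d in doc_ids)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Hypothetical document IDs; real IDs would come from the ingest source.
counts = partition_counts((f"doc-{i}" for i in range(100_000)), 8)
print(counts)
```

If the counts per partition come out close to each other, the partition 
choice itself is probably not the source of the skew. Note this only 
evens out the number of documents per partition; documents that differ 
a lot in size can still leave partitions uneven in entries.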

> There's a possibility to split the partition range manually after
> ingestion is complete and rely on the default balancer to spread tablets
> across the cluster; however, in this case some servers end up overloaded
> compared to others.
> Currently the slowest server (hosting the largest tablet) determines the
> final time for a search query, so I want to distribute entities across the
> cluster so that they are well balanced and all servers spend a similar
> amount of time processing documents through OptimizedQueryIterators.
>
> Konstantin
> --------
> Big Data / Search Consultant
> LinkedIn: linkedin.com/in/kpelykh <http://www.linkedin.com/in/kpelykh>
> Website: www.kpelykh.com <http://www.kpelykh.com>
>
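
For what it's worth, the placement you describe can be sketched as a 
simple greedy assignment: take tablets from largest to smallest and put 
each one on the server with the fewest entries so far. This is only an 
illustration of the idea, not Accumulo balancer code; the function name 
and the server names are made up, and the tablet sizes are the ones 
from your original example.

```python
import heapq


def assign_by_entries(tablet_entries, servers):
    """Greedy placement: largest tablet first, onto the least-loaded server.

    tablet_entries maps tablet name -> entry count.
    Returns a dict mapping server name -> list of tablet names.
    """
    # Min-heap of (entries currently assigned, server name).
    heap = [(0, s) for s in servers]
    heapq.heapify(heap)
    assignment = {s: [] for s in servers}

    largest_first = sorted(
        tablet_entries.items(), key=lambda kv: kv[1], reverse=True
    )
    for tablet, entries in largest_first:
        load, server = heapq.heappop(heap)
        assignment[server].append(tablet)
        heapq.heappush(heap, (load + entries, server))
    return assignment


placement = assign_by_entries(
    {"tablet1": 15_000_000, "tablet2": 5_000_000, "tablet3": 8_000_000},
    ["server1", "server2"],
)
print(placement)
# {'server1': ['tablet1'], 'server2': ['tablet3', 'tablet2']}
```

A one-shot computation like this fits the write-once case, since entry 
counts stop changing after ingest. It says nothing about how quickly 
tablets should actually be moved.
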
> On Wed, Jul 29, 2015 at 9:18 PM, mohit.kaushik <mohit.kaushik@orkash.com
> <mailto:mohit.kaushik@orkash.com>> wrote:
>
>     If I am not getting you wrong, for this purpose, you can simply
>     pre-split tables based on range to evenly distribute data across
>     tablets.
>     https://accumulo.apache.org/1.7/accumulo_user_manual.html#_pre_splitting_tables
>
>
>
>
>     On 07/30/2015 07:46 AM, Konstantin Pelykh wrote:
>>     In this specific case, ingest happens only once. It's write-once,
>>     read-many type of application, so with such balancer I would want
>>     to balance tablets based on number of entities after ingest is
>>     fully complete.
>>
>>     --------
>>     Big Data / Search Consultant
>>     Cell: +1 (646) 639-3916
>>     E-mail: kpelykh@gmail.com <mailto:kpelykh@gmail.com>
>>     LinkedIn: linkedin.com/in/kpelykh <http://www.linkedin.com/in/kpelykh>
>>     Website: www.kpelykh.com <http://www.kpelykh.com>
>>
>>     On Wed, Jul 29, 2015 at 6:06 PM, dlmarion <dlmarion@comcast.net
>>     <mailto:dlmarion@comcast.net>> wrote:
>>
>>         Hotspotting was the first thing that came to my mind with the
>>         proposed balancer. The tservers don't keep all the K/V in
>>         memory. You are balancing query and live ingest across your
>>         resources.
>>
>>
>>
>>
>>
>>         -------- Original message --------
>>         From: Eric Newton <eric.newton@gmail.com
>>         <mailto:eric.newton@gmail.com>>
>>         Date: 07/29/2015 8:46 PM (GMT-05:00)
>>         To: user@accumulo.apache.org <mailto:user@accumulo.apache.org>
>>         Subject: Re: Entry-based TableBalancer
>>
>>         To my knowledge, nobody has written such a balancer.
>>
>>         In the history of the project, we started writing advanced,
>>         complicated balancers that moved tablets around much too
>>         quickly, which degraded performance. After that, we wrote much
>>         simpler balancers to avoid the chaos. We're moving back to
>>         more complex balancers, but mostly just to ensure that we
>>         aren't hotspotting, based on known ingest patterns (date
>>         related, for example).
>>
>>         If you write a new balancer, make it slow to move tablets, and
>>         very simple.  Avoid over-optimizing tablet placement.
>>
>>         -Eric
>>
>>         On Wed, Jul 29, 2015 at 8:20 PM, Konstantin Pelykh
>>         <kpelykh@gmail.com <mailto:kpelykh@gmail.com>> wrote:
>>
>>             Hi,
>>
>>             I'm looking for a tablet balancer which operates based on
>>             a number of entries per tablet as opposed to a number of
>>             tablets per tablet server. My goal is to get even
>>             distribution of entries across the cluster.
>>
>>             As an example:
>>
>>             tablet #1  15M entries
>>             tablet #2   5M entries
>>             tablet #3   8M entries
>>
>>             After balancing tablets I would want to get:
>>
>>             Server 1 hosts: tablet1
>>             Server 2 hosts: tablet2, tablet3
>>
>>             The idea is pretty simple and I believe such balancer has
>>             already been developed, so I decided to check before
>>             reinventing the wheel.
>>
>>             Thanks!
>>             Konstantin
>>
>>             --------
>>             Big Data / Lucene and Solr Consultant
>>             LinkedIn: linkedin.com/in/kpelykh
>>             <http://www.linkedin.com/in/kpelykh>
>>             Website: www.kpelykh.com <http://www.kpelykh.com>
>>
>>
>>
>
>
>     --
>
>     *Mohit Kaushik*
>     Software Engineer
>     A Square,Plot No. 278, Udyog Vihar, Phase 2, Gurgaon 122016, India
>     *Tel:*+91 (124) 4969352 | *Fax:*+91 (124) 4033553
>
>
>