From: Jeremy Kepner
To: dev@accumulo.apache.org
Date: Sat, 21 Jun 2014 11:59:27 -0400
Subject: Re: better presplitting

I would encourage the community to figure this out, for the following reason. As other databases adopt Accumulo's security features, performance becomes Accumulo's primary differentiator. Other NoSQL databases have let performance slide in favor of adding more features, and the gap between Accumulo's performance and theirs is growing. There are many applications where Accumulo can do on one node what it would take 20 or more nodes to do with another technology. That said, the SQL and NewSQL communities have not been idle, and there are some fairly high-performance competitors out there. In the future, I believe Accumulo's primary performance competition will come from the SQL and NewSQL communities.

The key to performance is optimization, and the key to optimization is how quickly you can make a performance measurement. The IEEE HPEC paper was able to get its results because we can collect an accurate performance number at scale in a few minutes. However, for the largest results, pre-splitting took almost an hour. If we can remove the pre-splitting bottleneck, we will be able to test performance at scale very quickly, which will allow us to maintain Accumulo's impressive performance.

My $0.02

P.S. I should add that the next biggest issue was the WAL, which we had to turn off because it made things unstable at extreme insert rates.
I think if we solve the pre-splitting issue, it will be a lot easier to attack the WAL issue.

On Sat, Jun 21, 2014 at 11:46:14AM -0400, Keith Turner wrote:
> On Fri, Jun 20, 2014 at 11:52 PM, ivan.bella wrote:
> >
> > Right... pre-splitting more gradually might be worthwhile...
> >
>
> Yeah, if balancing is a problem, adding 128 splits that are evenly
> distributed and letting those spread would probably help a lot. After the
> 128 spread, then add the rest.
>
> I did the following in 1.4.0 and was able to add 100,000 splits in ~4 mins
> using 16 threads. I think I merged this code into 1.4.0 with a default of
> 16 threads. I wonder what has changed. This is an example of another
> targeted performance test we need to check for regressions.
>
> https://github.com/keith-turner/Accumulo-Parallel-Splitter
>
> In addition to balancing, for 1.5 and 1.6, hsync and ACCUMULO-2766 may be
> contributing to some of the slowness. Each split does 2 synchronous writes
> to the metadata table, each of which results in an hsync. If an hsync takes
> 50 ms and there are 16 threads adding splits, then 2 * 50 ms * 100,000 / 16
> = 625 seconds. However, with group commit not working properly, these
> numbers may be worse, as all of the parallel writes to metadata from
> tservers splitting would have to wait on each other.
>
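(For reference, the core of the parallel approach Keith describes is just to
carve the desired split points into chunks and push each chunk through
TableOperations.addSplits() from its own thread. Below is a minimal sketch of
that idea against the 1.x client API; the class and variable names are
illustrative, and this is not the code from Keith's repo.)

    import java.util.List;
    import java.util.SortedSet;
    import java.util.TreeSet;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.hadoop.io.Text;

    public class ParallelSplitSketch {

      // Carve the split points into numThreads chunks and submit each chunk
      // via addSplits() from its own thread.
      public static void addSplitsInParallel(final Connector conn, final String table,
          List<Text> splits, int numThreads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        int chunkSize = (splits.size() + numThreads - 1) / numThreads;

        for (int start = 0; start < splits.size(); start += chunkSize) {
          int end = Math.min(start + chunkSize, splits.size());
          final SortedSet<Text> chunk = new TreeSet<Text>(splits.subList(start, end));
          pool.submit(new Runnable() {
            public void run() {
              try {
                // Each addSplits() call still waits on the tserver/metadata
                // writes, but overlapping many calls hides a lot of latency.
                conn.tableOperations().addSplits(table, chunk);
              } catch (Exception e) {
                e.printStackTrace();
              }
            }
          });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
      }
    }

(Note that early on, chunks of a sorted split list still mostly target the
same original tablet, which is where the midpoint and wave ideas further down
the thread come in.)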
> > -------- Original message --------
> > From: dlmarion <dlmarion@comcast.net>
> > Date: 06/20/2014 7:26 PM (GMT-05:00)
> > To: dev@accumulo.apache.org
> > Subject: Re: better presplitting
> >
> > We have always had issues with splitting taking a long time. It's a
> > serial process that has to compete with the balancer for a lock on the
> > metadata table. At least in 1.4, anyway; my information may be outdated.
> > Trying to add threads to create splits in parallel was never faster. It
> > would be nice if you could manually acquire a lock on the metadata table
> > in the shell, add all your split points, then release the lock and let
> > the tservers figure it out. In this case you could parallelize the
> > splitting by avoiding splitting the last tablet, instead splitting at the
> > midpoint between the last split and the end of the last tablet.
> >
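(One way to read the midpoint idea: instead of adding splits in key order,
which keeps hammering the last tablet, add them in binary-subdivision rounds
so that each round's splits fall on the distinct tablets created by the
previous round. A rough sketch of grouping the splits into such rounds; the
names are illustrative and this is not from anyone's actual code.)

    import java.util.ArrayList;
    import java.util.List;
    import java.util.SortedSet;
    import java.util.TreeSet;

    import org.apache.hadoop.io.Text;

    public class MidpointRounds {

      // Group a sorted list of split points into rounds: round 1 is the
      // overall midpoint, round 2 the quartile points, and so on. Splits
      // within a round fall in disjoint ranges, so they land on different
      // tablets; feed each round to TableOperations.addSplits() in turn.
      public static List<SortedSet<Text>> rounds(List<Text> sortedSplits) {
        List<SortedSet<Text>> rounds = new ArrayList<SortedSet<Text>>();
        List<int[]> ranges = new ArrayList<int[]>();
        ranges.add(new int[] {0, sortedSplits.size()});

        while (!ranges.isEmpty()) {
          SortedSet<Text> round = new TreeSet<Text>();
          List<int[]> next = new ArrayList<int[]>();
          for (int[] r : ranges) {
            if (r[0] >= r[1])
              continue;
            int mid = (r[0] + r[1]) / 2;
            round.add(sortedSplits.get(mid));    // midpoint of this range
            next.add(new int[] {r[0], mid});     // left half, next round
            next.add(new int[] {mid + 1, r[1]}); // right half, next round
          }
          if (!round.isEmpty())
            rounds.add(round);
          ranges = next;
        }
        return rounds;
      }
    }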
> > -------- Original message --------
> > From: Josh Elser <josh.elser@gmail.com>
> > Date: 06/20/2014 6:33 PM (GMT-05:00)
> > To: dev@accumulo.apache.org
> > Subject: Re: better presplitting
> >
> > On Jun 20, 2014 12:41 PM, "Sean Busbey" wrote:
> > >
> > > When you add splits, they definitely start out on the server that is
> > > hosting the tablet that has to split apart. They have to, since the
> > > tablet that hosted the previous key extent is the only one that can
> > > properly handle requests for the new key extents.
> > >
> > > We've run into this consistently when doing any testing that requires
> > > pre-splitting for perf reasons.
> >
> > I'd have to pull up the split code, but it seems like a simple fix could
> > be to let all but one result of the split of a tablet remain local. That
> > way the current server doesn't get bogged down, and the master would just
> > use the regular assignment path instead of waiting for the balancer to
> > kick in.
> >
> > Maybe there's a reason this doesn't work though :)
> >
> > > In the case of the YCSB tests, Mike scripted some nice manual
> > > pre-splitting in waves:
> > >
> > > * split the table into X parts
> > > * wait for balancing
> > > * split each of the X parts into Y parts
> > > * wait for balancing
> > >
> > > Presuming the goal is to end up with X*Y pre-splits, this was way faster
> > > than just asking for the total right off the bat.
> > >
> > > We could generally look at improving the migration code to handle these
> > > reassignments faster, but how often does this situation come up for
> > > people who aren't making a new table? If the "do this offline" feature
> > > speeds up the new-table use case enough, I'm not sure optimizing the
> > > migration path is worth the time investment right now.
> > >
> > > On Fri, Jun 20, 2014 at 3:09 PM, Josh Elser wrote:
> > >
> > > > bq. They all started out on one server
> > > >
> > > > This seems... weird. It would be good to start addressing this by
> > > > identifying what the actual balancer code does so we can immediately
> > > > start to test the assertions. We can then use the results to identify
> > > > the deficiencies that exist.
> > > >
> > > > I think the 200 splits per server was an Eric quote from some time ago
> > > > (1.4-ish, maybe 1.5). I think this is relative to a bunch of things,
> > > > workload and memory available most notably, and would be good to
> > > > quantify too.
> > > >
> > > > On 6/20/14, 11:58 AM, Sean Busbey wrote:
> > > >
> > > > > One thing that jumped out from the most recent D4M paper was this
> > > > > quote:
> > > > >
> > > > > One issue that was encountered is that after creating the
> > > > > pre-splits, they all started out on one server. Accumulo load
> > > > > balanced the splits across its servers at a rate of ~50
> > > > > splits/second, which is more than adequate for normal operation,
> > > > > but can take ~20 minutes for 50,000 pre-splits. [1]
> > > > >
> > > > > Do we already have an open ticket that would help this? I think
> > > > > maybe there's one about being able to pre-split a table that is
> > > > > offline?
> > > > >
> > > > > I believe our recommended sweet spot is like 100-200 tablets per
> > > > > server (though I can't find the reference for *why* I believe this
> > > > > ATM), which means for clusters in the ~100s of nodes this would be
> > > > > in the ballpark for an expected number of pre-splits.
> > > > >
> > > > > [1]: arXiv:1406.4923v1 [cs.DB]
> > >
> > > --
> > > Sean
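(The waves Sean describes above are also easy to script against the client
API. A minimal sketch follows; the "wait for balancing" step is crudely
stubbed as a sleep, and all names are illustrative.)

    import java.util.List;
    import java.util.SortedSet;
    import java.util.TreeSet;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.hadoop.io.Text;

    public class WaveSplitSketch {

      // Add a small, evenly spaced subset of the splits first, give the
      // balancer time to spread the resulting tablets, then add the rest.
      public static void addSplitsInWaves(Connector conn, String table,
          List<Text> sortedSplits, int firstWaveSize, long balanceWaitMs) throws Exception {
        SortedSet<Text> wave1 = new TreeSet<Text>();
        int stride = Math.max(1, sortedSplits.size() / firstWaveSize);
        for (int i = stride; i < sortedSplits.size(); i += stride)
          wave1.add(sortedSplits.get(i));
        conn.tableOperations().addSplits(table, wave1);

        // "Wait for balancing": a real script would watch the monitor or the
        // metadata table instead of sleeping blindly.
        Thread.sleep(balanceWaitMs);

        SortedSet<Text> wave2 = new TreeSet<Text>(sortedSplits);
        wave2.removeAll(wave1);
        conn.tableOperations().addSplits(table, wave2);
      }
    }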