From Josh Elser <>
Subject Re: Mini Accumulo cluster
Date Wed, 13 May 2015 20:19:55 GMT
As long as you're managing your expectations (which it sounds like you've 
considered well), there could be some value in it.

A concern would be how using a different filesystem implementation 
actually impacts the validity of your benchmark though.

e.g. with a local FS (which is what MAC uses by default), a disk seek 
might cost 10ms, but on your real HDFS cluster it might cost 200ms. Say 
IteratorA does more seeks but is less efficient on the retrieved data, 
while IteratorB does fewer seeks but is more efficient on the retrieved 
data. Timings taken on the local FS would then give an inaccurate picture 
of how the two compare on a production system.
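
To put rough numbers on that, here is a minimal sketch of a cost model, 
not a measurement. It assumes wall time is simply (seeks x seek cost) + 
time spent processing the retrieved data. The 10ms and 200ms seek costs 
are the illustrative figures above; the seek counts and processing times 
are made up for the example, and SeekCostModel is a hypothetical name.

```java
public class SeekCostModel {
    // Modeled wall time in ms: seeks * cost per seek + time spent on the retrieved data.
    static long wallTimeMs(long seeks, long seekCostMs, long processingMs) {
        return seeks * seekCostMs + processingMs;
    }

    public static void main(String[] args) {
        // Hypothetical workloads (assumed numbers, for illustration only).
        long seeksA = 100, processingA = 2000; // IteratorA: more seeks, less efficient on data
        long seeksB = 10,  processingB = 500;  // IteratorB: fewer seeks, more efficient on data

        long localSeekMs = 10;   // local FS, as used by MAC by default
        long hdfsSeekMs = 200;   // real HDFS cluster

        long localA = wallTimeMs(seeksA, localSeekMs, processingA);
        long localB = wallTimeMs(seeksB, localSeekMs, processingB);
        long hdfsA = wallTimeMs(seeksA, hdfsSeekMs, processingA);
        long hdfsB = wallTimeMs(seeksB, hdfsSeekMs, processingB);

        System.out.println("local FS: A=" + localA + "ms B=" + localB + "ms");
        System.out.println("HDFS:     A=" + hdfsA + "ms B=" + hdfsB + "ms");
        System.out.println("A/B ratio local=" + ((double) localA / localB)
                + " hdfs=" + ((double) hdfsA / hdfsB));
    }
}
```

With these assumed numbers the model gives 3000ms vs 600ms on the local 
FS and 22000ms vs 2500ms on HDFS, so the gap measured locally understates 
the gap in production. If the trade-off went the other way (say 100 seeks 
with 500ms of processing against 10 seeks with 2000ms), the same model 
gives 1500ms vs 2100ms locally and 20500ms vs 4000ms on HDFS, so even the 
ranking would flip.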

I guess another way to put it is that total wall time for a query might 
be deceiving in a test environment.

Dave Hardcastle wrote:
> Hi,
> Is it crazy to use a MiniAccumuloCluster to measure the *relative*
> performance of two different implementations of iterators?
> Obviously it would be better to do it on a real Accumulo cluster, but
> that's not possible for several reasons.
> The approach would be something like:
> - Fire up a Mini cluster
> - Bulk import a file
> - Start timer
> - Set up a BatchScanner with one of the iterator stacks and use it to
> query for lots of different ranges
> - Iterate through the results of this
> - Stop timer
> Repeat with the other implementation of the iterators.
> Of course, the difference in performance may not be measurable, if the
> time is dominated by the disk-seek time, but that would still be useful
> information. And the absolute performance wouldn't be representative of
> what you'd get on a real cluster as there's no network latency in these
> trials, but that's fine as I'm mainly interested in which of the two
> implementations of the iterators is most performant.
> Similarly, could the same approach be used to compare the performance on
> SSD vs hard disk?
> Thanks,
> Dave.
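
The start-timer / scan / stop-timer steps quoted above could be wrapped 
in a small harness along these lines. This is only a sketch using the 
JDK, with no Accumulo calls in it: IteratorTimer and medianNanos are 
hypothetical names, and the workload is assumed to be a function that 
runs the BatchScanner over all the ranges with one iterator stack and 
returns the number of entries it read. The warm-up runs and the use of a 
median over several trials are additions that the quoted steps don't 
mention.

```java
import java.util.Arrays;
import java.util.function.LongSupplier;

public class IteratorTimer {
    // Accumulates workload results so the work cannot be optimized away.
    static long sink = 0;

    // Runs the workload `warmups` times untimed, then `trials` times timed,
    // and returns the median elapsed time in nanoseconds.
    static long medianNanos(LongSupplier workload, int warmups, int trials) {
        if (trials < 1) {
            throw new IllegalArgumentException("trials must be >= 1");
        }
        for (int i = 0; i < warmups; i++) {
            sink += workload.getAsLong();
        }
        long[] elapsed = new long[trials];
        for (int i = 0; i < trials; i++) {
            long start = System.nanoTime();   // start timer
            sink += workload.getAsLong();     // scan and iterate through the results
            elapsed[i] = System.nanoTime() - start; // stop timer
        }
        Arrays.sort(elapsed);
        return elapsed[trials / 2];
    }

    public static void main(String[] args) {
        // Stand-in workload; a real one would scan the Mini cluster instead.
        LongSupplier fakeScan = () -> {
            long count = 0;
            for (int i = 0; i < 1_000_000; i++) {
                count += i % 7;
            }
            return count;
        };
        System.out.println("median ns: " + medianNanos(fakeScan, 3, 9));
    }
}
```

Running it once per iterator stack against the same bulk-imported data 
gives two medians to compare, subject to the filesystem caveat above.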

