accumulo-user mailing list archives

From Dave Hardcastle <hardcastle.d...@gmail.com>
Subject Re: Mini Accumulo cluster
Date Thu, 14 May 2015 20:51:18 GMT
Josh,

Thanks for your response.

My iterators will do the same number of seeks; they differ only in
the implementation of the functions used to perform filtering. So I think
I'll get a reasonable comparison, but I won't read too much into the results.
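
To make the seek-count point concrete, here is a rough cost model in Java. It is only a sketch: `SeekCostModel` and `totalMicros` are hypothetical names, the 10ms and 200ms seek costs are the illustrative figures from the quoted reply, and the seek count, entry count and per-entry filter costs are made-up numbers, not measurements.

```java
class SeekCostModel {
    // Deliberately crude model: total time = seeks * seek cost + entries * per-entry filter cost.
    // All times are in microseconds so the arithmetic stays in exact integers.
    static long totalMicros(long seeks, long seekMicros, long entries, long filterMicrosPerEntry) {
        return seeks * seekMicros + entries * filterMicrosPerEntry;
    }

    public static void main(String[] args) {
        long seeks = 1_000L;        // assumption: identical for both iterator stacks
        long entries = 1_000_000L;  // assumption: entries passed through the filter
        long filterA = 2L;          // assumption: microseconds per entry, implementation A
        long filterB = 1L;          // assumption: microseconds per entry, implementation B

        // 10ms (local FS) and 200ms (HDFS) are the illustrative seek costs from the thread.
        for (long seekMicros : new long[] {10_000L, 200_000L}) {
            long a = totalMicros(seeks, seekMicros, entries, filterA);
            long b = totalMicros(seeks, seekMicros, entries, filterB);
            System.out.printf("seek=%dms A=%dms B=%dms gap=%dms A/B=%.3f%n",
                    seekMicros / 1000, a / 1000, b / 1000, (a - b) / 1000, (double) a / b);
        }
    }
}
```

With these made-up inputs the absolute gap between A and B is the same 1000ms at both seek costs, because the seek term cancels when both stacks seek equally often. The ratio A/B shrinks from roughly 1.09 to roughly 1.005, so the ordering should carry over but the relative speed-up would look smaller on the slower filesystem.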


On 13 May 2015 at 21:19, Josh Elser <josh.elser@gmail.com> wrote:

> As long as you're managing your expectations (which it sounds like you've
> considered well), there could be some worth.
>
> A concern would be how using a different filesystem implementation
> actually impacts the validity of your benchmark though.
>
> e.g. w/ a local FS (which is by default what MAC does), a disk seek costs
> 10ms, but using your real HDFS cluster, it's 200ms. If IteratorA does more
> seeks but is less efficient on the retrieved data, while IteratorB does
> fewer seeks but is more efficient on the retrieved data, the results would
> be inaccurate as a benchmark for a production system.
>
> I guess another way to put it is that total wall time for a query might be
> deceiving in a test environment.
>
>
> Dave Hardcastle wrote:
>
>> Hi,
>>
>> Is it crazy to use a MiniAccumuloCluster to measure the *relative*
>> performance of two different implementations of iterators?
>>
>> Obviously it would be better to do it on a real Accumulo cluster, but
>> that's not possible for several reasons.
>>
>> The approach would be something like:
>> - Fire up a Mini cluster
>> - Bulk import a file
>> - Start timer
>> - Set up a BatchScanner with one of the iterator stacks and use it to
>> query for lots of different ranges
>> - Iterate through the results of this
>> - Stop timer
>>
>> Repeat with the other implementation of the iterators.
>>
>> Of course, the difference in performance may not be measurable, if the
>> time is dominated by the disk-seek time, but that would still be useful
>> information. And the absolute performance wouldn't be representative of
>> what you'd get on a real cluster as there's no network latency in these
>> trials, but that's fine as I'm mainly interested in which of the two
>> implementations of the iterators is more performant.
>>
>> Similarly, could the same approach be used to compare the performance on
>> SSD vs hard disk?
>>
>> Thanks,
>>
>> Dave.
>>
>>
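
The start-timer / scan / stop-timer steps in the original question could be wrapped in a small harness along these lines. This is a sketch under stated assumptions, not a tested benchmark: `ScanTimer`, `medianNanos` and `busyWork` are hypothetical names, and each `Runnable` is a stand-in for the real work (creating a BatchScanner on the Mini cluster, attaching one iterator stack, setting the ranges and iterating over every result). The warm-up runs and the median are additions intended to reduce the influence of JIT compilation and caching on the comparison.

```java
import java.util.Arrays;

class ScanTimer {
    static volatile long sink; // keeps the placeholder work from being optimized away

    // Runs the workload `warmups` times untimed, then `runs` times timed,
    // and returns the (upper) median elapsed time in nanoseconds.
    static long medianNanos(Runnable workload, int warmups, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be >= 1");
        }
        for (int i = 0; i < warmups; i++) {
            workload.run();
        }
        long[] elapsed = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            workload.run();
            elapsed[i] = System.nanoTime() - start;
        }
        Arrays.sort(elapsed);
        return elapsed[runs / 2];
    }

    // Placeholder CPU work standing in for a full scan with one iterator stack.
    static void busyWork(int n) {
        long s = 0;
        for (int i = 0; i < n; i++) {
            s += i % 7;
        }
        sink = s;
    }

    public static void main(String[] args) {
        // Assumption: in a real trial each Runnable would scan the same ranges
        // over the same bulk-imported data, differing only in the iterator stack.
        Runnable stackA = () -> busyWork(2_000_000);
        Runnable stackB = () -> busyWork(1_000_000);

        long a = medianNanos(stackA, 3, 9);
        long b = medianNanos(stackB, 3, 9);
        System.out.printf("A median=%dus, B median=%dus%n", a / 1_000, b / 1_000);
    }
}
```

Alternating the order of the two stacks between trials would be one way to check that whichever runs second is not simply benefiting from a warmer cache.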
