accumulo-user mailing list archives

From Steven Troxell <steven.trox...@gmail.com>
Subject BatchScanner vs. Scanner logic
Date Sun, 12 Aug 2012 21:49:54 GMT
Hi All,

I was wondering if someone would be willing to help evaluate my reasoning
on the use of Scanner vs. BatchScanner, and see if I'm making the proper
assumptions.

The background is I am attempting to benchmark an RDF application using
Accumulo by evaluating the impact of scaling on performance (measured by
query return time).

The scan patterns currently use the Scanner class, and each scan gets a
single row of data.  The table design/implementation is such that there is
never a need to simultaneously scan multiple non-adjacent rows.  One query
from the GUI should effectively result in a one-time, single-range scan.
The size of the data returned varies widely, from as few as 10 results to
millions.  The return order is not significant.

Reading the API suggests:  "If you want to lookup a few ranges and expect
those ranges to contain a lot of data, then use the Scanner instead", and
that the use of BatchScanner should be reserved for cases of simultaneously
wanting to use multiple ranges.  It additionally feels weird to be using a
BatchScanner on a "collection" of 1 range.

That said, my results so far show that scaling is not adding much:
performance peaks at 6 machines and drops beyond that.  This contradicts
the theoretical linear improvement I should be seeing.  To my
understanding, BatchScanner scans the tservers in parallel, and Scanner
does not.  Would it be reasonable to expect that using BatchScanner would
let me see scaling effects closer to what they should be?
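
To illustrate the distinction I have in mind, here is a minimal in-memory
sketch using only the Java standard library. It is not Accumulo code: the
class FanOutSketch and its methods are hypothetical stand-ins, and each
"server" is just a list of strings. It only shows the difference between
visiting servers one after another and fanning out to all of them at once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical model, not the Accumulo API.
class FanOutSketch {
    // Stand-in for reading one server's share of the range.
    static List<String> readServer(List<String> entries) {
        return new ArrayList<>(entries);
    }

    // Scanner-like access: servers are read one after another.
    static List<String> scanSequentially(List<List<String>> servers) {
        List<String> out = new ArrayList<>();
        for (List<String> server : servers) {
            out.addAll(readServer(server));
        }
        return out;
    }

    // BatchScanner-like access: one task per server, executed concurrently.
    // Results are gathered in task order here only for simplicity; the
    // caller should not rely on any particular order.
    static List<String> scanInParallel(List<List<String>> servers, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<List<String>>> tasks = new ArrayList<>();
            for (List<String> server : servers) {
                tasks.add(() -> readServer(server));
            }
            List<String> out = new ArrayList<>();
            for (Future<List<String>> f : pool.invokeAll(tasks)) {
                out.addAll(f.get());
            }
            return out;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Both methods return the same entries; only the access pattern differs.
Since return order does not matter for my queries, the parallel form would
be acceptable if it actually helps.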

My logic here is that I have X rows spread out across 10 machines.  Right
now, whether I'm using 1 machine or 10 machines, it is iteratively scanning
all rows.  If I used a BatchScanner, would I be guaranteed to reduce the
time to result to that of a lookup on 1 machine, instead of the average
case of 5 machines or the worst case of 10 (assuming uniform data
distribution and various other assumptions)?

I'm mainly questioning this because the API info on Scanner/BatchScanner
suggests Scanner is the desired choice for my application, but I don't see
how switching to BatchScanner, even if I'm perhaps not using it as
intended, wouldn't improve scaling potential.

Thanks in advance for thoughts/insight.
-Steve
