Reply-To: user@accumulo.apache.org
Message-ID: <537B9188.9010302@gmail.com>
Date: Tue, 20 May 2014 13:31:52 -0400
From: Josh Elser <josh.elser@gmail.com>
To: user@accumulo.apache.org
Subject: Re: Improving Batchscanner Performance
References: 
 <AC78983C72177B4D9D1C14F7F4AEBA2144C9A23936@aplesstripe.dom1.jhuapl.edu>
In-Reply-To: 
 <AC78983C72177B4D9D1C14F7F4AEBA2144C9A23936@aplesstripe.dom1.jhuapl.edu>

Hi David,

Absolutely. What you have here is a classic producer-consumer model.
Your BatchScanner produces results, your Scanner consumes them, and you
ultimately return those results to the client.

The problem with the implementation below is that you're not polling
your BatchScanner as aggressively as you could be. You block until each
of those new Ranges has been fetched from the Scanner before you fetch
any new ranges. Have you considered splitting the BatchScanner and
Scanner code into two different threads?

You could easily use an ArrayBlockingQueue (or similar) to pass results
from the BatchScanner to the Scanner. I would imagine that this would
give you a fair improvement in performance.
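
A minimal sketch of that hand-off, using only java.util.concurrent. The
two lookups are stood in for by a plain Iterable<String> and a
Function<String, String>; those stand-ins, the class name and the
50 ms poll interval are assumptions for illustration, not Accumulo API.
In real code the producer thread would iterate one scanner and the
consumer would drive the other.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

public class ScanPipelineSketch {

    // indexResults and dataLookup are hypothetical stand-ins for the two scans.
    public static List<String> run(Iterable<String> indexResults,
                                   Function<String, String> dataLookup,
                                   int queueCapacity) {
        // Bounded queue: the producer blocks when the consumer falls behind.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(queueCapacity);

        Thread producer = new Thread(() -> {
            try {
                for (String rowId : indexResults) {
                    queue.put(rowId); // blocks while the queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "index-producer");
        producer.start();

        List<String> values = new ArrayList<>();
        try {
            while (true) {
                String rowId = queue.poll(50, TimeUnit.MILLISECONDS);
                if (rowId != null) {
                    values.add(dataLookup.apply(rowId));
                } else if (!producer.isAlive() && queue.isEmpty()) {
                    break; // producer finished and nothing is left to drain
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while consuming", e);
        } finally {
            // Unblock the producer if the consumer stopped early; harmless otherwise.
            producer.interrupt();
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> rowIds = Arrays.asList("row1", "row2", "row3");
        System.out.println(run(rowIds, id -> "value-of-" + id, 2));
    }
}
```

The queue capacity is the knob here: a small capacity keeps memory flat,
a larger one lets the producer run further ahead of the consumer.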

Also, is there a reason you can't use a BatchScanner for both lookups?
I don't see one.

One final warning: your current implementation could also hog heap very
badly if your BatchScanner returns too many records. The
producer/consumer approach I proposed should help here a little, but
you should still enforce upper bounds to avoid running out of heap
space in your client.
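
One way to read "upper bounds" is a hard cap on how many results the
client will buffer. A rough sketch follows; the class name, the method
name and the choice to throw rather than truncate are all assumptions
on my part, and the Iterator stands in for whatever scanner you are
draining.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class BoundedCollectSketch {

    // Collects results, failing fast instead of buffering past maxResults.
    public static <T> List<T> collectAtMost(Iterator<T> results, int maxResults) {
        List<T> collected = new ArrayList<>();
        while (results.hasNext()) {
            if (collected.size() >= maxResults) {
                throw new IllegalStateException(
                    "more than " + maxResults + " results; narrow the query or process in batches");
            }
            collected.add(results.next());
        }
        return collected;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a", "b", "c");
        System.out.println(collectAtMost(rows.iterator(), 5).size());
    }
}
```

Throwing makes the limit visible to the caller; silently truncating
would be the alternative if partial results are acceptable.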

On 5/20/14, 1:10 PM, Slater, David M. wrote:
> Hey everyone,
>
> I'm trying to improve the query performance of batchscans on my data table. I first scan over index tables, which returns a set of rowIDs that correspond to the records I am interested in. This set of records is fairly randomly (and uniformly) distributed across a large number of tablets, due to the randomness of the UID and the query itself. Then I want to scan over my data table, which is set up as follows:
> row       colFam    colQual    value
> rowUID    --        --         byte[] of data
>
> These records are fairly small (100s of bytes), but numerous (I may return 50000 or more). The method I use to obtain this follows. Essentially, I turn the rows returned from the first query into a set of ranges to input into the batchscanner, and then return those rows, retrieving the value from them.
>
> // Returns the data associated with the given collection of rows.
>      public Collection<byte[]> getRowData(Collection<Text> rows, Text dataType, String tablename, int queryThreads) throws TableNotFoundException {
>          List<byte[]> values = new ArrayList<byte[]>(rows.size());
>          if (!rows.isEmpty()) {
>              BatchScanner scanner = conn.createBatchScanner(tablename, new Authorizations(), queryThreads);
>              try {
>                  // One single-row Range per row ID returned by the index lookup.
>                  List<Range> ranges = new ArrayList<Range>(rows.size());
>                  for (Text row : rows) {
>                      ranges.add(new Range(row));
>                  }
>                  scanner.setRanges(ranges);
>                  for (Map.Entry<Key, Value> entry : scanner) {
>                      values.add(entry.getValue().get());
>                  }
>              } finally {
>                  // Release the scanner's resources even if the scan throws.
>                  scanner.close();
>              }
>          }
>          return values;
>      }
>
> Is there a more efficient way to do this? I have index caches and bloom filters enabled (data caches are not), but I still seem to have a long query lag. Any thoughts on how I can improve this?
>
> Thanks,
> David
>