Message-ID: <537B9188.9010302@gmail.com>
Date: Tue, 20 May 2014 13:31:52 -0400
From: Josh Elser
To: user@accumulo.apache.org
Subject: Re: Improving Batchscanner Performance

Hi David,

Absolutely. What you have here is a classic producer-consumer model. Your BatchScanner produces results, which your Scanner then consumes, and you ultimately return those results to the client.

The problem with your implementation below is that you're not going to be polling your BatchScanner as aggressively as you could be. You block while fetching each of those new Ranges from the Scanner before fetching any further ranges.

Have you considered splitting the BatchScanner and Scanner code into two different threads? You could easily use an ArrayBlockingQueue (or similar) to pass results from the BatchScanner to the Scanner. I would imagine that this would give you a fair improvement in performance.

Also, there doesn't appear to be a reason you can't use a BatchScanner for both lookups.

One final warning: your current implementation could also hog heap very badly if your BatchScanner returns too many records.
The producer/consumer split I proposed should help here a little bit, but you should still enforce upper bounds to avoid running out of heap space in your client.

On 5/20/14, 1:10 PM, Slater, David M. wrote:
> Hey everyone,
>
> I'm trying to improve the query performance of batchscans on my data table. I first scan over index tables, which returns a set of rowIDs that correspond to the records I am interested in. This set of records is fairly randomly (and uniformly) distributed across a large number of tablets, due to the randomness of the UID and the query itself. Then I want to scan over my data table, which is set up as follows:
>
> row       colFam    colQual    value
> rowUID    --        --         byte[] of data
>
> These records are fairly small (100s of bytes), but numerous (I may return 50,000 or more). The method I use to obtain this follows. Essentially, I turn the rows returned from the first query into a set of ranges to input into the batchscanner, and then return those rows, retrieving the value from them.
>
> // returns the data associated with the given collection of rows
> public Collection<byte[]> getRowData(Collection<Text> rows, Text dataType, String tablename, int queryThreads) throws TableNotFoundException {
>     List<byte[]> values = new ArrayList<byte[]>(rows.size());
>     if (!rows.isEmpty()) {
>         BatchScanner scanner = conn.createBatchScanner(tablename, new Authorizations(), queryThreads);
>         List<Range> ranges = new ArrayList<Range>();
>         for (Text row : rows) {
>             ranges.add(new Range(row));
>         }
>         scanner.setRanges(ranges);
>         for (Map.Entry<Key, Value> entry : scanner) {
>             values.add(entry.getValue().get());
>         }
>         scanner.close();
>     }
>     return values;
> }
>
> Is there a more efficient way to do this? I have index caches and bloom filters enabled (data caches are not), but I still seem to have a long query lag. Any thoughts on how I can improve this?
>
> Thanks,
> David
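
For reference, the two-thread split suggested in the reply could look roughly like the sketch below. It is only an illustration under stated assumptions, not code from either message: the class name ScanPipelineSketch, the run method, and its parameters are hypothetical, a plain Iterator stands in for the first scan over the index table, and a Function stands in for the second lookup, so no Accumulo classes are involved. The bounded ArrayBlockingQueue is what keeps the producer from running arbitrarily far ahead of the consumer.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

// Hypothetical sketch: names and signatures are illustrative, not Accumulo API.
class ScanPipelineSketch {
    // End-of-stream marker, compared by identity so it cannot collide with real data.
    private static final Object DONE = new Object();

    /**
     * Drains firstScan on a producer thread while the calling thread applies
     * secondLookup to each item. At most 'capacity' items wait in the queue.
     */
    static <K, V> List<V> run(Iterator<K> firstScan, Function<K, V> secondLookup, int capacity) {
        BlockingQueue<Object> queue = new ArrayBlockingQueue<>(capacity);

        Thread producer = new Thread(() -> {
            try {
                while (firstScan.hasNext()) {
                    queue.put(firstScan.next()); // blocks while the queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                try {
                    queue.put(DONE); // tell the consumer that no more items will arrive
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        producer.setDaemon(true); // do not keep the JVM alive if the consumer fails
        producer.start();

        List<V> results = new ArrayList<>();
        try {
            // Single producer and single consumer, so FIFO order is preserved.
            for (Object item = queue.take(); item != DONE; item = queue.take()) {
                @SuppressWarnings("unchecked")
                K key = (K) item;
                results.add(secondLookup.apply(key));
            }
            producer.join();
        } catch (InterruptedException e) {
            producer.interrupt();
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while consuming", e);
        }
        return results;
    }
}
```

In a real client the iterator would presumably wrap the first BatchScanner and the function would perform the second lookup; those bindings are left out here because they depend on the actual table layout. Note that the queue only bounds the items in flight between the two threads. The results list still grows with the number of records, so a separate upper bound on it would be needed to address the heap warning above.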