Subject: Re: Optimize Accumulo scan speed
From: Andrew Hulbert
To: user@accumulo.apache.org
Date: Sun, 10 Apr 2016 11:25:03 -0400

Mario,

Are you using a Scanner or a BatchScanner?

One thing we did in the past with a geohash-based schema was to prefix a shard ID to the geohash, which allows you to involve all the tservers in the scan. You'd multiply your ranges by the number of tservers you have, but if the client is not the bottleneck, it may increase your throughput.
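
Roughly what I mean, as an untested sketch; the shard count, key format, table, and column names here are just illustrative, not something from your schema:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class ShardedGeohexSketch {

  static final int NUM_SHARDS = 15; // e.g. roughly one shard per tserver

  // Write side: prepend a zero-padded shard ID so one day's data is spread
  // across many tablets (and therefore many tservers).
  static void write(Connector conn, String table, String day, String geohex, byte[] value)
      throws Exception {
    int shard = Math.abs(geohex.hashCode()) % NUM_SHARDS;
    String row = String.format("%02d_%s_%s", shard, day, geohex);
    BatchWriter writer = conn.createBatchWriter(table, new BatchWriterConfig());
    Mutation m = new Mutation(row);
    m.put("cf", "cq", new Value(value));
    writer.addMutation(m);
    writer.close();
  }

  // Read side: one range per (shard, geohex) pair, all handed to a BatchScanner
  // so every tserver hosting a shard participates in the scan.
  static void query(Connector conn, String table, String day, List<String> geohexes)
      throws Exception {
    List<Range> ranges = new ArrayList<>();
    for (int s = 0; s < NUM_SHARDS; s++) {
      for (String hex : geohexes) {
        ranges.add(Range.prefix(String.format("%02d_%s_%s", s, day, hex)));
      }
    }
    BatchScanner bs = conn.createBatchScanner(table, Authorizations.EMPTY, 10);
    bs.setRanges(ranges);
    for (Map.Entry<Key, Value> entry : bs) {
      // process entry.getKey() / entry.getValue()
    }
    bs.close();
  }
}

With 15 shards every geohex range becomes 15 ranges, but the BatchScanner fans them out in parallel, so as long as the client can keep up you get more tservers working on the same query.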

Andrew

On 04/10/2016 11:05 AM, Mario Pastorelli wrote:
Hi,

I'm currently having some scan speed issues with Accumulo and I would like to understand why and how I can solve them. I have geographical data and I use as primary key the day and then the geohex, which is a linearisation of lat and lon. The reason for this key is that I always query the data for one day but for a set of geohexes which represent a zone, so with this schema I can use a single scan to read all the data for one day with few seeks (a rough sketch of this layout follows the trace below). My problem is that the scan is painfully slow: for instance, to read 5617019 rows it takes around 17 seconds, the scan speed is 13MB/s, less than 750k scan entries/s, and around 300 seeks. I enabled the tracer and this is what I've got:

17325+0 Dice@srv1 Dice.query
11+1 Dice@srv1 scan
11+1 Dice@srv1 scan:location
5+13 Dice@srv1 scan
5+13 Dice@srv1 scan:location
4+19 Dice@srv1 scan
4+19 Dice@srv1 scan:location
5+23 Dice@srv1 scan
4+24 Dice@srv1 scan:location
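
For illustration only, the layout described above (day first, then geohex, no shard prefix) might be built and queried roughly like this; the names and key format are guesses rather than Mario's actual code, and the imports and Connector setup are the same as in the sketch earlier in the thread:

static void queryZone(Connector conn, String table, String day, List<String> zoneGeohexes)
    throws Exception {
  // Row key layout: day first, then geohex, e.g. "20160410_<geohex>", so one
  // day's zone is a handful of contiguous ranges.
  List<Range> ranges = new ArrayList<>();
  for (String hex : zoneGeohexes) {
    ranges.add(Range.prefix(day + "_" + hex));
  }
  BatchScanner bs = conn.createBatchScanner(table, Authorizations.EMPTY, 10);
  bs.setRanges(ranges);
  for (Map.Entry<Key, Value> entry : bs) {
    // process entry.getKey() / entry.getValue()
  }
  bs.close();
}

All of these ranges fall on whichever tservers host that day's tablets, which is why only a couple of servers end up doing the work.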
I'm not sure how to speed up the scanning. I have the following questions:
  - is this speed normal?
  - can I involve more servers in the scan? Right now only two servers have the ranges, but with a cluster of 15 machines it would be nice to involve more of them. Is it possible?

Thanks,
Mario

--
Mario Pastorelli | TERALYTICS

software engineer

Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland 
phone: +41794381682
email: mario.pastorelli@teralytics.ch

www.teralytics.net

Company registration number: CH-020.3.037.709-7 | Trade register Canton Zurich
Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz, Yann de Vries

This e-mail message contains confidential information which is for the sole attention and use of the intended recipient. Please notify us at once if you think that it may not be intended for you and delete it immediately.

