Subject: Optimize Accumulo scan speed
From: Mario Pastorelli
To: user@accumulo.apache.org
Date: Sun, 10 Apr 2016 17:05:09 +0200

Hi,

I'm currently having some scan speed issues with Accumulo and I would like to understand why and how I can solve them. I have geographical data, and I use as primary key the day and then the geohex, which is a linearisation of lat and lon.
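A minimal sketch of the key scheme described above (this is a hypothetical illustration, not the actual code; the key layout, padding width, and the `KeyRange` type are all assumptions, and plain strings stand in for Accumulo `Range` objects so the example stays self-contained):

```java
import java.util.ArrayList;
import java.util.List;

public class GeohexKeys {

    /** Row key = day then geohex; zero-padding keeps lexicographic order
     *  consistent with numeric order ("10" must not sort before "9"). */
    static String rowKey(String day, long geohex) {
        return String.format("%s_%012d", day, geohex);
    }

    /** [startInclusive, endInclusive] row-key pair standing in for an Accumulo Range. */
    record KeyRange(String start, String end) {}

    /**
     * Merge a sorted list of geohexes for one day into as few contiguous
     * ranges as possible -- contiguous runs of geohexes collapse into a
     * single range, which is what keeps the number of seeks low.
     */
    static List<KeyRange> zoneRanges(String day, long[] sortedGeohexes) {
        List<KeyRange> ranges = new ArrayList<>();
        int i = 0;
        while (i < sortedGeohexes.length) {
            int j = i;
            // Extend the run while the geohexes are consecutive integers.
            while (j + 1 < sortedGeohexes.length
                    && sortedGeohexes[j + 1] == sortedGeohexes[j] + 1) {
                j++;
            }
            ranges.add(new KeyRange(rowKey(day, sortedGeohexes[i]),
                                    rowKey(day, sortedGeohexes[j])));
            i = j + 1;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // A zone of 5 geohexes with one gap -> 2 ranges (2 seeks instead of 5).
        long[] zone = {100, 101, 102, 200, 201};
        for (KeyRange r : zoneRanges("20160410", zone)) {
            System.out.println(r.start() + " .. " + r.end());
        }
    }
}
```

In a real client these ranges would be handed to a BatchScanner for the table, one `Range` per merged run.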
The reason for this key is that I always query the data for one day but for a set of geohexes which represent a zone, so with this schema I can use a single scan to read all the data for one day with few seeks. My problem is that the scan is painfully slow: for instance, reading 5617019 rows takes around 17 seconds, with a scan speed of 13MB/s, less than 750k entries/s scanned, and around 300 seeks. I enabled the tracer and this is what I've got:

17325+0 Dice@srv1 Dice.query
11+1 Dice@srv1 scan
11+1 Dice@srv1 scan:location
5+13 Dice@srv1 scan
5+13 Dice@srv1 scan:location
4+19 Dice@srv1 scan
4+19 Dice@srv1 scan:location
5+23 Dice@srv1 scan
4+24 Dice@srv1 scan:location

I'm not sure how to speed up the scanning. I have the following questions:
  - is this speed normal?
  - can I involve more servers in the scan? Right now only two servers have the ranges, but with a cluster of 15 machines it would be nice to involve more of them. Is it possible?

Thanks,
Mario

--
Mario Pastorelli | TERALYTICS
software engineer

Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
phone: +41794381682
email: mario.pastorelli@teralytics.ch
www.teralytics.net

Company registration number: CH-020.3.037.709-7 | Trade register Canton Zurich
Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz, Yann de Vries

This e-mail message contains confidential information which is for the sole attention and use of the intended recipient. Please notify us at once if you think that it may not be intended for you and delete it immediately.