From: Mario Pastorelli <mario.pastorelli@teralytics.ch>
To: user@accumulo.apache.org
Date: Fri, 20 May 2016 08:53:11 +0200
Subject: Re: Feedback about techniques for tuning batch scanning for my problem
List-Id: user@accumulo.apache.org
We haven't, thanks for the tips.

On Thu, May 19, 2016 at 5:53 PM, Marc Reichman wrote:

> Hi Mario,
>
> Not sure where this plays into your data integrity, but have you looked
> into these settings in hdfs-site.xml?
>
>     dfs.client.read.shortcircuit
>     dfs.client.read.shortcircuit.skip.checksum
>     dfs.domain.socket.path
>
> These make for a somewhat dramatic increase in HDFS read performance if
> the data is distributed well enough around.
>
> I can't speak as much to the scanner params, but you may look into these
> as well.
>
> Marc
>
> On Thu, May 19, 2016 at 10:08 AM, Mario Pastorelli
> <mario.pastorelli@teralytics.ch> wrote:
>
>> Hey people,
>> I'm trying to tune the query performance a bit to see how fast it can
>> go, and I thought it would be great to have comments from the community.
>> The problem I'm trying to solve in Accumulo is the following: we want to
>> store the entities that have been in a certain location on a certain
>> day. The location is a Long and the entity id is a Long. I want to be
>> able to scan ~1M rows in a few seconds, possibly less than one. Right
>> now, I'm doing the following things:
>>
>> 1. I'm using a sharding byte at the start of the rowId to keep the
>>    data in the same range distributed across the cluster.
>> 2. All the records are encoded; a single record is composed of:
>>    1. rowId: 1 shard byte + 3 bytes for the day
>>    2. column family: 8 bytes for the long corresponding to the hash
>>       of the location
>>    3. column qualifier: 8 bytes corresponding to the identifier of
>>       the entity
>>    4. value: 2 bytes for some additional information
>> 3. I use a batch scanner because I don't need sorting and it's faster.
>>
>> As expected, it takes a few seconds to scan 1M rows, but now I'm
>> wondering if I can improve on that. My ideas are the following:
>>
>> 1. Set table.compaction.major.ratio to 1, because I don't care about
>>    ingestion performance and this should improve query performance.
>> 2. Pre-split tables to match the number of servers and then use a
>>    byte of shard as the first byte of the rowId. This should improve
>>    both writing and reading the data because, as I understand it, both
>>    should then work in parallel.
>> 3. Enable the bloom filter on the table.
>>
>> Do you think those ideas make sense? Furthermore, I have two questions:
>>
>> 1. Considering that a single entry is only 22 bytes but I'm going to
>>    scan ~1M records per query, do you think I should change the
>>    BatchScanner buffers somehow?
>> 2. Anything else to improve the scan speed? Again, I don't care about
>>    the ingestion time.
>>
>> Thanks for the help!
>>
>> --
>> Mario Pastorelli | TERALYTICS
>>
>> *software engineer*
>>
>> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
>> phone: +41794381682
>> email: mario.pastorelli@teralytics.ch
>> www.teralytics.net
>>
>> Company registration number: CH-020.3.037.709-7 | Trade register
>> Canton Zurich
>> Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz,
>> Yann de Vries
>>
>> This e-mail message contains confidential information which is for the
>> sole attention and use of the intended recipient. Please notify us at
>> once if you think that it may not be intended for you and delete it
>> immediately.
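[Editor's note] The 22-byte entry layout quoted in the thread (1 shard byte + 3 day bytes in the rowId, an 8-byte location hash as column family, an 8-byte entity id as column qualifier, and a 2-byte value) could be sketched as below. This is a minimal illustration, not code from the thread: the class and method names are hypothetical, and NUM_SHARDS is an assumed value that would be chosen to match the number of pre-split tablets.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the 22-byte entry layout described in the thread.
// All names and the NUM_SHARDS value are illustrative assumptions.
public class EntityKeyCodec {
    static final int NUM_SHARDS = 16; // assumed: one shard per pre-split tablet

    // rowId: 1 shard byte + 3 bytes for the day (days since epoch fit in 3 bytes)
    static byte[] rowId(int daysSinceEpoch, long locationHash) {
        byte[] row = new byte[4];
        // sharding byte derived from the location hash, always in [0, NUM_SHARDS)
        row[0] = (byte) Math.floorMod(locationHash, (long) NUM_SHARDS);
        row[1] = (byte) (daysSinceEpoch >>> 16);
        row[2] = (byte) (daysSinceEpoch >>> 8);
        row[3] = (byte) daysSinceEpoch;
        return row;
    }

    // column family: 8 bytes for the hash of the location
    static byte[] columnFamily(long locationHash) {
        return ByteBuffer.allocate(8).putLong(locationHash).array();
    }

    // column qualifier: 8 bytes for the entity identifier
    static byte[] columnQualifier(long entityId) {
        return ByteBuffer.allocate(8).putLong(entityId).array();
    }

    // value: 2 bytes of additional information
    static byte[] value(short extra) {
        return ByteBuffer.allocate(2).putShort(extra).array();
    }
}
```

With this layout an entry is 4 + 8 + 8 + 2 = 22 bytes, matching the figure in the thread, and pre-splitting the table at the NUM_SHARDS shard-byte boundaries would line the tablets up with the shards.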