From: Mario Pastorelli <mario.pastorelli@teralytics.ch>
To: user@accumulo.apache.org
Date: Fri, 20 May 2016 08:53:11 +0200
Subject: Re: Feedback about techniques for tuning batch scanning for my problem
List-Id: user@accumulo.apache.org
We haven't, thanks for the tips.

On Thu, May 19, 2016 at 5:53 PM, Marc Reichman wrote:

> Hi Mario,
>
> Not sure where this plays into your data integrity, but have you looked
> into these settings in hdfs-site.xml?
>
>     dfs.client.read.shortcircuit
>     dfs.client.read.shortcircuit.skip.checksum
>     dfs.domain.socket.path
>
> These make for a somewhat dramatic increase in HDFS read performance if
> the data is distributed well enough around.
>
> I can't speak as much to the scanner params, but you may look into these
> as well.
>
> Marc
>
> On Thu, May 19, 2016 at 10:08 AM, Mario Pastorelli
> <mario.pastorelli@teralytics.ch> wrote:
>
>> Hey people,
>> I'm trying to tune the query performance a bit to see how fast it can
>> go, and I thought it would be great to have comments from the community.
>> The problem I'm trying to solve in Accumulo is the following: we want to
>> store the entities that have been in a certain location on a certain
>> day. The location is a Long and the entity id is a Long. I want to be
>> able to scan ~1M rows in a few seconds, possibly less than one. Right
>> now, I'm doing the following things:
>>
>> 1. I'm using a sharding byte at the start of the rowId to keep the
>>    data in the same range distributed across the cluster.
>> 2. All the records are encoded; a single record is composed of:
>>    1. rowId: 1 shard byte + 3 bytes for the day
>>    2. column family: 8 bytes for the long corresponding to the hash
>>       of the location
>>    3. column qualifier: 8 bytes corresponding to the identifier of
>>       the entity
>>    4. value: 2 bytes for some additional information
>> 3. I use a batch scanner because I don't need sorting and it's faster.
>>
>> As expected, it takes a few seconds to scan 1M rows, but now I'm
>> wondering if I can improve on that. My ideas are the following:
>>
>> 1. Set table.compaction.major.ratio to 1, because I don't care about
>>    ingestion performance and this should improve query performance.
>> 2. Pre-split tables to match the number of servers and then use a
>>    byte of shard as the first byte of the rowId. This should improve
>>    both writing and reading the data because, as I understand it, both
>>    should then work in parallel.
>> 3. Enable the bloom filter on the table.
>>
>> Do you think those ideas make sense? Furthermore, I have two questions:
>>
>> 1. Considering that a single entry is only 22 bytes but I'm going to
>>    scan ~1M records per query, do you think I should change the
>>    BatchScanner buffers somehow?
>> 2. Anything else to improve the scan speed? Again, I don't care about
>>    the ingestion time.
>>
>> Thanks for the help!
>>
>> --
>> Mario Pastorelli | TERALYTICS
>>
>> *software engineer*
>>
>> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
>> phone: +41794381682
>> email: mario.pastorelli@teralytics.ch
>> www.teralytics.net
>>
>> Company registration number: CH-020.3.037.709-7 | Trade register
>> Canton Zurich
>> Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz,
>> Yann de Vries
>>
>> This e-mail message contains confidential information which is for the
>> sole attention and use of the intended recipient. Please notify us at
>> once if you think that it may not be intended for you and delete it
>> immediately.
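[Editor's note] The 22-byte entry layout quoted in the thread (1 shard byte + 3 day bytes in the rowId, an 8-byte location hash as column family, an 8-byte entity id as column qualifier, and a 2-byte value) could be sketched as below. This is a minimal illustration, not code from the thread: the class and method names are hypothetical, and NUM_SHARDS is an assumed value that would be chosen to match the number of pre-split tablets.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the 22-byte entry layout described in the thread.
// All names and the NUM_SHARDS value are illustrative assumptions.
public class EntityKeyCodec {
    static final int NUM_SHARDS = 16; // assumed: one shard per pre-split tablet

    // rowId: 1 shard byte + 3 bytes for the day (days since epoch fit in 3 bytes)
    static byte[] rowId(int daysSinceEpoch, long locationHash) {
        byte[] row = new byte[4];
        // sharding byte derived from the location hash, always in [0, NUM_SHARDS)
        row[0] = (byte) Math.floorMod(locationHash, (long) NUM_SHARDS);
        row[1] = (byte) (daysSinceEpoch >>> 16);
        row[2] = (byte) (daysSinceEpoch >>> 8);
        row[3] = (byte) daysSinceEpoch;
        return row;
    }

    // column family: 8 bytes for the hash of the location
    static byte[] columnFamily(long locationHash) {
        return ByteBuffer.allocate(8).putLong(locationHash).array();
    }

    // column qualifier: 8 bytes for the entity identifier
    static byte[] columnQualifier(long entityId) {
        return ByteBuffer.allocate(8).putLong(entityId).array();
    }

    // value: 2 bytes of additional information
    static byte[] value(short extra) {
        return ByteBuffer.allocate(2).putShort(extra).array();
    }
}
```

With this layout an entry is 4 + 8 + 8 + 2 = 22 bytes, matching the figure in the thread, and pre-splitting the table at the NUM_SHARDS shard-byte boundaries would line the tablets up with the shards.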