From: Josh Elser
Date: Wed, 14 Sep 2016 10:04:20 -0400
Subject: Re: Accumulo Seek performance
To: user@accumulo.apache.org

Nope! My test harness (the github repo) doesn't show any noticeable
difference between BatchScanner and Scanner. Would have to do more
digging with Sven to figure out what's happening.

One takeaway: the lack of metrics telling us what is actually happening
is a major defect, imo.

On Sep 14, 2016 9:53 AM, "Dylan Hutchison" wrote:

> Do we have a (hopefully reproducible) conclusion from this thread,
> regarding Scanners and BatchScanners?
>
> On Sep 13, 2016 11:17 PM, "Josh Elser" wrote:
>
>> Yeah, this seems to have been OS X causing me grief.
>>
>> Spun up a 3-tserver cluster (on openstack, even) and reran the same
>> experiment. I could not reproduce the issues, even without substantial
>> config tweaking.
>>
>> Josh Elser wrote:
>>
>>> I'm playing around with this a little more today and something is
>>> definitely weird on my local machine. I'm seeing insane spikes in
>>> performance using Scanners too.
>>>
>>> Coupled with Keith's inability to repro this, I am starting to think
>>> that these are not worthwhile numbers to put weight behind. Something
>>> I haven't been able to figure out is quite screwy for me.
>>>
>>> Josh Elser wrote:
>>>
>>>> Sven, et al:
>>>>
>>>> So, it would appear that I have been able to reproduce this one
>>>> (better late than never, I guess...). tl;dr Serially using Scanners
>>>> to do point lookups instead of a BatchScanner is ~20x faster. This
>>>> sounds like a pretty serious performance issue to me.
>>>>
>>>> Here's a general outline of what I did.
>>>>
>>>> * Accumulo 1.8.0
>>>> * Created a table with 1M rows, each row with 10 columns, using YCSB
>>>> (workloada)
>>>> * Split the table into 9 tablets
>>>> * Computed the set of all rows in the table
>>>>
>>>> For a number of iterations:
>>>> * Shuffle this set of rows
>>>> * Choose the first N rows
>>>> * Construct an equivalent set of Ranges from the set of rows,
>>>> choosing a random column (0-9)
>>>> * Partition the N rows into X collections
>>>> * Submit X tasks to query one partition of the N rows (to a thread
>>>> pool with X fixed threads)
>>>>
>>>> I have two implementations of these tasks: one where all ranges in a
>>>> partition are executed via one BatchScanner, and a second where each
>>>> range is executed serially using a Scanner. The numbers speak for
>>>> themselves (log output below).
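A minimal sketch of the two task variants described above, assuming the
Accumulo 1.8 client API; the class, table name, and entry counting are
hypothetical stand-ins, and the actual harness is in the repo linked below:

    import java.util.List;
    import java.util.Map.Entry;

    import org.apache.accumulo.core.client.BatchScanner;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;

    public class QueryTasks {
      // Hypothetical table name (YCSB loads "usertable" by default).
      static final String TABLE = "usertable";

      // Variant 1: all ranges in a partition through a single BatchScanner.
      static long batchScannerTask(Connector conn, List<Range> partition)
          throws Exception {
        long count = 0;
        // Query-thread count per BatchScanner is a guess here; the
        // harness's actual setting may differ.
        try (BatchScanner bs =
            conn.createBatchScanner(TABLE, Authorizations.EMPTY, 1)) {
          bs.setRanges(partition);
          for (Entry<Key,Value> entry : bs) {
            count++; // drain every returned entry
          }
        }
        return count;
      }

      // Variant 2: each range executed serially with a plain Scanner.
      static long scannerTask(Connector conn, List<Range> partition)
          throws Exception {
        long count = 0;
        for (Range r : partition) {
          try (Scanner s = conn.createScanner(TABLE, Authorizations.EMPTY)) {
            s.setRange(r);
            for (Entry<Key,Value> entry : s) {
              count++; // drain every returned entry
            }
          }
        }
        return count;
      }
    }

Each task would then be submitted to the fixed-size pool, e.g.
executor.submit(() -> scannerTask(conn, partition)).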
>>>>
>>>> ** BatchScanners **
>>>> 2016-09-10 17:51:38,811 [joshelser.YcsbBatchScanner] INFO : Shuffled all rows
>>>> 2016-09-10 17:51:38,843 [joshelser.YcsbBatchScanner] INFO : All ranges calculated: 3000 ranges found
>>>> 2016-09-10 17:51:38,846 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>>> 2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Queries executed in 40178 ms
>>>> 2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>>> 2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Queries executed in 42296 ms
>>>> 2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>>> 2016-09-10 17:53:47,414 [joshelser.YcsbBatchScanner] INFO : Queries executed in 46094 ms
>>>> 2016-09-10 17:53:47,415 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>>> 2016-09-10 17:54:35,118 [joshelser.YcsbBatchScanner] INFO : Queries executed in 47704 ms
>>>> 2016-09-10 17:54:35,119 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>>> 2016-09-10 17:55:24,339 [joshelser.YcsbBatchScanner] INFO : Queries executed in 49221 ms
>>>>
>>>> ** Scanners **
>>>> 2016-09-10 17:57:23,867 [joshelser.YcsbBatchScanner] INFO : Shuffled all rows
>>>> 2016-09-10 17:57:23,898 [joshelser.YcsbBatchScanner] INFO : All ranges calculated: 3000 ranges found
>>>> 2016-09-10 17:57:23,903 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>>> 2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2833 ms
>>>> 2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>>> 2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2536 ms
>>>> 2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>>> 2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2150 ms
>>>> 2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>>> 2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2061 ms
>>>> 2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>>> 2016-09-10 17:57:35,628 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2140 ms
>>>>
>>>> Query code is available at
>>>> https://github.com/joshelser/accumulo-range-binning
>>>>
>>>> Sven Hodapp wrote:
>>>>
>>>>> Hi Keith,
>>>>>
>>>>> I've tried it with 1, 2, or 10 threads. Unfortunately, there were no
>>>>> notable differences.
>>>>> Maybe it's a problem with the table structure? For example, it may
>>>>> happen that one row id (e.g. a sentence) has several thousand column
>>>>> families. Can this affect the seek performance?
>>>>>
>>>>> My initial example has about 3000 row ids to seek, which return
>>>>> about 500k entries. If I filter for specific column families (e.g. a
>>>>> document without annotations), only about 5k entries are returned,
>>>>> but the seek time is only halved.
>>>>> Are there too many column families to seek fast?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Regards,
>>>>> Sven
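Regarding the column-family filtering Sven describes: the usual
client-side expression is ScannerBase.fetchColumnFamily(), which pushes
the filter to the tablet servers. A minimal sketch, with hypothetical
class, table, row, and family names:

    import java.util.Map.Entry;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.io.Text;

    public class FamilyFilterSketch {
      // Scan a single row id, fetching only one column family.
      static long scanOneFamily(Connector conn, String table, String row,
          String family) throws Exception {
        long count = 0;
        try (Scanner s = conn.createScanner(table, Authorizations.EMPTY)) {
          s.setRange(new Range(new Text(row)));   // the whole row
          s.fetchColumnFamily(new Text(family));  // server-side family filter
          for (Entry<Key,Value> entry : s) {
            count++;
          }
        }
        return count;
      }
    }

Note that even with the filter, entries from other families in the same
row still have to be read past on the server unless those families are
separated into locality groups, which may explain why filtering only
halves the seek time in Sven's test.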