From: Dylan Hutchison
Date: Wed, 14 Sep 2016 06:52:42 -0700
Subject: Re: Accumulo Seek performance
To: Accumulo User List <user@accumulo.apache.org>

Do we have a (hopefully reproducible) conclusion from this thread, regarding Scanners and BatchScanners?


On Sep 13, 2016 11:17 PM, "Josh Elser" <josh.elser@gmail.com> wrote:
Yeah, this seems to have been OS X causing me grief.

Spun up a 3-tserver cluster (on OpenStack, even) and reran the same experiment. I could not reproduce the issues, even without substantial config tweaking.

Josh Elser wrote:
I'm playing around with this a little more today and something is
definitely weird on my local machine. I'm seeing insane spikes in
performance using Scanners too.

Coupled with Keith's inability to reproduce this, I am starting to think these are not numbers worth putting weight behind. Something I
haven't been able to figure out is quite screwy on my machine.

Josh Elser wrote:
Sven, et al:

So, it would appear that I have been able to reproduce this one (better
late than never, I guess...). tl;dr Serially using Scanners to do point
lookups instead of a BatchScanner is ~20x faster. This sounds like a
pretty serious performance issue to me.

Here's a general outline for what I did.

* Accumulo 1.8.0
* Created a table with 1M rows, each row with 10 columns using YCSB
(workloada)
* Split the table into 9 tablets
* Computed the set of all rows in the table

For a number of iterations:
* Shuffle this set of rows
* Choose the first N rows
* Construct an equivalent set of Ranges from the set of Rows, choosing a random column (0-9)
* Partition the N rows into X collections
* Submit X tasks to query one partition of the N rows (to a thread pool
with X fixed threads)
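
For anyone following along without the benchmark code, a minimal sketch of one iteration might look like the following. This is not the benchmark code itself: the allRows list, the "f" family and "field0".."field9" qualifiers, and the n/x knobs are placeholders for illustration, and error handling is omitted.

    import org.apache.accumulo.core.data.Range;
    import org.apache.hadoop.io.Text;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    // Hedged sketch: build one iteration's work from a precomputed row list.
    static List<List<Range>> buildPartitions(List<Text> allRows, int n, int x) {
      Collections.shuffle(allRows);                // shuffle the full row set
      List<Text> chosen = allRows.subList(0, n);   // take the first N rows

      Random rand = new Random();
      List<Range> ranges = new ArrayList<>(chosen.size());
      for (Text row : chosen) {
        // one random column (0-9) per row, as in the outline above
        Text qualifier = new Text("field" + rand.nextInt(10));
        ranges.add(Range.exact(row, new Text("f"), qualifier));
      }

      // partition the N ranges into X roughly equal collections
      List<List<Range>> partitions = new ArrayList<>();
      int chunk = (int) Math.ceil(ranges.size() / (double) x);
      for (int i = 0; i < ranges.size(); i += chunk) {
        partitions.add(new ArrayList<>(ranges.subList(i, Math.min(i + chunk, ranges.size()))));
      }
      return partitions;
    }

Each partition then becomes one task submitted to a fixed pool built with Executors.newFixedThreadPool(X).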

I have two implementations of these tasks: one where all ranges in a
partition are executed via one BatchScanner, and a second where each range is executed serially using its own Scanner. The numbers speak for themselves.

** BatchScanners **
2016-09-10 17:51:38,811 [joshelser.YcsbBatchScanner] INFO : Shuffled all rows
2016-09-10 17:51:38,843 [joshelser.YcsbBatchScanner] INFO : All ranges
calculated: 3000 ranges found
2016-09-10 17:51:38,846 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 40178 ms
2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 42296 ms
2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:53:47,414 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 46094 ms
2016-09-10 17:53:47,415 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:54:35,118 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 47704 ms
2016-09-10 17:54:35,119 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:55:24,339 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 49221 ms

** Scanners **
2016-09-10 17:57:23,867 [joshelser.YcsbBatchScanner] INFO : Shuffled all rows
2016-09-10 17:57:23,898 [joshelser.YcsbBatchScanner] INFO : All ranges
calculated: 3000 ranges found
2016-09-10 17:57:23,903 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 2833 ms
2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 2536 ms
2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 2150 ms
2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 2061 ms
2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:57:35,628 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 2140 ms

Query code is available
https://github.com/joshelser/accumulo-range-binning
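
For readers who don't want to open the repo, the two task variants described above look roughly like this. This is a hand-written sketch under assumed names (conn, tableName, partition), not the repo's code; the BatchScanner thread count and error handling in the actual benchmark may differ.

    import org.apache.accumulo.core.client.BatchScanner;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.TableNotFoundException;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import java.util.List;
    import java.util.Map;

    // Variant 1: every Range in the partition through a single BatchScanner.
    static void batchScannerTask(Connector conn, String tableName, List<Range> partition)
        throws TableNotFoundException {
      BatchScanner bs = conn.createBatchScanner(tableName, Authorizations.EMPTY, 1);
      bs.setRanges(partition);
      for (Map.Entry<Key,Value> entry : bs) {
        // drain results; the benchmark only measures elapsed time
      }
      bs.close();
    }

    // Variant 2: each Range in serial through its own Scanner.
    static void scannerTask(Connector conn, String tableName, List<Range> partition)
        throws TableNotFoundException {
      for (Range r : partition) {
        Scanner s = conn.createScanner(tableName, Authorizations.EMPTY);
        s.setRange(r);
        for (Map.Entry<Key,Value> entry : s) {
          // drain results
        }
        s.close();
      }
    }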

Sven Hodapp wrote:
Hi Keith,

I've tried it with 1, 2, and 10 threads. Unfortunately there were no
significant differences.
Maybe it's a problem with the table structure? For example, one row ID
(e.g. a sentence) may have several thousand column families. Can this
affect the seek performance?

My initial example has about 3000 row IDs to seek, which will return
about 500k entries. If I filter for specific column families (e.g. a
document without annotations) it will return about 5k entries, but the
seek time is only halved.
Are there too many column families to seek quickly?

Thanks!

Regards,
Sven
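
For reference, the column-family filtering Sven describes is a server-side restriction on the scanner. A minimal example, with made-up table, row, and family names, might look like:

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.TableNotFoundException;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.io.Text;
    import java.util.Map;

    // Scan one row (e.g. one sentence) but only return a single column family.
    static long scanOneFamily(Connector conn, String tableName)
        throws TableNotFoundException {
      Scanner s = conn.createScanner(tableName, Authorizations.EMPTY);
      s.setRange(Range.exact(new Text("sentence-0001")));   // illustrative row ID
      s.fetchColumnFamily(new Text("document"));            // only this family comes back
      long count = 0;
      for (Map.Entry<Key,Value> entry : s) {
        count++;
      }
      s.close();
      return count;
    }

Filtering this way reduces what is sent back to the client, but unless the families are separated into locality groups the tablet server still has to seek past the other families within the row, which may be one reason the seek time only roughly halves rather than shrinking in proportion to the entry count.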
