accumulo-user mailing list archives

From Michael Moss <michael.m...@gmail.com>
Subject Re: Accumulo Seek performance
Date Wed, 14 Sep 2016 16:35:00 GMT
Setting the log level to trace helps, but overall, lack of "traditional" db
metrics has been a huge pain point for us as well.

On Wed, Sep 14, 2016 at 10:04 AM, Josh Elser <josh.elser@gmail.com> wrote:

> Nope! My test harness (the github repo) doesn't show any noticeable
> difference between BatchScanner and Scanner. Would have to do more digging
> with Sven to figure out what's happening.
>
> One takeaway is lack of metrics to tell us what is actually happening is a
> major defect, imo.
>
> On Sep 14, 2016 9:53 AM, "Dylan Hutchison" <dhutchis@cs.washington.edu>
> wrote:
>
>> Do we have a (hopefully reproducible) conclusion from this thread,
>> regarding Scanners and BatchScanners?
>>
>> On Sep 13, 2016 11:17 PM, "Josh Elser" <josh.elser@gmail.com> wrote:
>>
>>> Yeah, this seems to have been osx causing me grief.
>>>
>>> Spun up a 3tserver cluster (on openstack, even) and reran the same
>>> experiment. I could not reproduce the issues, even without substantial
>>> config tweaking.
>>>
>>> Josh Elser wrote:
>>>
>>>> I'm playing around with this a little more today and something is
>>>> definitely weird on my local machine. I'm seeing insane spikes in
>>>> performance using Scanners too.
>>>>
>>>> Coupled with Keith's inability to repro this, I am starting to think
>>>> that these are not worthwhile numbers to put weight behind. Something I
>>>> haven't been able to figure out is quite screwy for me.
>>>>
>>>> Josh Elser wrote:
>>>>
>>>>> Sven, et al:
>>>>>
>>>>> So, it would appear that I have been able to reproduce this one (better
>>>>> late than never, I guess...). tl;dr Serially using Scanners to do point
>>>>> lookups instead of a BatchScanner is ~20x faster. This sounds like a
>>>>> pretty serious performance issue to me.
>>>>>
>>>>> Here's a general outline for what I did.
>>>>>
>>>>> * Accumulo 1.8.0
>>>>> * Created a table with 1M rows, each row with 10 columns using YCSB
>>>>> (workloada)
>>>>> * Split the table into 9 tablets
>>>>> * Computed the set of all rows in the table
>>>>>
>>>>> For a number of iterations:
>>>>> * Shuffle this set of rows
>>>>> * Choose the first N rows
>>>>> * Construct an equivalent set of Ranges from the set of Rows, choosing a
>>>>> random column (0-9)
>>>>> * Partition the N rows into X collections
>>>>> * Submit X tasks to query one partition of the N rows (to a thread pool
>>>>> with X fixed threads)
>>>>>
>>>>> I have two implementations of these tasks. One, where all ranges in a
>>>>> partition are executed via one BatchScanner. A second where each range is
>>>>> executed in serial using a Scanner. The numbers speak for themselves.
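
The outline above can be sketched roughly as follows. This is not the code from the linked repository: the class name `RangePartitionSketch`, the `"row:fieldN"` string encoding of a lookup, and the `queryTask` callback (a stand-in for either the BatchScanner or the serial Scanner implementation) are all assumptions made for illustration, and no Accumulo client calls are shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class RangePartitionSketch {

    // Deal the lookups round-robin into x partitions of near-equal size.
    static <T> List<List<T>> partition(List<T> items, int x) {
        List<List<T>> parts = new ArrayList<>();
        for (int i = 0; i < x; i++) {
            parts.add(new ArrayList<>());
        }
        for (int i = 0; i < items.size(); i++) {
            parts.get(i % x).add(items.get(i));
        }
        return parts;
    }

    // One iteration of the outline; returns elapsed wall-clock milliseconds.
    // queryTask is a hypothetical stand-in for the real lookup code.
    static long runIteration(List<String> allRows, int n, int x, Random rnd,
                             Function<List<String>, Integer> queryTask) {
        // Shuffle a copy of the rows and keep the first n.
        List<String> shuffled = new ArrayList<>(allRows);
        Collections.shuffle(shuffled, rnd);
        List<String> lookups = new ArrayList<>();
        for (String row : shuffled.subList(0, Math.min(n, shuffled.size()))) {
            // Pair each row with a random column index in 0-9 (encoding is assumed).
            lookups.add(row + ":field" + rnd.nextInt(10));
        }
        // Submit one task per partition to a pool of x fixed threads.
        ExecutorService pool = Executors.newFixedThreadPool(x);
        try {
            long start = System.nanoTime();
            List<Future<Integer>> futures = new ArrayList<>();
            for (List<String> part : partition(lookups, x)) {
                futures.add(pool.submit(() -> queryTask.apply(part)));
            }
            for (Future<Integer> f : futures) {
                f.get(); // wait for every partition to finish
            }
            return (System.nanoTime() - start) / 1_000_000L;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add("user" + i);
        }
        // The task here only counts its lookups; a real task would query the table.
        long ms = runIteration(rows, 300, 6, new Random(42), List::size);
        System.out.println("Queries executed in " + ms + " ms");
    }
}
```

In a real comparison, only the body of `queryTask` would differ between the two runs, so the shuffling, partitioning and thread pool stay identical for both implementations.
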
>>>>>
>>>>> ** BatchScanners **
>>>>> 2016-09-10 17:51:38,811 [joshelser.YcsbBatchScanner] INFO : Shuffled all rows
>>>>> 2016-09-10 17:51:38,843 [joshelser.YcsbBatchScanner] INFO : All ranges calculated: 3000 ranges found
>>>>> 2016-09-10 17:51:38,846 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>>>> 2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Queries executed in 40178 ms
>>>>> 2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>>>> 2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Queries executed in 42296 ms
>>>>> 2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>>>> 2016-09-10 17:53:47,414 [joshelser.YcsbBatchScanner] INFO : Queries executed in 46094 ms
>>>>> 2016-09-10 17:53:47,415 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>>>> 2016-09-10 17:54:35,118 [joshelser.YcsbBatchScanner] INFO : Queries executed in 47704 ms
>>>>> 2016-09-10 17:54:35,119 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>>>> 2016-09-10 17:55:24,339 [joshelser.YcsbBatchScanner] INFO : Queries executed in 49221 ms
>>>>>
>>>>> ** Scanners **
>>>>> 2016-09-10 17:57:23,867 [joshelser.YcsbBatchScanner] INFO : Shuffled all rows
>>>>> 2016-09-10 17:57:23,898 [joshelser.YcsbBatchScanner] INFO : All ranges calculated: 3000 ranges found
>>>>> 2016-09-10 17:57:23,903 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>>>> 2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2833 ms
>>>>> 2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>>>> 2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2536 ms
>>>>> 2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>>>> 2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2150 ms
>>>>> 2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>>>> 2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2061 ms
>>>>> 2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>>>> 2016-09-10 17:57:35,628 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2140 ms
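
As a quick check on the "~20x" figure, the ten "Queries executed in" timings logged above can be averaged per implementation and compared. The class and field names below are made up for this sketch; the numbers are copied from the log lines. The ratio of the two means comes out at roughly 19x, in line with the tl;dr.

```java
import java.util.Arrays;

class ScanTimingSummary {

    // "Queries executed in N ms" values from the BatchScanner run.
    static final long[] BATCH_SCANNER_MS = {40178, 42296, 46094, 47704, 49221};

    // "Queries executed in N ms" values from the serial Scanner run.
    static final long[] SCANNER_MS = {2833, 2536, 2150, 2061, 2140};

    // Arithmetic mean of the timings, or NaN for an empty array.
    static double mean(long[] millis) {
        return Arrays.stream(millis).average().orElse(Double.NaN);
    }

    // How many times slower the BatchScanner run was, on average.
    static double slowdown() {
        return mean(BATCH_SCANNER_MS) / mean(SCANNER_MS);
    }

    public static void main(String[] args) {
        System.out.printf("BatchScanner mean: %.1f ms%n", mean(BATCH_SCANNER_MS));
        System.out.printf("Scanner mean:      %.1f ms%n", mean(SCANNER_MS));
        System.out.printf("Ratio:             %.1fx%n", slowdown());
    }
}
```
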
>>>>>
>>>>> Query code is available at
>>>>> https://github.com/joshelser/accumulo-range-binning
>>>>>
>>>>> Sven Hodapp wrote:
>>>>>
>>>>>> Hi Keith,
>>>>>>
>>>>>> I've tried it with 1, 2 or 10 threads. Unfortunately there were no
>>>>>> amazing differences.
>>>>>> Maybe it's a problem with the table structure? For example it may
>>>>>> happen that one row id (e.g. a sentence) has several thousand column
>>>>>> families. Can this affect the seek performance?
>>>>>>
>>>>>> So for my initial example it has about 3000 row ids to seek, which
>>>>>> will return about 500k entries. If I filter for specific column
>>>>>> families (e.g. a document without annotations) it will return about 5k
>>>>>> entries, but the seek time will only be halved.
>>>>>> Are there too many column families to seek it fast?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Regards,
>>>>>> Sven
>>>>>>
>>>>>>
