accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: Accumulo Seek performance
Date Wed, 14 Sep 2016 03:17:42 GMT
Yeah, this seems to have been OS X causing me grief.

Spun up a 3-tserver cluster (on OpenStack, even) and reran the same
experiment. I could not reproduce the issue, even without substantial
config tweaking.

Josh Elser wrote:
> I'm playing around with this a little more today and something is
> definitely weird on my local machine. I'm seeing insane spikes in
> performance using Scanners too.
>
> Coupled with Keith's inability to repro this, I am starting to think
> that these are not worthwhile numbers to put weight behind. Something I
> haven't been able to figure out is quite screwy for me.
>
> Josh Elser wrote:
>> Sven, et al:
>>
>> So, it would appear that I have been able to reproduce this one (better
>> late than never, I guess...). tl;dr Serially using Scanners to do point
>> lookups instead of a BatchScanner is ~20x faster. This sounds like a
>> pretty serious performance issue to me.
>>
>> Here's a general outline for what I did.
>>
>> * Accumulo 1.8.0
>> * Created a table with 1M rows, each row with 10 columns using YCSB
>> (workloada)
>> * Split the table into 9 tablets
>> * Computed the set of all rows in the table
>>
>> For a number of iterations:
>> * Shuffle this set of rows
>> * Choose the first N rows
>> * Construct an equivalent set of Ranges from the set of Rows, choosing a
>> random column (0-9)
>> * Partition the N rows into X collections
>> * Submit X tasks to query one partition of the N rows (to a thread pool
>> with X fixed threads)
>>
>> I have two implementations of these tasks. In the first, all ranges in a
>> partition are executed via one BatchScanner. In the second, each range is
>> executed serially using a Scanner. The numbers speak for themselves.
>>
>> ** BatchScanners **
>> 2016-09-10 17:51:38,811 [joshelser.YcsbBatchScanner] INFO : Shuffled all
>> rows
>> 2016-09-10 17:51:38,843 [joshelser.YcsbBatchScanner] INFO : All ranges
>> calculated: 3000 ranges found
>> 2016-09-10 17:51:38,846 [joshelser.YcsbBatchScanner] INFO : Executing 6
>> range partitions using a pool of 6 threads
>> 2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Queries
>> executed in 40178 ms
>> 2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Executing 6
>> range partitions using a pool of 6 threads
>> 2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Queries
>> executed in 42296 ms
>> 2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Executing 6
>> range partitions using a pool of 6 threads
>> 2016-09-10 17:53:47,414 [joshelser.YcsbBatchScanner] INFO : Queries
>> executed in 46094 ms
>> 2016-09-10 17:53:47,415 [joshelser.YcsbBatchScanner] INFO : Executing 6
>> range partitions using a pool of 6 threads
>> 2016-09-10 17:54:35,118 [joshelser.YcsbBatchScanner] INFO : Queries
>> executed in 47704 ms
>> 2016-09-10 17:54:35,119 [joshelser.YcsbBatchScanner] INFO : Executing 6
>> range partitions using a pool of 6 threads
>> 2016-09-10 17:55:24,339 [joshelser.YcsbBatchScanner] INFO : Queries
>> executed in 49221 ms
>>
>> ** Scanners **
>> 2016-09-10 17:57:23,867 [joshelser.YcsbBatchScanner] INFO : Shuffled all
>> rows
>> 2016-09-10 17:57:23,898 [joshelser.YcsbBatchScanner] INFO : All ranges
>> calculated: 3000 ranges found
>> 2016-09-10 17:57:23,903 [joshelser.YcsbBatchScanner] INFO : Executing 6
>> range partitions using a pool of 6 threads
>> 2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Queries
>> executed in 2833 ms
>> 2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Executing 6
>> range partitions using a pool of 6 threads
>> 2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Queries
>> executed in 2536 ms
>> 2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Executing 6
>> range partitions using a pool of 6 threads
>> 2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Queries
>> executed in 2150 ms
>> 2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Executing 6
>> range partitions using a pool of 6 threads
>> 2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Queries
>> executed in 2061 ms
>> 2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Executing 6
>> range partitions using a pool of 6 threads
>> 2016-09-10 17:57:35,628 [joshelser.YcsbBatchScanner] INFO : Queries
>> executed in 2140 ms
>>
>> Query code is available at
>> https://github.com/joshelser/accumulo-range-binning
>>
>> Sven Hodapp wrote:
>>> Hi Keith,
>>>
>>> I've tried it with 1, 2 or 10 threads. Unfortunately there were no
>>> significant differences.
>>> Maybe it's a problem with the table structure? For example it may
>>> happen that one row id (e.g. a sentence) has several thousand column
>>> families. Can this affect the seek performance?
>>>
>>> So for my initial example it has about 3000 row ids to seek, which
>>> will return about 500k entries. If I filter for specific column
>>> families (e.g. a document without annotations) it will return about 5k
>>> entries, but the seek time will only be halved.
>>> Are there too many column families to seek quickly?
>>>
>>> Thanks!
>>>
>>> Regards,
>>> Sven
>>>
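
The benchmark loop outlined in the quoted message above (shuffle the rows, take the first N, pick a random column 0-9 per row, split into X partitions, run X tasks on a fixed pool of X threads) can be sketched roughly as follows. This is a minimal stdlib-only sketch, not the code from the linked repository: the class name RangeBinningSketch, the Lookup holder, and the query callback are hypothetical stand-ins. With Accumulo, the callback would either hand the whole partition to one BatchScanner via setRanges(...) or loop over the partition calling Scanner.setRange(...) once per range.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

class RangeBinningSketch {

  /** Hypothetical stand-in for an Accumulo Range: one row plus one column index (0-9). */
  static final class Lookup {
    final String row;
    final int column;

    Lookup(String row, int column) {
      this.row = row;
      this.column = column;
    }
  }

  /** Splits items into at most x contiguous partitions of near-equal size (x must be > 0). */
  static <T> List<List<T>> partition(List<T> items, int x) {
    List<List<T>> parts = new ArrayList<>();
    int size = (items.size() + x - 1) / x; // ceiling division
    for (int i = 0; i < items.size(); i += size) {
      parts.add(new ArrayList<>(items.subList(i, Math.min(i + size, items.size()))));
    }
    return parts;
  }

  /** Runs one iteration and returns the elapsed query time in milliseconds. */
  static long runIteration(List<String> allRows, int n, int x, Random rnd,
      Consumer<List<Lookup>> query) throws Exception {
    // Shuffle the full row set and keep the first n rows.
    List<String> shuffled = new ArrayList<>(allRows);
    Collections.shuffle(shuffled, rnd);
    List<String> chosen = shuffled.subList(0, Math.min(n, shuffled.size()));

    // Build one lookup per chosen row, with a random column in 0-9.
    List<Lookup> lookups = new ArrayList<>();
    for (String row : chosen) {
      lookups.add(new Lookup(row, rnd.nextInt(10)));
    }

    // Partition into x collections and submit one task per partition.
    List<List<Lookup>> parts = partition(lookups, x);
    ExecutorService pool = Executors.newFixedThreadPool(x);
    try {
      long start = System.nanoTime();
      List<Future<?>> futures = new ArrayList<>();
      for (List<Lookup> part : parts) {
        futures.add(pool.submit(() -> query.accept(part)));
      }
      for (Future<?> f : futures) {
        f.get(); // wait for completion and surface any task failure
      }
      return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    List<String> rows = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      rows.add("user" + i); // made-up row ids, for illustration only
    }
    AtomicInteger seen = new AtomicInteger();
    // Placeholder query: just counts lookups instead of talking to a tserver.
    long ms = runIteration(rows, 300, 6, new Random(), part -> seen.addAndGet(part.size()));
    System.out.println("Queries executed in " + ms + " ms (" + seen.get() + " lookups)");
  }
}
```

Swapping only the callback between the two variants keeps the shuffling, partitioning, and thread pool identical, so any timing difference comes from the query path itself.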
