accumulo-user mailing list archives

From Keith Turner <ke...@deenlo.com>
Subject Re: Accumulo Seek performance
Date Mon, 12 Sep 2016 18:05:19 GMT
Note: I was running a single tserver, datanode, and zookeeper on my workstation.

On Mon, Sep 12, 2016 at 2:02 PM, Keith Turner <keith@deenlo.com> wrote:
> Josh helped me get up and running w/ YCSB and I am seeing very
> different results.  I am going to make a pull request to Josh's GH repo
> to add a README w/ what I learned from Josh in IRC.
>
> The link below is the Accumulo config I used for running a local 1.8.0 instance.
>
> https://gist.github.com/keith-turner/4678a0aac2a2a0e240ea5d73285743ab
>
> I created splits user1~ user2~ user3~ user4~ user5~ user6~ user7~
> user8~ user9~ and then compacted the table.
>
> Below is the performance I saw with a single batch scanner (configured
> with 1 partition).  The batch scanner has 10 threads.
>
> 2016-09-12 12:36:41,079 [client.ClientConfiguration] WARN : Found no
> client.conf in default paths. Using default client configuration
> values.
> 2016-09-12 12:36:41,428 [joshelser.YcsbBatchScanner] INFO : Connected
> to Accumulo
> 2016-09-12 12:36:41,429 [joshelser.YcsbBatchScanner] INFO : Computing ranges
> 2016-09-12 12:36:48,059 [joshelser.YcsbBatchScanner] INFO : Calculated
> all rows: Found 1000000 rows
> 2016-09-12 12:36:48,096 [joshelser.YcsbBatchScanner] INFO : Shuffled all rows
> 2016-09-12 12:36:48,116 [joshelser.YcsbBatchScanner] INFO : All ranges
> calculated: 3000 ranges found
> 2016-09-12 12:36:48,118 [joshelser.YcsbBatchScanner] INFO : Executing
> 1 range partitions using a pool of 1 threads
> 2016-09-12 12:36:49,372 [joshelser.YcsbBatchScanner] INFO : Queries
> executed in 1252 ms
> 2016-09-12 12:36:49,372 [joshelser.YcsbBatchScanner] INFO : Executing
> 1 range partitions using a pool of 1 threads
> 2016-09-12 12:36:50,561 [joshelser.YcsbBatchScanner] INFO : Queries
> executed in 1188 ms
> 2016-09-12 12:36:50,561 [joshelser.YcsbBatchScanner] INFO : Executing
> 1 range partitions using a pool of 1 threads
> 2016-09-12 12:36:51,741 [joshelser.YcsbBatchScanner] INFO : Queries
> executed in 1179 ms
> 2016-09-12 12:36:51,741 [joshelser.YcsbBatchScanner] INFO : Executing
> 1 range partitions using a pool of 1 threads
> 2016-09-12 12:36:52,974 [joshelser.YcsbBatchScanner] INFO : Queries
> executed in 1233 ms
> 2016-09-12 12:36:52,974 [joshelser.YcsbBatchScanner] INFO : Executing
> 1 range partitions using a pool of 1 threads
> 2016-09-12 12:36:54,146 [joshelser.YcsbBatchScanner] INFO : Queries
> executed in 1171 ms
>
> Below is the performance I saw with 6 batch scanners. Each batch
> scanner has 10 threads.
>
> 2016-09-12 13:58:21,061 [client.ClientConfiguration] WARN : Found no
> client.conf in default paths. Using default client configuration
> values.
> 2016-09-12 13:58:21,380 [joshelser.YcsbBatchScanner] INFO : Connected
> to Accumulo
> 2016-09-12 13:58:21,381 [joshelser.YcsbBatchScanner] INFO : Computing ranges
> 2016-09-12 13:58:28,571 [joshelser.YcsbBatchScanner] INFO : Calculated
> all rows: Found 1000000 rows
> 2016-09-12 13:58:28,606 [joshelser.YcsbBatchScanner] INFO : Shuffled all rows
> 2016-09-12 13:58:28,632 [joshelser.YcsbBatchScanner] INFO : All ranges
> calculated: 3000 ranges found
> 2016-09-12 13:58:28,634 [joshelser.YcsbBatchScanner] INFO : Executing
> 6 range partitions using a pool of 6 threads
> 2016-09-12 13:58:30,273 [joshelser.YcsbBatchScanner] INFO : Queries
> executed in 1637 ms
> 2016-09-12 13:58:30,273 [joshelser.YcsbBatchScanner] INFO : Executing
> 6 range partitions using a pool of 6 threads
> 2016-09-12 13:58:31,883 [joshelser.YcsbBatchScanner] INFO : Queries
> executed in 1609 ms
> 2016-09-12 13:58:31,883 [joshelser.YcsbBatchScanner] INFO : Executing
> 6 range partitions using a pool of 6 threads
> 2016-09-12 13:58:33,422 [joshelser.YcsbBatchScanner] INFO : Queries
> executed in 1539 ms
> 2016-09-12 13:58:33,422 [joshelser.YcsbBatchScanner] INFO : Executing
> 6 range partitions using a pool of 6 threads
> 2016-09-12 13:58:34,994 [joshelser.YcsbBatchScanner] INFO : Queries
> executed in 1571 ms
> 2016-09-12 13:58:34,994 [joshelser.YcsbBatchScanner] INFO : Executing
> 6 range partitions using a pool of 6 threads
> 2016-09-12 13:58:36,512 [joshelser.YcsbBatchScanner] INFO : Queries
> executed in 1517 ms
>
> Below is the performance I saw with 6 threads each using a scanner.
>
> 2016-09-12 14:01:14,972 [client.ClientConfiguration] WARN : Found no
> client.conf in default paths. Using default client configuration
> values.
> 2016-09-12 14:01:15,287 [joshelser.YcsbBatchScanner] INFO : Connected
> to Accumulo
> 2016-09-12 14:01:15,288 [joshelser.YcsbBatchScanner] INFO : Computing ranges
> 2016-09-12 14:01:22,309 [joshelser.YcsbBatchScanner] INFO : Calculated
> all rows: Found 1000000 rows
> 2016-09-12 14:01:22,352 [joshelser.YcsbBatchScanner] INFO : Shuffled all rows
> 2016-09-12 14:01:22,373 [joshelser.YcsbBatchScanner] INFO : All ranges
> calculated: 3000 ranges found
> 2016-09-12 14:01:22,376 [joshelser.YcsbBatchScanner] INFO : Executing
> 6 range partitions using a pool of 6 threads
> 2016-09-12 14:01:25,696 [joshelser.YcsbBatchScanner] INFO : Queries
> executed in 3318 ms
> 2016-09-12 14:01:25,696 [joshelser.YcsbBatchScanner] INFO : Executing
> 6 range partitions using a pool of 6 threads
> 2016-09-12 14:01:29,001 [joshelser.YcsbBatchScanner] INFO : Queries
> executed in 3305 ms
> 2016-09-12 14:01:29,001 [joshelser.YcsbBatchScanner] INFO : Executing
> 6 range partitions using a pool of 6 threads
> 2016-09-12 14:01:31,824 [joshelser.YcsbBatchScanner] INFO : Queries
> executed in 2822 ms
> 2016-09-12 14:01:31,824 [joshelser.YcsbBatchScanner] INFO : Executing
> 6 range partitions using a pool of 6 threads
> 2016-09-12 14:01:34,207 [joshelser.YcsbBatchScanner] INFO : Queries
> executed in 2383 ms
> 2016-09-12 14:01:34,207 [joshelser.YcsbBatchScanner] INFO : Executing
> 6 range partitions using a pool of 6 threads
> 2016-09-12 14:01:36,548 [joshelser.YcsbBatchScanner] INFO : Queries
> executed in 2340 ms
>
> On Sat, Sep 10, 2016 at 6:01 PM, Josh Elser <josh.elser@gmail.com> wrote:
>> Sven, et al:
>>
>> So, it would appear that I have been able to reproduce this one (better late
>> than never, I guess...). tl;dr Serially using Scanners to do point lookups
>> instead of a BatchScanner is ~20x faster. This sounds like a pretty serious
>> performance issue to me.
>>
>> Here's a general outline for what I did.
>>
>> * Accumulo 1.8.0
>> * Created a table with 1M rows, each row with 10 columns using YCSB
>> (workloada)
>> * Split the table into 9 tablets
>> * Computed the set of all rows in the table
>>
>> For a number of iterations:
>> * Shuffle this set of rows
>> * Choose the first N rows
>> * Construct an equivalent set of Ranges from the set of Rows, choosing a
>> random column (0-9)
>> * Partition the N rows into X collections
>> * Submit X tasks to query one partition of the N rows (to a thread pool with
>> X fixed threads)
>>
>> I have two implementations of these tasks. One, where all ranges in a
>> partition are executed via one BatchScanner. A second where each range is
>> executed in serial using a Scanner. The numbers speak for themselves.
>>
>> ** BatchScanners **
>> 2016-09-10 17:51:38,811 [joshelser.YcsbBatchScanner] INFO : Shuffled all
>> rows
>> 2016-09-10 17:51:38,843 [joshelser.YcsbBatchScanner] INFO : All ranges
>> calculated: 3000 ranges found
>> 2016-09-10 17:51:38,846 [joshelser.YcsbBatchScanner] INFO : Executing 6
>> range partitions using a pool of 6 threads
>> 2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Queries executed
>> in 40178 ms
>> 2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Executing 6
>> range partitions using a pool of 6 threads
>> 2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Queries executed
>> in 42296 ms
>> 2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Executing 6
>> range partitions using a pool of 6 threads
>> 2016-09-10 17:53:47,414 [joshelser.YcsbBatchScanner] INFO : Queries executed
>> in 46094 ms
>> 2016-09-10 17:53:47,415 [joshelser.YcsbBatchScanner] INFO : Executing 6
>> range partitions using a pool of 6 threads
>> 2016-09-10 17:54:35,118 [joshelser.YcsbBatchScanner] INFO : Queries executed
>> in 47704 ms
>> 2016-09-10 17:54:35,119 [joshelser.YcsbBatchScanner] INFO : Executing 6
>> range partitions using a pool of 6 threads
>> 2016-09-10 17:55:24,339 [joshelser.YcsbBatchScanner] INFO : Queries executed
>> in 49221 ms
>>
>> ** Scanners **
>> 2016-09-10 17:57:23,867 [joshelser.YcsbBatchScanner] INFO : Shuffled all
>> rows
>> 2016-09-10 17:57:23,898 [joshelser.YcsbBatchScanner] INFO : All ranges
>> calculated: 3000 ranges found
>> 2016-09-10 17:57:23,903 [joshelser.YcsbBatchScanner] INFO : Executing 6
>> range partitions using a pool of 6 threads
>> 2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Queries executed
>> in 2833 ms
>> 2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Executing 6
>> range partitions using a pool of 6 threads
>> 2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Queries executed
>> in 2536 ms
>> 2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Executing 6
>> range partitions using a pool of 6 threads
>> 2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Queries executed
>> in 2150 ms
>> 2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Executing 6
>> range partitions using a pool of 6 threads
>> 2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Queries executed
>> in 2061 ms
>> 2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Executing 6
>> range partitions using a pool of 6 threads
>> 2016-09-10 17:57:35,628 [joshelser.YcsbBatchScanner] INFO : Queries executed
>> in 2140 ms
>>
>> Query code is available at https://github.com/joshelser/accumulo-range-binning
>>
>>
>> Sven Hodapp wrote:
>>>
>>> Hi Keith,
>>>
>>> I've tried it with 1, 2 or 10 threads. Unfortunately there were no
>>> significant differences.
>>> Maybe it's a problem with the table structure? For example it may happen
>>> that one row id (e.g. a sentence) has several thousand column families. Can
>>> this affect the seek performance?
>>>
>>> So for my initial example it has about 3000 row ids to seek, which will
>>> return about 500k entries. If I filter for specific column families (e.g. a
>>> document without annotations) it will return about 5k entries, but the seek
>>> time will only be halved.
>>> Are there too many column families for the seeks to be fast?
>>>
>>> Thanks!
>>>
>>> Regards,
>>> Sven
>>>
>>
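
The benchmark loop Josh outlines above (shuffle the rows, take the first N, partition them into X collections, submit X tasks to a fixed pool of X threads) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the accumulo-range-binning repository: the class name RangePartitionSketch, the round-robin partitioning, and the `lookup` callback that stands in for the BatchScanner or Scanner query are all hypothetical, and the random column choice is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToIntFunction;

public class RangePartitionSketch {

    // Split items into x partitions of near-equal size (round-robin; an assumption,
    // the original code may partition differently).
    static <T> List<List<T>> partition(List<T> items, int x) {
        List<List<T>> parts = new ArrayList<>();
        for (int i = 0; i < x; i++) {
            parts.add(new ArrayList<>());
        }
        for (int i = 0; i < items.size(); i++) {
            parts.get(i % x).add(items.get(i));
        }
        return parts;
    }

    // One iteration: shuffle, choose the first n rows, partition into x collections,
    // and submit x tasks to a fixed pool of x threads. `lookup` is a hypothetical
    // stand-in for the real query; it returns how many rows it handled.
    static int runIteration(List<String> allRows, int n, int x, Random rnd,
                            ToIntFunction<List<String>> lookup) throws Exception {
        List<String> shuffled = new ArrayList<>(allRows);
        Collections.shuffle(shuffled, rnd);
        List<String> chosen = shuffled.subList(0, n);

        ExecutorService pool = Executors.newFixedThreadPool(x);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (List<String> part : partition(chosen, x)) {
                Callable<Integer> task = () -> lookup.applyAsInt(part);
                futures.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> f : futures) {
                total += f.get(); // wait for every partition to finish
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            rows.add("user" + i);
        }
        long start = System.nanoTime();
        int handled = runIteration(rows, 3000, 6, new Random(42), List::size);
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Handled " + handled + " rows in " + ms + " ms");
    }
}
```

In this shape, the two implementations being compared would differ only in the `lookup` callback: one would hand the whole partition to a single BatchScanner, the other would loop over the partition and run one Scanner per range.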
