accumulo-user mailing list archives

From: Adam Fuchs <afu...@apache.org>
Subject: Re: Accumulo Seek performance
Date: Mon, 12 Sep 2016 17:20:07 GMT
Sorry, Monday morning poor reading skills, I guess. :)

So, 3000 ranges in 40 seconds with the BatchScanner. In my past experience
HDFS seeks tend to take something like 10-100ms, and I would expect that
time to dominate here. With 60 client threads your bottleneck should be the
readahead pool, which I believe defaults to 16 threads. If you get perfect
index caching then you should be seeing something like 3000/16*50ms =
9,375ms. That's in the right ballpark, but it assumes no data cache hits.
Do you have any idea of how many files you had per tablet after the ingest?
Do you know what your cache hit rate was?
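
That estimate spelled out, assuming the pool in question is the one sized by
tserver.readahead.concurrent.max (default 16) and an assumed ~50ms cost per
uncached seek:

    // back-of-envelope: 3000 ranges funneled through a 16-thread readahead
    // pool, each paying an assumed ~50ms uncached seek
    public class SeekEstimate {
        public static void main(String[] args) {
            int ranges = 3000;
            int readaheadThreads = 16;  // assumed default of tserver.readahead.concurrent.max
            double seekMillis = 50.0;   // assumed uncached seek cost
            double totalMillis = (double) ranges / readaheadThreads * seekMillis;
            System.out.printf("estimated total: ~%.0f ms%n", totalMillis);  // ~9375 ms
        }
    }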

Adam


On Mon, Sep 12, 2016 at 9:14 AM, Josh Elser <josh.elser@gmail.com> wrote:

> 5 iterations, figured that would be apparent from the log messages :)
>
> The code is already posted in my original message.
>
> Adam Fuchs wrote:
>
>> Josh,
>>
>> Two questions:
>>
>> 1. How many iterations did you do? I would like to see an absolute
>> number of lookups per second to compare against other observations.
>>
>> 2. Can you post your code somewhere so I can run it?
>>
>> Thanks,
>> Adam
>>
>>
>> On Sat, Sep 10, 2016 at 3:01 PM, Josh Elser <josh.elser@gmail.com> wrote:
>>
>>     Sven, et al:
>>
>>     So, it would appear that I have been able to reproduce this one
>>     (better late than never, I guess...). tl;dr Serially using Scanners
>>     to do point lookups instead of a BatchScanner is ~20x faster. This
>>     sounds like a pretty serious performance issue to me.
>>
>>     Here's a general outline for what I did.
>>
>>     * Accumulo 1.8.0
>>     * Created a table with 1M rows, each row with 10 columns using YCSB
>>     (workloada)
>>     * Split the table into 9 tablets
>>     * Computed the set of all rows in the table (see the sketch below)
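>>
>>     A rough sketch of those last two setup steps (the table name and the
>>     split points are assumptions here, not taken from the actual code,
>>     which is linked at the end of this message):
>>
>>         // pre-split into 9 tablets, then collect every distinct row id
>>         List<String> setUpSplitsAndRows(Connector conn) throws Exception {
>>           SortedSet<Text> splits = new TreeSet<>();
>>           for (int i = 1; i <= 8; i++) {
>>             splits.add(new Text("user" + i));  // 8 split points => 9 tablets
>>           }
>>           conn.tableOperations().addSplits("usertable", splits);
>>
>>           List<String> allRows = new ArrayList<>();
>>           Scanner scanner = conn.createScanner("usertable", Authorizations.EMPTY);
>>           for (Entry<Key,Value> e : scanner) {
>>             String row = e.getKey().getRow().toString();
>>             // entries come back sorted, so only compare against the last row seen
>>             if (allRows.isEmpty() || !allRows.get(allRows.size() - 1).equals(row)) {
>>               allRows.add(row);
>>             }
>>           }
>>           return allRows;
>>         }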
>>
>>     For a number of iterations:
>>     * Shuffle this set of rows
>>     * Choose the first N rows
>>     * Construct an equivalent set of Ranges from the set of Rows,
>>     choosing a random column (0-9)
>>     * Partition the N rows into X collections
>>     * Submit X tasks to query one partition of the N rows (to a thread
>>     pool with X fixed threads); one iteration is sketched below
>>
>>     I have two implementations of these tasks. One where all ranges in
>>     a partition are executed via one BatchScanner, and a second where
>>     each range is executed serially using a Scanner. The numbers speak
>>     for themselves.
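>>
>>     Roughly, the two task bodies look like this (the table name, auths,
>>     and the "conn" Connector field are placeholders):
>>
>>         // Task 1: all ranges in the partition through a single BatchScanner.
>>         long lookupWithBatchScanner(List<Range> part) throws Exception {
>>           long count = 0;
>>           BatchScanner bs = conn.createBatchScanner("usertable", Authorizations.EMPTY, 10);
>>           try {
>>             bs.setRanges(part);
>>             for (Entry<Key,Value> e : bs) {
>>               count++;  // drain everything the scan returns
>>             }
>>           } finally {
>>             bs.close();
>>           }
>>           return count;
>>         }
>>
>>         // Task 2: each range in the partition serially, one Scanner per range.
>>         long lookupWithScanners(List<Range> part) throws Exception {
>>           long count = 0;
>>           for (Range r : part) {
>>             Scanner s = conn.createScanner("usertable", Authorizations.EMPTY);
>>             s.setRange(r);
>>             for (Entry<Key,Value> e : s) {
>>               count++;  // drain everything the scan returns
>>             }
>>           }
>>           return count;
>>         }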
>>
>>     ** BatchScanners **
>>     2016-09-10 17:51:38,811 [joshelser.YcsbBatchScanner] INFO : Shuffled
>>     all rows
>>     2016-09-10 17:51:38,843 [joshelser.YcsbBatchScanner] INFO : All
>>     ranges calculated: 3000 ranges found
>>     2016-09-10 17:51:38,846 [joshelser.YcsbBatchScanner] INFO :
>>     Executing 6 range partitions using a pool of 6 threads
>>     2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Queries
>>     executed in 40178 ms
>>     2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO :
>>     Executing 6 range partitions using a pool of 6 threads
>>     2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Queries
>>     executed in 42296 ms
>>     2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO :
>>     Executing 6 range partitions using a pool of 6 threads
>>     2016-09-10 17:53:47,414 [joshelser.YcsbBatchScanner] INFO : Queries
>>     executed in 46094 ms
>>     2016-09-10 17:53:47,415 [joshelser.YcsbBatchScanner] INFO :
>>     Executing 6 range partitions using a pool of 6 threads
>>     2016-09-10 17:54:35,118 [joshelser.YcsbBatchScanner] INFO : Queries
>>     executed in 47704 ms
>>     2016-09-10 17:54:35,119 [joshelser.YcsbBatchScanner] INFO :
>>     Executing 6 range partitions using a pool of 6 threads
>>     2016-09-10 17:55:24,339 [joshelser.YcsbBatchScanner] INFO : Queries
>>     executed in 49221 ms
>>
>>     ** Scanners **
>>     2016-09-10 17:57:23,867 [joshelser.YcsbBatchScanner] INFO : Shuffled
>>     all rows
>>     2016-09-10 17:57:23,898 [joshelser.YcsbBatchScanner] INFO : All
>>     ranges calculated: 3000 ranges found
>>     2016-09-10 17:57:23,903 [joshelser.YcsbBatchScanner] INFO :
>>     Executing 6 range partitions using a pool of 6 threads
>>     2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Queries
>>     executed in 2833 ms
>>     2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO :
>>     Executing 6 range partitions using a pool of 6 threads
>>     2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Queries
>>     executed in 2536 ms
>>     2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO :
>>     Executing 6 range partitions using a pool of 6 threads
>>     2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Queries
>>     executed in 2150 ms
>>     2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO :
>>     Executing 6 range partitions using a pool of 6 threads
>>     2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Queries
>>     executed in 2061 ms
>>     2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO :
>>     Executing 6 range partitions using a pool of 6 threads
>>     2016-09-10 17:57:35,628 [joshelser.YcsbBatchScanner] INFO : Queries
>>     executed in 2140 ms
>>
>>     Query code is available
>>     https://github.com/joshelser/accumulo-range-binning
>>
>>
>>     Sven Hodapp wrote:
>>
>>         Hi Keith,
>>
>>         I've tried it with 1, 2, or 10 threads. Unfortunately there were
>>         no significant differences.
>>         Maybe it's a problem with the table structure? For example it
>>         may happen that one row id (e.g. a sentence) has several
>>         thousand column families. Can this affect the seek performance?
>>
>>         So my initial example has about 3000 row ids to seek, which
>>         will return about 500k entries. If I filter for specific column
>>         families (e.g. a document without annotations) it returns only
>>         about 5k entries, but the seek time is merely halved. Are there
>>         too many column families to seek quickly?
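>>
>>         By "filter" I mean roughly restricting the scanner to a family
>>         before iterating, something like (the family name is made up):
>>
>>             scanner.fetchColumnFamily(new Text("someFamily"));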
>>
>>         Thanks!
>>
>>         Regards,
>>         Sven
>>
>>
>>
