accumulo-user mailing list archives

From Eric Newton <>
Subject Re: number of query threads for batch scanner
Date Fri, 28 Sep 2012 02:39:25 GMT
The threads used by the batch scanner are mostly there to spread
I/O across different servers.

If you have 50 matching ranges, and they are on 25 machines, and you
have 10 threads, you won't get much parallelism on any one server.

If you have 50 matching ranges, and they are on 2 machines, and you
have 10 threads, you will get parallel queries on each server.

But if you need parallelism on your tablet server because your data
seems to be uneven (4 tablets on one server, but 1 each on 10 other
servers), perhaps you need a different balancing strategy.
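
The two cases above can be put into numbers with a toy model. This is only a sketch under a stated assumption: that the client divides its threads evenly over the servers it has to contact, and that a server never runs more concurrent queries than it has matching ranges. It is not the actual Accumulo scheduling code, and the class and method names are made up for illustration.

```java
public class ThreadSpreadModel {

    /**
     * Rough estimate of how many queries run concurrently on one server
     * that is currently being contacted.
     * ASSUMPTION (not taken from Accumulo source): threads are divided
     * evenly over the servers, and ranges are spread evenly as well.
     */
    static int perServerParallelism(int ranges, int servers, int threads) {
        if (ranges <= 0 || servers <= 0 || threads <= 0) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        // Ranges that land on one server, rounded up.
        int rangesPerServer = (ranges + servers - 1) / servers;
        // Threads available per server; a contacted server gets at least one.
        int threadsPerServer = Math.max(1, threads / servers);
        // Parallelism is limited by whichever of the two is smaller.
        return Math.min(rangesPerServer, threadsPerServer);
    }

    public static void main(String[] args) {
        // 50 ranges on 25 machines with 10 threads: one query per server.
        System.out.println(perServerParallelism(50, 25, 10));
        // 50 ranges on 2 machines with 10 threads: several queries per server.
        System.out.println(perServerParallelism(50, 2, 10));
    }
}
```

Under this model the first case gives 1 and the second gives 5, which matches the intuition that extra threads only turn into per-server parallelism once there are more threads than servers.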


On Wed, Sep 26, 2012 at 9:19 AM, ameet kini <> wrote:
> So I decided to try something different, and changed my splitting policy.
> This ended up with more tablets per tablet server. Interestingly, this
> bumped up my maximum concurrent scans on that tablet server. With about 19
> tablets, I was able to go up to 6 concurrent scans, which ended up using all
> my cores - happy! And I didn't change my numQueryThreads parameter from the
> already very high number.
> But that leaves me wondering whether the maximum number of concurrent scans
> on a given tablet server is related to the number of tablets hit by that
> scan on the tablet server. If true, that is interesting, and not what I'd
> expected. Given that the underlying files are immutable, I'm not sure why
> there can't be, say, 4 concurrent scans on 1 tablet if there were 4 cores
> free to host those scans. What I'm seeing, as described above, is I need to
> further split my tablet into > 4 tablets in order to have 4 concurrent
> scans.
> Ameet
> On Tue, Sep 25, 2012 at 3:23 PM, ameet kini <> wrote:
>> I should also state the not-so-obvious that my Range spans the entire
>> range of the four tablets in question.
>> Ameet
>> On Tue, Sep 25, 2012 at 3:17 PM, ameet kini <> wrote:
>>> Thanks William.
>>> The issue here is that without knowing how the numQueryThreads translates
>>> to the number of concurrent scans, I cannot effectively tune that parameter
>>> to maximize resource usage on the tablet server. What I'm seeing is that
>>> even though there are four tablets on the tablet server, my number of
>>> concurrent scans never exceeds 3. This is despite setting numQueryThreads to
>>> a very high number and having 8 cores on the tablet server. I suspect with 3
>>> concurrent scans and no garbage collection happening at that moment, most of
>>> the cores are sitting idle.
>>> Ameet
>>> On Tue, Sep 25, 2012 at 3:08 PM, William Slacum
>>> <> wrote:
>>>> It should really be dependent upon the resources available to the
>>>> client. You can set an arbitrarily high number of threads, but you're still
>>>> bound by the number of parallel operations the CPU can make. I would assume
>>>> the sweet spot is somewhere around that number. Try doing a small
>>>> benchmark with 2, 4, 8, 16, etc. threads and see where your performance
>>>> starts to level off.
>>>> On Tue, Sep 25, 2012 at 11:45 AM, ameet kini <>
>>>> wrote:
>>>>> Probably worth adding that the table mentioned below has a bunch of
>>>>> tablets on other tablet servers as well, which is why I'm using
>>>>> BatchScanner. I'm just not sure how the numQueryThreads relates to the
>>>>> number of concurrent scans on a given tablet server.
>>>>> Thanks
>>>>> On Tue, Sep 25, 2012 at 2:22 PM, ameet kini <>
>>>>> wrote:
>>>>>> I have a table with 4 tablets on a given tablet server. Depending on
>>>>>> the numQueryThreads parameter below, I see a varying number of maximum
>>>>>> concurrent scans on that table. This maximum number varies from 1 to 3
>>>>>> (i.e., some values for numQueryThreads result in a maximum of 1
>>>>>> concurrent scan, some values result in 2 concurrent scans, etc.). Can
>>>>>> someone shed light on the relationship between numQueryThreads and the
>>>>>> number of concurrent scans?
>>>>>> public BatchScanner createBatchScanner(String tableName,
>>>>>>                                        Authorizations authorizations,
>>>>>>                                        int numQueryThreads)
>>>>>> A follow-on question would be what is the general rule of thumb for
>>>>>> setting numQueryThreads? Should it be set to the # of hosted tablets
>>>>>> expected to be consumed by that BatchScanner? Should it be the # of
>>>>>> tablet servers expected to be hit by that BatchScanner? Something else?
>>>>>> Thanks,
>>>>>> Ameet
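
The benchmark suggested earlier in the thread (try 2, 4, 8, 16 threads and see where performance levels off) could be driven by a small harness like the one below. This is a minimal sketch: the `ThreadSweep` class, the `sweep` method, and the sleeping stand-in workload are all hypothetical. In a real run the workload would create a BatchScanner with numQueryThreads set to the given value, read every result, and close the scanner.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntConsumer;

public class ThreadSweep {

    /**
     * Runs the workload once per thread count and records the elapsed
     * wall-clock time in milliseconds, keeping the order of the input.
     */
    static Map<Integer, Long> sweep(int[] threadCounts, IntConsumer workload) {
        Map<Integer, Long> timings = new LinkedHashMap<>();
        for (int n : threadCounts) {
            long start = System.nanoTime();
            workload.accept(n); // the workload receives the thread count to use
            timings.put(n, (System.nanoTime() - start) / 1_000_000);
        }
        return timings;
    }

    public static void main(String[] args) {
        // Placeholder workload (an assumption, not a real scan): it only
        // sleeps. Replace it with code that scans using n query threads.
        IntConsumer standIn = n -> {
            try {
                Thread.sleep(Math.max(5, 80 / n));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        sweep(new int[] {2, 4, 8, 16}, standIn)
            .forEach((n, ms) -> System.out.println(n + " threads: " + ms + " ms"));
    }
}
```

A single pass per thread count is noisy; repeating each measurement a few times and discarding the first run (to let caches warm up) would give more trustworthy numbers.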
