accumulo-user mailing list archives

From Keith Turner <ke...@deenlo.com>
Subject Re: number of query threads for batch scanner
Date Fri, 28 Sep 2012 16:10:56 GMT
On Fri, Sep 28, 2012 at 9:35 AM, ameet kini <ameetkini@gmail.com> wrote:
>
> Thanks Eric and Keith.
>
> Is there any reason why the number of concurrent scans on a given tablet
> server depends on the number of tablets and not the number of cores on that
> tablet server? I'm looking at TabletServerBatchReaderIterator.doLookups.

Not really.  RFile has optimizations for seeking forward (ACCUMULO-473
has some numbers from an experiment I did).   So the ranges against an
individual tablet are sorted and seeked in order.   If you did break
up multiple ranges going to a single tablet, I think it would be best
to sort them and give threads sub-sequences of the sorted list to work
on.   This avoids multiple threads reading from the same rfile block
and doing redundant work to decode it.  Feel free to open a ticket to
explore this concept.
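
As a rough illustration of the idea above (sort the ranges, then hand each thread a contiguous sub-sequence of the sorted list), here is a minimal sketch. It is not Accumulo code: the class name RangeChunker and the method chunkSorted are hypothetical, and plain Comparable values such as String row keys stand in for real Range objects.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Accumulo.
final class RangeChunker {

    // Sorts a copy of the input, then splits it into at most numThreads
    // contiguous sub-lists, so each worker only ever seeks forward through
    // its own slice and neighbouring keys stay with the same worker.
    static <T extends Comparable<T>> List<List<T>> chunkSorted(List<T> items, int numThreads) {
        if (numThreads < 1) {
            throw new IllegalArgumentException("numThreads must be >= 1");
        }
        List<T> sorted = new ArrayList<>(items);
        Collections.sort(sorted);

        int n = sorted.size();
        int parts = Math.min(numThreads, n); // never create empty chunks
        List<List<T>> chunks = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < parts; i++) {
            // Spread the remainder over the first (n % parts) chunks.
            int size = n / parts + (i < n % parts ? 1 : 0);
            chunks.add(new ArrayList<>(sorted.subList(start, start + size)));
            start += size;
        }
        return chunks;
    }
}
```

Each returned sub-list could then be submitted as one task to a thread pool. Whether this actually avoids redundant block decoding depends on how the keys fall across rfile blocks, which this sketch does not model.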

>
> Take Keith's example:
>
>  * For 1000 ranges that map to 1 tablet, it will execute 1 concurrent scan.
>
> Say, I had 8 cores on that tablet server and my tablet is large enough to
> warrant 8 concurrent scans. Sure, I can go about and further split my
> tablet, and get 8 concurrent scans - I ended up doing that. But is there any
> reason why 8 concurrent scans can't go against a single tablet? Maybe it's
> difficult to estimate benefits of parallelism at that level, and it's best
> left to users to tune the number of tablets, and base the level of
> parallelism on the number of tablets?
>
> Btw, the shell utility "merge -s <size>" rocks :)
>
> Thanks,
> Ameet
>
>
> On Fri, Sep 28, 2012 at 8:04 AM, Keith Turner <keith@deenlo.com> wrote:
>>
>> On Tue, Sep 25, 2012 at 3:17 PM, ameet kini <ameetkini@gmail.com> wrote:
>> > Thanks William.
>> >
>> > The issue here is that without knowing how the numQueryThreads
>> > translates to
>> > the number of concurrent scans, I cannot effectively tune that parameter
>> > to
>> > maximize resource usage on the tablet server. What I'm seeing is that
>> > even
>> > though there are four tablets on the tablet server, my number of
>> > concurrent
>> > scans never exceeds 3. This is despite setting numQueryThreads to a very
>> > high number and having 8 cores on the tablet server. I suspect with 3
>> > concurrent scans and no garbage collection happening at that moment,
>> > most of
>> > the cores are sitting idle.
>> >
>> > Ameet
>>
>> The amount of parallelism is determined by how your ranges map to
>> tablets. Below are some examples.
>>
>>  * For one range that maps to 10 tablets on 10 tablet servers, it will
>> execute 10 concurrent scans if numQueryThreads is >= 10.
>>  * For 1000 ranges that map to 10 tablets on 10 tablet servers, it
>> will execute 10 concurrent scans if numQueryThreads is >= 10.
>>  * For 1000 ranges that map to 10 tablets on 10 tablet servers, it
>> will execute 5 concurrent scans if numQueryThreads is 5.
>>  * For 1000 ranges that map to 1 tablet, it will execute 1 concurrent
>> scan.
>>
>> If you have more query threads than tablet servers, the client code
>> will try to execute concurrent scans on a single tablet server.
>>
>> You can look at TabletServerBatchReaderIterator.doLookups() for the
>> details.  In this method it creates QueryTask objects and places them
>> on a thread pool.  The size of the thread pool is the user specified
>> numQueryThreads.
>>
>> >
>> > On Tue, Sep 25, 2012 at 3:08 PM, William Slacum
>> > <wilhelm.von.cloud@accumulo.net> wrote:
>> >>
>> >> It should really be dependent upon the resources available to the
>> >> client.
>> >> You can set an arbitrarily high number of threads, but you're still
>> >> bound by
>> >> the number of parallel operations the CPU can make. I would assume the
>> >> sweet
>> >> spot is somewhere around that number-- try doing a small bench mark
>> >> with 2,
>> >> 4, 8, 16, etc threads and see where your performance starts to level
>> >> off.
>> >>
>> >>
>> >> On Tue, Sep 25, 2012 at 11:45 AM, ameet kini <ameetkini@gmail.com>
>> >> wrote:
>> >>>
>> >>> Probably worth adding that the table mentioned below has a bunch of
>> >>> tablets on other tablet servers as well, which is why I'm using
>> >>> BatchScanner. I'm just not sure how the numQueryThreads relates to the
>> >>> number of concurrent scans on a given tablet server.
>> >>>
>> >>> Thanks
>> >>>
>> >>>
>> >>> On Tue, Sep 25, 2012 at 2:22 PM, ameet kini <ameetkini@gmail.com>
>> >>> wrote:
>> >>>>
>> >>>>
>> >>>> I have a table with 4 tablets on a given tablet server. Depending on the
>> >>>> numQueryThreads parameter below, I see a varying number of maximum
>> >>>> concurrent scans on that table. This maximum number varies from 1 to 3
>> >>>> (i.e., some values for numQueryThreads result in maximum concurrent scan of
>> >>>> 1, some values result in 2 concurrent scans, etc.). Can someone shed light
>> >>>> on what is the relationship between numQueryThreads and number of
>> >>>> concurrent scans?
>> >>>>
>> >>>> public BatchScanner createBatchScanner(String tableName,
>> >>>>                                        Authorizations authorizations,
>> >>>>                                        int numQueryThreads)
>> >>>>
>> >>>> A follow-on question would be what is a general rule of thumb for
>> >>>> setting
>> >>>> numQueryThreads? Should it be set to the  # of hosted tablets
>> >>>> expected to be
>> >>>> consumed by that BatchScanner? Should it be the # of tablet servers
>> >>>> expected
>> >>>> to be hit by that BatchScanner? Something else?
>> >>>>
>> >>>> Thanks,
>> >>>> Ameet
>> >>>>
>> >>>>
>> >>>
>> >>
>> >
>
>
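
The four examples quoted above (one or 1000 ranges, 1 or 10 tablets, 5 or 10 query threads) are all consistent with a simple rule of thumb: the number of concurrent scans is at most the smaller of the number of tablets the ranges touch and numQueryThreads. The sketch below only encodes that simplified reading. The class and method names are hypothetical, and the real logic in TabletServerBatchReaderIterator.doLookups() also groups work by tablet server, so the observed concurrency can be lower than this bound.

```java
// Hypothetical model of the quoted examples, not Accumulo's implementation.
final class ScanParallelismModel {

    // Upper bound on concurrent scans: limited both by how many tablets
    // the ranges map to and by the size of the client thread pool.
    static int expectedConcurrentScans(int tabletsTouched, int numQueryThreads) {
        if (tabletsTouched < 0 || numQueryThreads < 1) {
            throw new IllegalArgumentException("tabletsTouched >= 0 and numQueryThreads >= 1 required");
        }
        return Math.min(tabletsTouched, numQueryThreads);
    }
}
```

Under this model, raising numQueryThreads past the number of tablets touched buys nothing, which matches the observation that splitting the tablet was what raised the scan count.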
