accumulo-user mailing list archives

From vaibhav thapliyal <vaibhav.thapliyal...@gmail.com>
Subject Re: BatchScanner taking too much time to scan rows
Date Wed, 13 May 2015 13:14:21 GMT
The RFiles per tablet vary between 2 and 5. The BatchScanner returns 460,000
entries. The approximate average data rate is 0.5 MB/s, as seen on the Accumulo
monitor page.

A simple scan on the table has an average data rate of about 7-8 MB/s.

All of the ids exist in the Accumulo table.
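
For context, the query path looks roughly like this (a simplified sketch, not
the exact code; the Connector setup, the table name and the source of the ids
are placeholders):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

// Simplified sketch of the lookup described in this thread; "connector",
// the table name and the id list are placeholders.
public class BatchLookup {
    public static void lookup(Connector connector, List<String> ids) throws Exception {
        // one exact-row range per id (10,000 ids in this case)
        List<Range> ranges = new ArrayList<Range>(ids.size());
        for (String id : ids) {
            ranges.add(Range.exact(id));
        }

        // 20 query threads, as mentioned further down in the thread
        BatchScanner scanner = connector.createBatchScanner("mytable",
                Authorizations.EMPTY, 20);
        try {
            scanner.setRanges(ranges);
            for (Map.Entry<Key, Value> entry : scanner) {
                // process each of the ~460,000 entries returned
            }
        } finally {
            scanner.close();
        }
    }
}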

On 12 May 2015 at 23:39, Keith Turner <keith@deenlo.com> wrote:

> Do you know how much data is being brought back (e.g. 100 megabytes)? I am
> wondering what the data rate is in MB/s.  Do you know how many files per
> tablet you have?  Do most of the 10,000 ids you are querying for exist?
>
> On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal <
> vaibhav.thapliyal.91@gmail.com> wrote:
>
>> I have 194 tablets. Currently I am passing 20 threads to the
>> createBatchScanner method when creating the BatchScanner.
>> On 12-May-2015 11:19 pm, "Keith Turner" <keith@deenlo.com> wrote:
>>
>>> How many tablets do you have?  The batch scanner does not parallelize
>>> operations within a tablet.
>>>
>>> If you give the batch scanner more threads than there are tservers, it
>>> will make multiple parallel RPC calls to each tserver if the tserver has
>>> multiple tablets.  Each RPC may include multiple tablets and ranges for
>>> each tablet.
>>>
>>> If the batch scanner has fewer threads than tservers, it will make one
>>> RPC per tserver per thread.  Each RPC call will include all tablets and
>>> associated ranges for that tserver.
>>>
>>> Keith
>>>
>>>
>>>
>>> On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal <
>>> vaibhav.thapliyal.91@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am using a BatchScanner to scan rows from an Accumulo table. The table
>>>> has around 187m entries and I am using a 3-node cluster running Accumulo
>>>> 1.6.1.
>>>>
>>>> I have passed a list of 10,000 ids, which are stored as row ids in my
>>>> table, to the setRanges() method.
>>>>
>>>> This whole process takes around 50 seconds (from adding the ids to the
>>>> list to scanning the whole table using the BatchScanner).
>>>>
>>>> I tried switching on bloom filters, but that didn't help.
>>>>
>>>> Also, if anyone could briefly explain how a BatchScanner works and how
>>>> it does parallel scanning, it would help me better understand what I am
>>>> doing.
>>>>
>>>> Thanks
>>>> Vaibhav
>>>>
>>>>
>>>>
>>>
>
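
As an aside on the thread/tserver behaviour Keith describes above, one way to
see it in code is to size the BatchScanner's thread pool against the current
number of tablet servers. This is only an illustrative sketch (same imports
and connector as in the snippet earlier in this message; the thread count and
table name are placeholders, not a recommendation from this thread):

// Illustrative only: relate the BatchScanner thread count to the number of
// tablet servers that Keith's explanation refers to.
int numTservers = connector.instanceOperations().getTabletServers().size();

// More threads than tservers  -> multiple parallel RPCs per tserver, when a
//                                tserver hosts several of the queried tablets.
// Fewer threads than tservers -> one RPC per tserver per thread, carrying all
//                                of that tserver's tablets and ranges.
int queryThreads = Math.max(numTservers, 20);  // placeholder sizing

BatchScanner scanner = connector.createBatchScanner("mytable",
        Authorizations.EMPTY, queryThreads);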
