accumulo-user mailing list archives

From vaibhav thapliyal <vaibhav.thapliyal...@gmail.com>
Subject Re: BatchScanner taking too much time to scan rows
Date Wed, 13 May 2015 17:56:39 GMT
Thank you Eric. I will surely do the same. Should uneven distribution
across the tablets affect querying in Accumulo?  In this case, it does. Is
this behaviour normal?
On 13-May-2015 10:58 pm, "Eric Newton" <eric.newton@gmail.com> wrote:

> Yes, that's a great way to split the data evenly.
>
> Also, since the data set is so small, turn on data caching for your table:
>
> shell> config -t mytable -s table.cache.block.enable=true
>
> You may want to increase the size of your tserver JVM, and increase the
> size of the cache:
>
> shell> config -s tserver.cache.data.size=1G
>
> This will help with repeated random look-ups.
>
> -Eric
>
> On Wed, May 13, 2015 at 11:31 AM, vaibhav thapliyal <
> vaibhav.thapliyal.91@gmail.com> wrote:
>
>> Thank you Eric.
>>
>> One thing I would like to know. Does pre-splitting the data play a part
>> in querying accumulo?
>>
>> Because I managed to somewhat decrease the querying time.
>> I did the following steps:
>> My table was around 1.47 GB, so I explicitly set the split parameter to
>> 256 MB instead of the default 1 GB.
>>
>> So I had just 8 tablets. Now when I carried out the same query, it
>> finished in 15s.
>>
>> Is it because the split points are more evenly distributed?
>>
>> The previous table on which the query took 50s had entries unevenly
>> distributed across the tablets.
>> Thanks
>> Vaibhav
>> On 13-May-2015 7:43 pm, "Eric Newton" <eric.newton@gmail.com> wrote:
>>
>>> This use case is one of the things Accumulo was designed to handle well.
>>> It's the reason there is a BatchScanner.
>>>
>>> I've created:
>>>
>>> https://issues.apache.org/jira/browse/ACCUMULO-3813
>>>
>>> so we can investigate and track down any problems or improvements.
>>>
>>> Feel free to add any other details to the JIRA ticket.
>>>
>>> -Eric
>>>
>>>
>>> On Wed, May 13, 2015 at 10:03 AM, Emilio Lahr-Vivaz <elahrvivaz@ccri.com
>>> > wrote:
>>>
>>>>  It sounds like each of your ranges is an ID, e.g. a single row. I've
>>>> found that scanning lots of non-sequential single-row ranges is pretty slow
>>>> in accumulo. Your best approach is probably to create an index table on
>>>> whatever you are originally trying to query (assuming those 10000 ids came
>>>> from some other query).
>>>>
>>>> Thanks,
>>>>
>>>> Emilio
>>>>
>>>>
>>>> On 05/13/2015 09:14 AM, vaibhav thapliyal wrote:
>>>>
>>>>  The number of rf files per tablet varies between 2 and 5. The number of
>>>> entries returned to me by the BatchScanner is 460000. The approx. average
>>>> data rate is 0.5 MB/s as seen on the Accumulo monitor page.
>>>>
>>>>  A simple scan on the table has an average data rate of about 7-8 MB/s.
>>>>
>>>>  All the ids exist in the accumulo table.
>>>>
>>>> On 12 May 2015 at 23:39, Keith Turner <keith@deenlo.com> wrote:
>>>>
>>>>> Do you know how much data is being brought back (e.g. 100 megabytes)?
>>>>> I am wondering what the data rate is in MB/s.  Do you know how many files
>>>>> per tablet you have?  Do most of the 10,000 ids you are querying for
>>>>> exist?
>>>>>
>>>>> On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal <
>>>>> vaibhav.thapliyal.91@gmail.com> wrote:
>>>>>
>>>>>> I have 194 tablets. Currently I am using 20 threads to create the
>>>>>> batchscanner inside the createBatchScanner method.
>>>>>>  On 12-May-2015 11:19 pm, "Keith Turner" <keith@deenlo.com> wrote:
>>>>>>
>>>>>>>   How many tablets do you have?  The batch scanner does not
>>>>>>> parallelize operations within a tablet.
>>>>>>>
>>>>>>>  If you give the batch scanner more threads than there are
>>>>>>> tservers, it will make multiple parallel rpc calls to each tserver if the
>>>>>>> tserver has multiple tablets.  Each rpc may include multiple tablets and
>>>>>>> ranges for each tablet.
>>>>>>>
>>>>>>>  If the batch scanner has fewer threads than tservers, it will make
>>>>>>> one rpc per tserver per thread.  Each rpc call will include all tablets and
>>>>>>> associated ranges for that tserver.
>>>>>>>
>>>>>>>  Keith
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal <
>>>>>>> vaibhav.thapliyal.91@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>  I am using BatchScanner to scan rows from an Accumulo table. The
>>>>>>>> table has around 187m entries and I am using a 3 node cluster which has
>>>>>>>> Accumulo 1.6.1.
>>>>>>>>
>>>>>>>>  I have passed 10000 ids which are stored as row id in my table as
>>>>>>>> a list in the setRanges() method.
>>>>>>>>
>>>>>>>>  This whole process takes around 50 secs (from adding the ids in
>>>>>>>> the list to scanning the whole table using the BatchScanner).
>>>>>>>>
>>>>>>>>  I tried switching on bloom filters but that didn't work.
>>>>>>>>
>>>>>>>>  Also if anyone could briefly explain how a BatchScanner works,
>>>>>>>> how it does parallel scanning it would help me understand what I am doing
>>>>>>>> better.
>>>>>>>>
>>>>>>>>  Thanks
>>>>>>>>  Vaibhav
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>>
>>>
>
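
A note on the skew question raised in this thread: Keith points out that the
batch scanner does not parallelize operations within a tablet, so a tablet
that receives most of the lookup ranges may end up limiting the whole query.
The sketch below is plain Java with no Accumulo dependency. It only estimates
how many single-row lookups would land in each tablet for a given set of split
points. The class and method names (TabletSkew, tabletIndex, rangesPerTablet)
and the sample data are hypothetical, and it assumes ASCII row ids so that
Java string order agrees with Accumulo's byte order.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the Accumulo API.
final class TabletSkew {

    // Index of the tablet that would hold `row`, given sorted split points.
    // A split point is treated as the inclusive end row of its tablet, and
    // the last tablet holds everything after the final split point.
    static int tabletIndex(String[] sortedSplits, String row) {
        int pos = Arrays.binarySearch(sortedSplits, row);
        // Not found: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos : -(pos + 1);
    }

    // Count how many of the requested row ids fall into each tablet.
    static int[] rangesPerTablet(String[] sortedSplits, List<String> rowIds) {
        int[] counts = new int[sortedSplits.length + 1];
        for (String id : rowIds) {
            counts[tabletIndex(sortedSplits, id)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] splits = {"g", "n", "t"};              // 3 splits, 4 tablets
        List<String> ids = Arrays.asList("a1", "b2", "c3", "d4", "g", "h5", "z9");
        // Most ids sort at or before "g", so the first tablet gets most lookups.
        System.out.println(Arrays.toString(rangesPerTablet(splits, ids)));
    }
}
```

If one tablet's count is far larger than the others, adding split points
inside that tablet's range is one thing to try, in line with the smaller
split size that helped above.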

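On splitting the data evenly, as discussed above: one possible approach,
sketched here under assumptions, is to take a sorted sample of row ids and
pick split points at evenly spaced positions, so each tablet covers roughly
the same number of rows. This is plain Java with no Accumulo dependency;
EvenSplits and pickSplits are hypothetical names, and the sample is assumed to
be already sorted and representative of the table. The chosen points could
then be added to the table, for example with the shell's addsplits command.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the Accumulo API.
final class EvenSplits {

    // Pick up to (numTablets - 1) split points from a sorted list of row ids
    // so that each resulting tablet holds about the same number of rows.
    // A split point is treated as the inclusive end row of its tablet.
    static List<String> pickSplits(List<String> sortedRowIds, int numTablets) {
        if (numTablets < 1) {
            throw new IllegalArgumentException("numTablets must be >= 1");
        }
        List<String> splits = new ArrayList<>();
        int n = sortedRowIds.size();
        for (int i = 1; i < numTablets; i++) {
            // Last row that belongs to the first i of numTablets equal parts.
            int idx = (int) ((long) i * n / numTablets) - 1;
            if (idx < 0) {
                continue; // sample too small for this many tablets
            }
            String candidate = sortedRowIds.get(idx);
            // Skip duplicates so the split points stay strictly increasing.
            if (splits.isEmpty() || !splits.get(splits.size() - 1).equals(candidate)) {
                splits.add(candidate);
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList(
            "id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8");
        // Four tablets of two rows each.
        System.out.println(pickSplits(ids, 4));
    }
}
```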