accumulo-user mailing list archives

From Dylan Hutchison <dhutc...@mit.edu>
Subject Re: BatchScanner taking too much time to scan rows
Date Thu, 14 May 2015 17:33:10 GMT
I think this is the same issue I found for ACCUMULO-3710
<https://issues.apache.org/jira/browse/ACCUMULO-3710>, only in my case the
tserver ran out of memory.  Accumulo doesn't handle large numbers of small,
disjoint ranges well.  I bet there's room for improvement on both the
client and tablet server.
~Dylan

On Wed, May 13, 2015 at 3:13 PM, Eric Newton <eric.newton@gmail.com> wrote:

> Yes, hot-spotting does affect accumulo because you have fewer servers and
> caches handling your request.
>
> Let's say your data is spread out evenly over the rows "0".."9".
>
> What if you have only 1 split?  You would want it at "5", to divide the
> data in half, and you could host the halves on different servers.  But if
> you split at "1", now 10% of your queries go to one tablet, and 90% go to
> the other.
>
> -Eric
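
The split arithmetic Eric describes can be checked with a few lines of Java. This is only a sketch under stated assumptions: the row IDs are hypothetical zero-padded strings "0000".."9999", evenly distributed, and rows are compared with String.compareTo, which for ASCII digits orders them the same way as a byte-wise comparison. A row belongs to the first tablet when it sorts at or before the split point. The class and method names are made up for illustration.

```java
class SplitLoad {
    // Fraction of the rows "0000".."9999" that sort at or before the split
    // point, i.e. the share of evenly distributed lookups hitting tablet 1.
    static double shareOfFirstTablet(String split) {
        int total = 10000;
        int first = 0;
        for (int i = 0; i < total; i++) {
            String row = String.format("%04d", i);
            if (row.compareTo(split) <= 0) {
                first++;
            }
        }
        return (double) first / total;
    }

    public static void main(String[] args) {
        for (String split : new String[] {"5", "1"}) {
            double first = shareOfFirstTablet(split);
            System.out.println("split at " + split + ": first tablet " + first
                    + ", second tablet " + (1 - first));
        }
    }
}
```

With real data the row distribution is rarely this even, so the same count would have to be run over a sample of the actual row IDs.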
>
>
> On Wed, May 13, 2015 at 1:56 PM, vaibhav thapliyal <
> vaibhav.thapliyal.91@gmail.com> wrote:
>
>> Thank you Eric. I will surely do the same. Should uneven distribution
>> across the tablets affect querying in accumulo?  In this case, it does. Is
>> this behaviour normal?
>> On 13-May-2015 10:58 pm, "Eric Newton" <eric.newton@gmail.com> wrote:
>>
>>> Yes, that's a great way to split the data evenly.
>>>
>>> Also, since the data set is so small, turn on data caching for your
>>> table:
>>>
>>> shell> config -t mytable -s table.cache.block.enable=true
>>>
>>> You may want to increase the size of your tserver JVM, and increase the
>>> size of the cache:
>>>
>>> shell> config -s tserver.cache.data.size=1G
>>>
>>> This will help with repeated random look-ups.
>>>
>>> -Eric
>>>
>>> On Wed, May 13, 2015 at 11:31 AM, vaibhav thapliyal <
>>> vaibhav.thapliyal.91@gmail.com> wrote:
>>>
>>>> Thank you Eric.
>>>>
>>>> One thing I would like to know. Does pre-splitting the data play a part
>>>> in querying accumulo?
>>>>
>>>> Because I managed to somewhat decrease the querying time.
>>>> I did the following steps:
>>>> My table was around 1.47 GB so I explicitly set the split parameter to
>>>> 256 MB instead of the default 1 GB.
>>>>
>>>> So I had just 8 tablets. Now when I carried out the same query, it
>>>> finished in 15s.
>>>>
>>>> Is it because the split points are more evenly distributed?
>>>>
>>>> The previous table on which the query took 50s had entries unevenly
>>>> distributed across the tablets.
>>>> Thanks
>>>> Vaibhav
>>>> On 13-May-2015 7:43 pm, "Eric Newton" <eric.newton@gmail.com> wrote:
>>>>
>>>>> This use case is one of the things Accumulo was designed to handle
>>>>> well. It's the reason there is a BatchScanner.
>>>>>
>>>>> I've created:
>>>>>
>>>>> https://issues.apache.org/jira/browse/ACCUMULO-3813
>>>>>
>>>>> so we can investigate and track down any problems or improvements.
>>>>>
>>>>> Feel free to add any other details to the JIRA ticket.
>>>>>
>>>>> -Eric
>>>>>
>>>>>
>>>>> On Wed, May 13, 2015 at 10:03 AM, Emilio Lahr-Vivaz <
>>>>> elahrvivaz@ccri.com> wrote:
>>>>>
>>>>>> It sounds like each of your ranges is an ID, e.g. a single row. I've
>>>>>> found that scanning lots of non-sequential single-row ranges is pretty
>>>>>> slow in accumulo. Your best approach is probably to create an index
>>>>>> table on whatever you are originally trying to query (assuming those
>>>>>> 10000 ids came from some other query).
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Emilio
>>>>>>
>>>>>>
>>>>>> On 05/13/2015 09:14 AM, vaibhav thapliyal wrote:
>>>>>>
>>>>>> The number of rf files per tablet varies between 2 and 5. The number
>>>>>> of entries returned to me by the BatchScanner is 460000. The approx.
>>>>>> average data rate is 0.5 MB/s as seen on the accumulo monitor page.
>>>>>>
>>>>>>  A simple scan on the table has an average data rate of about 7-8
>>>>>> MB/s.
>>>>>>
>>>>>>  All the ids exist in the accumulo table.
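
The figures reported in this thread can be turned into per-lookup rates with some simple arithmetic. The sketch below assumes that the 50 s runtime, the 0.5 MB/s rate, and the 460000 returned entries all describe the same query of 10000 ids. Since the 50 s also includes building the id list, the derived rates are rough estimates only. The class and method names are hypothetical.

```java
class ScanRate {
    // Average number of entries returned for each requested row id.
    static double entriesPerId(long entries, long ids) {
        return (double) entries / ids;
    }

    // Row ids resolved per second of wall-clock time.
    static double idsPerSecond(long ids, double seconds) {
        return ids / seconds;
    }

    // Approximate volume returned, from the average rate and the runtime.
    static double megabytesReturned(double mbPerSecond, double seconds) {
        return mbPerSecond * seconds;
    }

    public static void main(String[] args) {
        long entries = 460000; // entries returned by the BatchScanner
        long ids = 10000;      // row ids passed to setRanges()
        double seconds = 50;   // reported total time (assumed to be the same run)
        double rate = 0.5;     // MB/s shown on the monitor page

        System.out.println("entries per id: " + entriesPerId(entries, ids));
        System.out.println("ids per second: " + idsPerSecond(ids, seconds));
        System.out.println("approx. MB returned: " + megabytesReturned(rate, seconds));
    }
}
```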
>>>>>>
>>>>>> On 12 May 2015 at 23:39, Keith Turner <keith@deenlo.com> wrote:
>>>>>>
>>>>>>> Do you know how much data is being brought back (e.g. 100
>>>>>>> megabytes)? I am wondering what the data rate is in MB/s.  Do you
>>>>>>> know how many files per tablet you have?  Do most of the 10,000 ids
>>>>>>> you are querying for exist?
>>>>>>>
>>>>>>> On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal <
>>>>>>> vaibhav.thapliyal.91@gmail.com> wrote:
>>>>>>>
>>>>>>>> I have 194 tablets. Currently I am using 20 threads to create the
>>>>>>>> batchscanner inside the createBatchScanner method.
>>>>>>>>  On 12-May-2015 11:19 pm, "Keith Turner" <keith@deenlo.com> wrote:
>>>>>>>>
>>>>>>>>> How many tablets do you have?  The batch scanner does not
>>>>>>>>> parallelize operations within a tablet.
>>>>>>>>>
>>>>>>>>> If you give the batch scanner more threads than there are tservers,
>>>>>>>>> it will make multiple parallel rpc calls to each tserver if the
>>>>>>>>> tserver has multiple tablets.  Each rpc may include multiple tablets
>>>>>>>>> and ranges for each tablet.
>>>>>>>>>
>>>>>>>>> If the batch scanner has fewer threads than tservers, it will make
>>>>>>>>> one rpc per tserver per thread.  Each rpc call will include all
>>>>>>>>> tablets and associated ranges for that tserver.
>>>>>>>>>
>>>>>>>>>  Keith
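
Keith's description of how threads map to rpc calls can be turned into a small model. This is a sketch, not the BatchScanner implementation. The chunking rule (total tablets divided by the thread count gives the maximum number of tablets per rpc) is an assumption made for illustration, as are the class name, the method name, and the even spread of the 194 tablets over 3 tservers.

```java
import java.util.Arrays;

class BatchScanRpcModel {
    // Estimates how many rpc calls each tserver receives.
    // tabletsPerServer[i] is the number of tablets on tserver i that contain
    // at least one requested range.
    static int[] rpcsPerServer(int[] tabletsPerServer, int threads) {
        int servers = tabletsPerServer.length;
        int[] rpcs = new int[servers];
        if (threads <= servers) {
            // Few threads: one rpc per tserver, carrying all of its tablets.
            for (int i = 0; i < servers; i++) {
                rpcs[i] = tabletsPerServer[i] > 0 ? 1 : 0;
            }
            return rpcs;
        }
        // Assumed rule: spread the tablets evenly over the available threads.
        int totalTablets = Arrays.stream(tabletsPerServer).sum();
        int maxTabletsPerRpc = Math.max(1, totalTablets / threads);
        for (int i = 0; i < servers; i++) {
            // Ceiling division: chunks needed to cover this tserver's tablets.
            rpcs[i] = (tabletsPerServer[i] + maxTabletsPerRpc - 1) / maxTabletsPerRpc;
        }
        return rpcs;
    }

    public static void main(String[] args) {
        int[] tablets = {65, 65, 64}; // 194 tablets, even spread assumed
        System.out.println(Arrays.toString(rpcsPerServer(tablets, 20)));
        System.out.println(Arrays.toString(rpcsPerServer(tablets, 2)));
    }
}
```

In this model, more threads only help while a tserver has more tablets with ranges than one rpc carries, which matches the point that work within a single tablet is not parallelized.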
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal <
>>>>>>>>> vaibhav.thapliyal.91@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I am using BatchScanner to scan rows from an accumulo table. The
>>>>>>>>>> table has around 187m entries and I am using a 3 node cluster which
>>>>>>>>>> has accumulo 1.6.1.
>>>>>>>>>>
>>>>>>>>>> I have passed 10000 ids, which are stored as row ids in my table,
>>>>>>>>>> as a list in the setRanges() method.
>>>>>>>>>>
>>>>>>>>>> This whole process takes around 50 secs (from adding the ids to
>>>>>>>>>> the list to scanning the whole table using the BatchScanner).
>>>>>>>>>>
>>>>>>>>>> I tried switching on bloom filters but that didn't work.
>>>>>>>>>>
>>>>>>>>>> Also, if anyone could briefly explain how a BatchScanner works and
>>>>>>>>>> how it does parallel scanning, it would help me understand what I
>>>>>>>>>> am doing better.
>>>>>>>>>>
>>>>>>>>>>  Thanks
>>>>>>>>>>  Vaibhav
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>
