accumulo-user mailing list archives

Subject Re: Accumulo Utilities
Date Thu, 28 Mar 2013 18:46:57 GMT
So there are two use cases:

1. We have a lot of users querying data, and each query opens a new BatchScanner.
As we scale to more simultaneous users, we would have a lot of contention for threads
if we kept a constant number of threads per BatchScanner.

2. In the Rya work, we do interesting joins, and as we add more layers to the joins we create
an order of magnitude more batch scanners. Smart on my part… I am trying to control the number
of threads created here as well.

Btw, the explanation of how the BatchScanner works makes perfect sense. It really is a lot smarter
about how it queries based on tablets at the tablet servers. I have to check out the code in more
detail, especially the part about tablets splitting while we scan. That sounds interesting.

Thanks Keith!


On Mar 28, 2013, at 2:32 PM, Keith Turner <> wrote:

> On Thu, Mar 28, 2013 at 2:00 PM,  <> wrote:
>> Yeah, that is why in the ThreadPoolConnector, I did not want to ever block. If the
>> pool is exhausted, then just make a different kind of BatchScanner that doesn't spawn new
>> threads. Once the BatchScanner is closed, release the threads. I can probably make a
>> ThreadPool implementation that does that: it returns only 1 thread if the pool is exhausted
>> and never blocks.
>> I did not want to spin up a new thread at all once the pool is exhausted, but from
>> what you are saying it is really ok to have a new thread. Instead of increasing the threads
>> used by 10+ with each batch scanner, I would just be increasing by 1, which isn't so bad.
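A minimal sketch of the "never block, fall back to one thread" idea described above, assuming the thread budget is tracked with a semaphore. `ThreadBudget`, `Grant`, and their parameters are hypothetical names made up for illustration; they are not part of Accumulo or of the ThreadPoolConnector code discussed in this thread.

```java
import java.util.concurrent.Semaphore;

// Hypothetical helper, not part of Accumulo: hands out a thread count for a
// new batch scanner without ever blocking the caller.
class ThreadBudget {

    // Result of a request: how many threads to use, and whether they were
    // charged against the shared budget (and so must be given back).
    static final class Grant {
        final int threads;
        final boolean fromBudget;

        Grant(int threads, boolean fromBudget) {
            this.threads = threads;
            this.fromBudget = fromBudget;
        }
    }

    private final Semaphore permits;
    private final int threadsPerScanner;

    ThreadBudget(int maxThreads, int threadsPerScanner) {
        this.permits = new Semaphore(maxThreads);
        this.threadsPerScanner = threadsPerScanner;
    }

    // tryAcquire returns immediately: either the full allotment is available,
    // or the caller falls back to a single thread outside the budget.
    Grant request() {
        if (permits.tryAcquire(threadsPerScanner)) {
            return new Grant(threadsPerScanner, true);
        }
        return new Grant(1, false);
    }

    // Call when the batch scanner is closed; only budgeted threads are returned.
    void release(Grant grant) {
        if (grant.fromBudget) {
            permits.release(grant.threads);
        }
    }
}
```

Under this sketch, the granted count would be what gets passed as the number of query threads when the batch scanner is created, and `release` would be called when that scanner is closed.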
> I am curious about the problem you are trying to solve.  Do you have
> too many active threads and that's causing thrashing?  Or do you end up
> with a lot of inactive threads eating up memory?
>> For binning of ranges, would it make more sense to add a server side iterator to
>> make sure the gaps do not come back? So it might go like this:
>> ranges = 1-2, 5-6, 7-8
>> Tablet servers Ranges: T1: 1-4, T2: 5-10
>> The ranges actually searched will be T1: 1-2, and T2: 5-8 (with a server side iterator
>> removing the ranges not included)
> It would probably be T1:1-2 and T2:5-6,7-8.   I assume T1 and T2
> represent tablets, and not tablet servers?
> Adding a server side iterator to a scanner that accepts a list of
> ranges would make it more like the batch scanner.  One difference is
> that you would need to do a scan per tablet (which is certainly better
> than a scan per range), passing each tablet the list of ranges that
> pertain to it.   The batch scanner sends all ranges for all tablets in
> one shot to a tablet server, so the batch scanner conceptually does a
> scan per tablet server (better than a scan per tablet).  The scanner
> will never operate on more than one tablet at a time.    You would need
> to properly handle tablets splitting while you are scanning.
> The batch scanner also tracks which ranges are finished as it gets
> results back.  This keeps it from having to redo work in the case
> where a tablet moves (because of migration, split, or tablet server
> failure).
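The binning in the example above can be sketched as follows. This is a toy model that assumes integer keys and closed intervals; `RangeBinning` and `binRanges` here are hypothetical illustration names, not the Accumulo client code. Each requested range is assigned, unmerged, to every tablet it overlaps, so the gaps between ranges are never scanned.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model, not Accumulo code: tablets and ranges are closed int intervals
// stored as {start, end}.
class RangeBinning {

    // Groups the requested ranges by the tablet(s) they overlap. Ranges are
    // kept separate rather than combined, so data in the gaps is not fetched.
    static Map<String, List<String>> binRanges(Map<String, int[]> tablets,
                                               List<int[]> ranges) {
        Map<String, List<String>> binned = new LinkedHashMap<>();
        for (Map.Entry<String, int[]> tablet : tablets.entrySet()) {
            int tabletStart = tablet.getValue()[0];
            int tabletEnd = tablet.getValue()[1];
            for (int[] range : ranges) {
                boolean overlaps = range[0] <= tabletEnd && range[1] >= tabletStart;
                if (overlaps) {
                    binned.computeIfAbsent(tablet.getKey(), k -> new ArrayList<>())
                          .add(range[0] + "-" + range[1]);
                }
            }
        }
        return binned;
    }
}
```

With tablets T1: 1-4 and T2: 5-10 and ranges 1-2, 5-6, 7-8, this yields T1 with 1-2 and T2 with 5-6 and 7-8, matching the reply above.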
>> What about the BatchScanner? Doesn't it also binRanges, and then tell each tablet
>> server that it only cares about a subset of ranges? That way you only have your number of
>> ranges maxed at the number of tablet servers that have the ranges you asked for. Then each
>> tablet server knows exactly which ranges to return?
> I think I answered this question above.
>> Feel free to ignore the myriad of questions; it is interesting learning the inner
>> workings of the BatchScanner and Scanner.
>> Roshan
>> On Mar 28, 2013, at 1:15 PM, Keith Turner <> wrote:
>>> On Thu, Mar 28, 2013 at 12:15 PM,  <> wrote:
>>>> Thanks! I like the idea of sending my own thread pool to the batch scanner,
>>>> that would definitely be the better solution.
>>> Would you like to open a ticket about this issue?
>>> I just remembered, there is an issue w/ this approach to be aware of.
>>> I have seen this when multiple threads share a batch scanner (more
>>> on this below).  Consider the following situation.
>>> 1. Thread A gives a lot of work to BatchScanner1 using ThreadPool1,
>>> creating BatchScannerIterator1
>>> 2. BatchScannerIterator1's internal queue fills up as a result of work
>>> given by Thread A
>>> 3. All threads in ThreadPool1 block trying to add to
>>> BatchScannerIterator1's queue
>>> 4. Thread B gives a lot of work to BatchScanner2 using ThreadPool1,
>>> creating BatchScannerIterator2
>>> 5. Thread B attempts to iterate over BatchScannerIterator2, but
>>> blocks forever because no threads service it
>>> This problem occurs because Thread A never reads from BatchScannerIterator1.
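The five steps above can be reproduced with plain `java.util.concurrent` classes. This is only a sketch under stated assumptions: the two bounded queues stand in for the iterators' internal result queues, `fetchInto` is a hypothetical stand-in for a batch scanner's fetch work, and the pool size, queue capacity, and timeouts are arbitrary. No Accumulo code is involved.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class SharedPoolStarvation {

    // Hypothetical stand-in for fetch work: push results into one iterator's
    // bounded queue, blocking whenever that queue is full.
    static Runnable fetchInto(BlockingQueue<String> resultQueue, int count) {
        return () -> {
            try {
                for (int i = 0; i < count; i++) {
                    resultQueue.put("entry-" + i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // pool is shutting down
            }
        };
    }

    // Returns {starvedWhileQueue1Unread, servedAfterQueue1Drained}.
    static boolean[] demo() throws InterruptedException {
        ExecutorService sharedPool = Executors.newFixedThreadPool(2);
        BlockingQueue<String> queue1 = new ArrayBlockingQueue<>(1);
        BlockingQueue<String> queue2 = new ArrayBlockingQueue<>(1);
        try {
            // Steps 1-3: Thread A's work occupies both pool threads, which
            // block on the full queue1 because nobody is reading it.
            sharedPool.execute(fetchInto(queue1, 3));
            sharedPool.execute(fetchInto(queue1, 3));
            // Step 4: Thread B's work is accepted but no pool thread is free.
            sharedPool.execute(fetchInto(queue2, 1));
            // Step 5: Thread B waits for results; a timeout is used here so
            // the demo does not hang forever.
            boolean starved = queue2.poll(300, TimeUnit.MILLISECONDS) == null;
            // Once Thread A reads its six results, the pool threads unblock
            // and Thread B's work finally runs.
            for (int i = 0; i < 6; i++) {
                queue1.take();
            }
            boolean served = queue2.poll(5, TimeUnit.SECONDS) != null;
            return new boolean[] {starved, served};
        } finally {
            sharedPool.shutdownNow();
        }
    }
}
```

In this sketch the second scanner gets no results for as long as the first queue goes unread, and is served as soon as the first queue is drained.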
>>> In the current code, multiple threads can use a BatchScanner.  You
>>> just need to make configuring the BatchScanner and getting an iterator
>>> an atomic operation.   When an iterator is created by a batch scanner,
>>> it copies the config that exists at that point in time.  Changes to the
>>> BatchScanner config after an iterator is created will not affect the
>>> iterator.
>>>> Yeah I thought about creating a batch scanner with only one thread, but I
>>>> was not sure if that makes a separate thread (outside of the current one) or uses the
>>>> current one. At the time I did not want a new thread to be created at all. Though, I didn't
>>>> realize the Scanner was also spinning up a thread; I thought that was in process.
>>> The batch scanner will create a new thread pool w/ one thread.
>>>> To mitigate the separate RPC call per range, would it make more sense to
>>>> do a "binRanges" based on the ranges at the tablets to reduce the number of ranges?
>>> Probably do not want to combine ranges, that could bring back data in
>>> the gaps between ranges.
>>>> On Mar 28, 2013, at 11:55 AM, Keith Turner <> wrote:
>>>>> I took a quick look at the code. Excluding the threading issue, a
>>>>> major conceptual difference is that BatchScannerWithScanners seems to
>>>>> do an RPC round trip for each range.   The TabletServerBatchReader
>>>>> sends all of the ranges that a tablet server needs to look up in one
>>>>> RPC.
>>>>> Instead of creating a BatchScannerWithScanners, maybe you could create
>>>>> a batch scanner with just one thread when resources are exceeded?
>>>>> This will be similar to what you are doing now, just one thread will
>>>>> be doing work fetching data.  The client thread would just be waiting
>>>>> on this background thread.   Although this does allow the processing
>>>>> of results to happen concurrently with fetching of data.  Using
>>>>> BatchScannerWithScanners would not allow this.
>>>>> Something to be aware of, the regular scanner will spin up a read
>>>>> ahead thread if you read a lot of data through it.  It does not do
>>>>> this immediately, only after fetching a few batches of key value pairs
>>>>> from the tablet server.  If this happens you could have one thread
>>>>> fetching data while the client thread processes results.
>>>>> Do you think we should open a ticket about giving users control over
>>>>> threads created by client code?    Maybe users could pass in their own
>>>>> thread pool to a batch scanner?
>>>>> Keith
>>>>> On Thu, Mar 28, 2013 at 11:00 AM,  <> wrote:
>>>>>> In some of my projects, we needed to control the number of threads
>>>>>> spun up with the use of multiple batch scanners. We created a utility to control the number
>>>>>> of threads, and if the max threads has been reached, it returns a batch scanner that is actually
>>>>>> backed by Scanners. I wanted to get any feedback on the code. It seems like such a simple thing
>>>>>> to do, I bet someone already has this. Thanks!
