accumulo-user mailing list archives

From Keith Turner <>
Subject Re: Accumulo Utilities
Date Thu, 28 Mar 2013 17:15:26 GMT
On Thu, Mar 28, 2013 at 12:15 PM,  <> wrote:
> Thanks! I like the idea of sending my own thread pool to the batch scanner, that would
> definitely be the better solution.

Would you like to open a ticket about this issue?

I just remembered, there is an issue w/ this approach to be aware of.
I have seen this when multiple threads share a batch scanner (more
on this below).  Consider the following situation.

 1. Thread A gives a lot of work to BatchScanner1 using ThreadPool1,
creating BatchScannerIterator1
 2. BatchScannerIterator1's internal queue fills up as a result of the
work given by Thread A
 3. All threads in ThreadPool1 block trying to add to
BatchScannerIterator1's queue
 4. Thread B gives a lot of work to BatchScanner2 using ThreadPool1,
creating BatchScannerIterator2
 5. Thread B attempts to iterate over BatchScannerIterator2, but
blocks forever because no threads service it

This problem occurs because Thread A never reads from
BatchScannerIterator1, so the threads in ThreadPool1 stay blocked.
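
The situation above can be reproduced without Accumulo.  Here is a
minimal sketch using plain java.util.concurrent.  The class and method
names are made up for illustration, the small bounded queues only stand
in for each iterator's internal queue, and the timeout exists just so
the demo terminates instead of blocking forever.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of Accumulo.
class SharedPoolStarvation {

    /**
     * Returns true if "Thread B" receives a result within timeoutMillis,
     * false if its work is never serviced.
     */
    static boolean secondConsumerServed(boolean sharePool, long timeoutMillis) {
        int threads = 2;
        ExecutorService pool1 = Executors.newFixedThreadPool(threads);
        ExecutorService pool2 = sharePool ? pool1 : Executors.newFixedThreadPool(threads);
        // Bounded queues stand in for each iterator's internal result queue.
        BlockingQueue<Integer> queue1 = new ArrayBlockingQueue<>(1);
        BlockingQueue<Integer> queue2 = new ArrayBlockingQueue<>(1);
        try {
            // Steps 1-3: Thread A queues work but never reads queue1, so
            // every pool1 thread ends up blocked inside put().
            for (int i = 0; i < threads; i++) {
                pool1.submit(() -> { queue1.put(1); queue1.put(2); return null; });
            }
            // Step 4: Thread B queues work for its own iterator.
            pool2.submit(() -> { queue2.put(42); return null; });
            // Step 5: Thread B waits for a result (bounded wait for the demo).
            return queue2.poll(timeoutMillis, TimeUnit.MILLISECONDS) != null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            // Interrupts the blocked producers so the JVM can exit.
            pool1.shutdownNow();
            pool2.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("shared pool served B:   " + secondConsumerServed(true, 500));
        System.out.println("separate pool served B: " + secondConsumerServed(false, 500));
    }
}
```

With a shared pool, Thread B's task sits in the executor's queue behind
workers that can never finish.  With its own pool it is serviced right
away.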

In the current code, multiple threads can use a BatchScanner.  You
just need to make configuring the BatchScanner and getting an iterator
an atomic operation.  When an iterator is created by a batch scanner,
it copies the config that exists at that point in time.  Changes to the
BatchScanner config after an iterator is created will not affect that
iterator.
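
One way to make configure-then-iterate atomic is to hold a lock on the
shared scanner across both calls.  The sketch below is only an
illustration under that assumption.  FakeBatchScanner is a made-up
stand-in (not the Accumulo class) that mimics the copy-on-iterator
behavior described above, and plain strings stand in for ranges.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;

// Hypothetical demo class, not part of Accumulo.
class AtomicConfigSketch {

    // Stand-in for a batch scanner: iterator() copies the current config.
    static class FakeBatchScanner {
        private List<String> ranges = new ArrayList<>();

        void setRanges(Collection<String> newRanges) {
            ranges = new ArrayList<>(newRanges);
        }

        Iterator<String> iterator() {
            // Snapshot of the config at this point in time.
            return new ArrayList<>(ranges).iterator();
        }
    }

    // Configure and create the iterator as one step, so another thread
    // cannot change the ranges in between the two calls.
    static Iterator<String> configureAndIterate(FakeBatchScanner shared,
                                                Collection<String> ranges) {
        synchronized (shared) {
            shared.setRanges(ranges);
            return shared.iterator();
        }
    }

    public static void main(String[] args) {
        FakeBatchScanner shared = new FakeBatchScanner();
        Iterator<String> first = configureAndIterate(shared, Arrays.asList("r1", "r2"));
        // A later reconfiguration does not affect the iterator created above.
        Iterator<String> second = configureAndIterate(shared, Arrays.asList("r3"));
        while (first.hasNext()) {
            System.out.println("first:  " + first.next());
        }
        while (second.hasNext()) {
            System.out.println("second: " + second.next());
        }
    }
}
```

Every thread sharing the scanner would need to go through the same
lock for this to work.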

> Yeah I thought about creating a batch scanner with only one thread, but I was not sure
> if that is making a separate thread (outside of the current one) or using the current one.
> At the time I did not want a new thread to be created at all. Though, didn't realize the Scanner
> was also spinning up a thread at all, thought that was in process.

The batch scanner will create a new thread pool w/ one thread.

> To mitigate the separate RPC call per range, would it make more sense to do a "binRanges"
> based on the ranges at the tablets to reduce the number of ranges?

You probably do not want to combine ranges; that could bring back data
in the gaps between them.
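
To make the gap problem concrete, here is a small sketch where a
TreeMap stands in for a sorted table.  The row names and the scan
helper are invented for the example, and ranges are treated as
inclusive on both ends.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical demo class, not part of Accumulo.
class RangeGapSketch {

    // Returns the row keys in [start, end], both ends inclusive.
    static List<String> scan(NavigableMap<String, String> table, String start, String end) {
        return new ArrayList<>(table.subMap(start, true, end, true).keySet());
    }

    // A tiny sorted "table" with rows a through f.
    static NavigableMap<String, String> sampleTable() {
        NavigableMap<String, String> table = new TreeMap<>();
        for (String row : new String[] {"a", "b", "c", "d", "e", "f"}) {
            table.put(row, "value-" + row);
        }
        return table;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = sampleTable();

        // Two separate ranges: [a, b] and [e, f].
        List<String> separate = new ArrayList<>(scan(table, "a", "b"));
        separate.addAll(scan(table, "e", "f"));

        // One combined range [a, f] also returns rows c and d from the gap.
        List<String> combined = scan(table, "a", "f");

        System.out.println("separate ranges: " + separate);
        System.out.println("combined range:  " + combined);
    }
}
```

The combined range saves a lookup but returns rows c and d, which
neither original range asked for.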

> On Mar 28, 2013, at 11:55 AM, Keith Turner <> wrote:
>> I took a quick look at the code. Excluding the threading issue, a
>> major conceptual difference is that BatchScannerWithScanners seems to
>> do an RPC round trip for each range.   The TabletServerBatchReader
>> sends all of the ranges that a tablet server needs to look up in one
>> RPC.
>> Instead of creating a BatchScannerWithScanners, maybe you could create
>> a batch scanner with just one thread when resources are exceeded?
>> This will be similar to what you are doing now, just one thread will
>> be doing work fetching data.  The client thread would just be waiting
>> on this background thread.   Although this does allow the processing
>> of results to happen concurrently with fetching of data.  Using
>> BatchScannerWithScanners would not allow this.
>> Something to be aware of, the regular scanner will spin up a read
>> ahead thread if you read a lot of data through it.  It does not do
>> this immediately, only after fetching a few batches of key value pairs
>> from the tablet server.  If this happens you could have one thread
>> fetching data while the client thread processes results.
>> Do you think we should open a ticket about giving users control over
>> threads created by client code?    Maybe users could pass in their own
>> thread pool to a batch scanner?
>> Keith
>> On Thu, Mar 28, 2013 at 11:00 AM,  <> wrote:
>>> In some of my projects, we needed to control the number of threads spun up with
>>> the use of multiple batch scanners. We created a utility to control the number of threads,
>>> and if the max threads has been reached, return a batch scanner that is actually backed by
>>> Scanners. Wanted to get any feedback on the code. Seems like such a simple thing to do, I
>>> bet someone already has this. Thanks!
