accumulo-user mailing list archives

From "Slater, David M." <David.Sla...@jhuapl.edu>
Subject RE: Improving Batchscanner Performance
Date Tue, 20 May 2014 17:51:33 GMT
Hi Josh,

I should have clarified - I am using a batchscanner for both lookups. I had thought of putting
it into two different threads, but the first scan is typically an order of magnitude faster
than the second.

The logic for upperbounding the results returned is outside of the method I provided. Since
there is a one-to-one relationship between rowIDs and records on the second scan, I just limit
the number of rows I send to this method. 
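
For illustration only, limiting of this kind could be done by splitting the rowIDs into fixed-size chunks and passing one chunk at a time to the lookup method. This is a minimal sketch, not the actual calling code: `RowChunker`, `chunk`, and `maxPerChunk` are hypothetical names, and plain strings stand in for Text rowIDs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical helper (not part of Accumulo): splits row IDs into chunks of at
// most maxPerChunk elements so each lookup call handles a bounded number of rows.
class RowChunker {
    static <T> List<List<T>> chunk(Collection<T> rows, int maxPerChunk) {
        if (maxPerChunk <= 0) {
            throw new IllegalArgumentException("maxPerChunk must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        List<T> current = new ArrayList<>();
        for (T row : rows) {
            current.add(row);
            if (current.size() == maxPerChunk) {
                chunks.add(current);          // chunk is full, start a new one
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            chunks.add(current);              // final, possibly shorter chunk
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Strings stand in for the Text rowIDs returned by the index scan.
        List<String> rows = Arrays.asList("r1", "r2", "r3", "r4", "r5");
        for (List<String> c : chunk(rows, 2)) {
            System.out.println(c);
        }
    }
}
```

Each chunk would then be handed to the data lookup separately, so the size of any single result list is bounded by the chunk size.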

As for blocking, I'm not sure exactly what you mean. I complete the first scan in its entirety
before entering this method with the collection of Text rowIDs. The method for the first scan
is:

public Collection<Text> getRowIDs(Collection<Range> ranges, Text term, String tablename,
        int queryThreads, int limit) throws TableNotFoundException {
        Set<Text> guids = new HashSet<Text>();
        if (!ranges.isEmpty()) {
            BatchScanner scanner = conn.createBatchScanner(tablename, new Authorizations(),
                    queryThreads);
            try {
                scanner.setRanges(ranges);
                scanner.fetchColumnFamily(term);
                for (Map.Entry<Key, Value> entry : scanner) {
                    // the rowID of the matching record is read from the column qualifier
                    guids.add(entry.getKey().getColumnQualifier());
                    if (guids.size() > limit) {
                        return null; // over the limit: give up and signal it with null
                    }
                }
            } finally {
                scanner.close(); // also runs on the early return above
            }
        }
        return guids;
    }

Essentially, my query does:
Collection<Text> rows = getRowIDs(Collections.singleton(new Range("minRow", "maxRow")),
        new Text("index"), "mytable", 10, 10000);
Collection<byte[]> data = getRowData(rows, "mytable", 10);


-----Original Message-----
From: Josh Elser [mailto:josh.elser@gmail.com] 
Sent: Tuesday, May 20, 2014 1:32 PM
To: user@accumulo.apache.org
Subject: Re: Improving Batchscanner Performance

Hi David,

Absolutely. What you have here is a classic producer-consumer model.
Your BatchScanner produces results, which your Scanner then consumes, and you ultimately
return those results to the client.

The problem with your implementation below is that you're not going to be polling your BatchScanner
as aggressively as you could be. You block while the Scanner fetches each of those new Ranges
before going back to the BatchScanner for more. Have you considered splitting up the BatchScanner
and Scanner code into two different threads?

You could easily use an ArrayBlockingQueue (or similar) to pass results from the BatchScanner
to the Scanner. I would imagine that this would give you a fair improvement in performance.
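
A minimal sketch of that hand-off, under stated assumptions: an `Iterable<String>` stands in for the results of the first scan, a `Function<String, String>` stands in for the second lookup, and `PipelinedLookup` and `run` are hypothetical names. No Accumulo calls are shown; only the queueing pattern is.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

// Hypothetical sketch: a producer thread feeds row IDs into a bounded queue
// while the calling thread consumes them and performs the second lookup.
class PipelinedLookup {
    static List<String> run(Iterable<String> firstScan, int queueCapacity,
                            Function<String, String> secondLookup) {
        BlockingQueue<Optional<String>> queue = new ArrayBlockingQueue<>(queueCapacity);

        Thread producer = new Thread(() -> {
            try {
                try {
                    for (String rowId : firstScan) {
                        queue.put(Optional.of(rowId)); // blocks while the queue is full
                    }
                } finally {
                    queue.put(Optional.empty());       // end-of-stream marker
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.setDaemon(true); // do not keep the JVM alive if the consumer fails
        producer.start();

        List<String> results = new ArrayList<>();
        try {
            while (true) {
                Optional<String> next = queue.take();  // blocks while the queue is empty
                if (!next.isPresent()) {
                    break;                             // producer has finished
                }
                results.add(secondLookup.apply(next.get()));
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for results", e);
        }
        return results;
    }

    public static void main(String[] args) {
        // Stand-ins: a fixed list for the first scan, string concatenation for the second lookup.
        List<String> out = run(Arrays.asList("row1", "row2", "row3"), 2, id -> "data-for-" + id);
        System.out.println(out);
    }
}
```

The queue capacity bounds how many row IDs sit in memory between the two stages. In a real client the consumer would probably gather several row IDs (for example with `drainTo`) before issuing each lookup, rather than looking them up one at a time.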

Also, it doesn't appear that there's a reason you can't use a BatchScanner for both lookups?

One final warning: your current implementation could also hog heap very badly if your BatchScanner
returns too many records. The producer/consumer I proposed should help here a little bit,
but you should still be enforcing upper bounds to avoid running out of heap space in your
client.
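
As a rough sketch of such an upper bound, assuming it is acceptable to fail the query rather than truncate it: the helper below collects from any `Iterable` and throws once a cap is exceeded. `BoundedCollector`, `collect`, and `maxResults` are hypothetical names, not Accumulo API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: collects results but refuses to hold more than
// maxResults of them in memory.
class BoundedCollector {
    static <T> List<T> collect(Iterable<T> source, int maxResults) {
        List<T> out = new ArrayList<>();
        for (T item : source) {
            if (out.size() >= maxResults) {
                // Fail fast instead of letting the list grow without limit.
                throw new IllegalStateException("more than " + maxResults + " results");
            }
            out.add(item);
        }
        return out;
    }

    public static void main(String[] args) {
        // A fixed list stands in for scanner results.
        System.out.println(collect(Arrays.asList("a", "b", "c"), 5));
    }
}
```

Throwing makes the over-limit case explicit to the caller. If the source is a scanner, it still needs to be closed in a `finally` block around the call.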

On 5/20/14, 1:10 PM, Slater, David M. wrote:
> Hey everyone,
>
> I'm trying to improve the query performance of batchscans on my data table. I first scan
> over index tables, which returns a set of rowIDs that correspond to the records I am interested
> in. This set of records is fairly randomly (and uniformly) distributed across a large number
> of tablets, due to the randomness of the UID and the query itself. Then I want to scan over
> my data table, which is set up as follows:
> row       colFam    colQual    value
> rowUID    --        --         byte[] of data
>
> These records are fairly small (100s of bytes), but numerous (I may return 50000 or more).
> The method I use to obtain this follows. Essentially, I turn the rows returned from the first
> query into a set of ranges to input into the batchscanner, and then return those rows, retrieving
> the value from them.
>
> // returns the data associated with the given collection of rows
>      public Collection<byte[]> getRowData(Collection<Text> rows, Text dataType,
>              String tablename, int queryThreads) throws TableNotFoundException {
>          List<byte[]> values = new ArrayList<byte[]>(rows.size());
>          if (!rows.isEmpty()) {
>              BatchScanner scanner = conn.createBatchScanner(tablename, new Authorizations(),
>                      queryThreads);
>              try {
>                  // one single-row Range per rowID
>                  List<Range> ranges = new ArrayList<Range>(rows.size());
>                  for (Text row : rows) {
>                      ranges.add(new Range(row));
>                  }
>                  scanner.setRanges(ranges);
>                  for (Map.Entry<Key, Value> entry : scanner) {
>                      values.add(entry.getValue().get());
>                  }
>              } finally {
>                  scanner.close(); // close the scanner even if the scan fails
>              }
>          }
>          return values;
>      }
>
> Is there a more efficient way to do this? I have index caches and bloom filters enabled
> (data caches are not), but I still seem to have a long query lag. Any thoughts on how I can
> improve this?
>
> Thanks,
> David
>
