accumulo-user mailing list archives

From <Bob.Thor...@l-3com.com>
Subject RE: Improving Batchscanner Performance
Date Tue, 20 May 2014 17:56:39 GMT
David,

I use this same pattern and have for the last couple of years.  What I have put in place is
a cache (Redis) between the batchscanner thread that is reading the index tables and a separate
consumer thread that does the final lookup of the rowIDs.  The rowID pool can grow as it
needs to in Redis, and the consumer thread then pulls whatever IDs it needs to do the final
scan for my row data.  Tuning the scanner's batch size to a small number that matches the
consumer's needs has helped us keep the final scanner from blocking on too much data or
flooding the network with unneeded data between the server and client(s).
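
A minimal sketch of that decoupling, with illustrative names: an in-process
ArrayBlockingQueue stands in for the Redis pool so the example is self-contained,
and it assumes the index stores the data-table rowID in the column qualifier.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class IndexLookupPipeline {
    // Bounded queue throttles the index scan so it can't run far ahead of
    // the data-table consumer (a Redis pool plays this role in production).
    private final BlockingQueue<Text> rowIds = new ArrayBlockingQueue<Text>(1024);

    // Producer thread: scan the index table and enqueue matching rowIDs.
    public void produceRowIds(Connector conn, String indexTable,
            List<Range> indexRanges) throws Exception {
        BatchScanner scanner = conn.createBatchScanner(indexTable,
                new Authorizations(), 4);
        try {
            scanner.setRanges(indexRanges);
            for (Map.Entry<Key, Value> e : scanner) {
                // assumes the rowID is in the column qualifier; put() blocks
                // when the queue is full, throttling the index scan
                rowIds.put(e.getKey().getColumnQualifier());
            }
        } finally {
            scanner.close();
        }
    }

    // Consumer thread: drain up to batchSize rowIDs and fetch their data,
    // keeping each final scan small so it neither blocks nor floods the wire.
    public List<byte[]> consumeBatch(Connector conn, String dataTable,
            int batchSize) throws Exception {
        List<Text> ids = new ArrayList<Text>();
        ids.add(rowIds.take());               // wait for at least one rowID
        rowIds.drainTo(ids, batchSize - 1);   // grab whatever else is ready
        List<Range> ranges = new ArrayList<Range>();
        for (Text row : ids) {
            ranges.add(new Range(row));
        }
        List<byte[]> values = new ArrayList<byte[]>();
        BatchScanner scanner = conn.createBatchScanner(dataTable,
                new Authorizations(), 4);
        try {
            scanner.setRanges(ranges);
            for (Map.Entry<Key, Value> entry : scanner) {
                values.add(entry.getValue().get());
            }
        } finally {
            scanner.close();
        }
        return values;
    }
}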

I would recommend you reconsider the use of UUIDs for rowIDs.  We found a significant performance
improvement by using an organic rowID that fits the data/purpose better and that orders the
data onto fewer tablet servers most of the time.  Adding more threads to a batch scanner
just because we used random rowIDs seemed to have diminishing returns as the rowIDs approached
randomness.
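
For example (a purely illustrative layout), a rowID that leads with a coarse,
query-relevant component groups related records on adjacent tablets, where a
UUID would scatter them:

// Hypothetical "organic" rowID: a date plus record-type prefix sorts
// related records together, while the trailing uid keeps rows unique.
// A UUID rowID would instead scatter every record across the cluster.
String buildRowId(String yyyymmdd, String recordType, String uid) {
    return yyyymmdd + "_" + recordType + "_" + uid;  // e.g. "20140520_sensor_a1b2c3"
}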

HTH

-----Original Message-----
From: Slater, David M. [mailto:David.Slater@jhuapl.edu] 
Sent: Tuesday, May 20, 2014 12:11 PM
To: user@accumulo.apache.org
Subject: Improving Batchscanner Performance

Hey everyone,

I'm trying to improve the query performance of batchscans on my data table. I first scan over
index tables, which returns a set of rowIDs that correspond to the records I am interested
in. This set of records is fairly randomly (and uniformly) distributed across a large number
of tablets, due to the randomness of the UID and of the query itself. Then I want to scan over
my data table, which is set up as follows:
row       colFam    colQual    value
rowUID    --        --         byte[] of data

These records are fairly small (hundreds of bytes) but numerous (a query may return 50,000
or more). The method I use to obtain them follows. Essentially, I turn the rows returned from
the first query into a set of ranges, feed them to the batchscanner, and then collect the
values from the rows it returns.

// returns the data associated with the given collection of rows
    public Collection<byte[]> getRowData(Collection<Text> rows, Text dataType,
            String tablename, int queryThreads) throws TableNotFoundException {
        List<byte[]> values = new ArrayList<byte[]>(rows.size());
        if (!rows.isEmpty()) {
            BatchScanner scanner = conn.createBatchScanner(tablename,
                    new Authorizations(), queryThreads);
            try {
                // one exact-row range per rowID from the index query
                List<Range> ranges = new ArrayList<Range>();
                for (Text row : rows) {
                    ranges.add(new Range(row));
                }
                scanner.setRanges(ranges);
                for (Map.Entry<Key, Value> entry : scanner) {
                    values.add(entry.getValue().get());
                }
            } finally {
                scanner.close(); // release scan threads even if iteration fails
            }
        }
        return values;
    }

Is there a more efficient way to do this? I have index caches and bloom filters enabled (data
caches are not), but I still seem to have a long query lag. Any thoughts on how I can improve
this?

Thanks,
David
