accumulo-user mailing list archives

From "Slater, David M." <>
Subject Improving Batchscanner Performance
Date Tue, 20 May 2014 17:10:56 GMT
Hey everyone,

I'm trying to improve the query performance of batchscans on my data table. I first scan over
index tables, which return a set of rowIDs corresponding to the records I am interested
in. This set of records is fairly randomly (and uniformly) distributed across a large number
of tablets, due to the randomness of the UIDs and of the query itself. Then I want to scan over
my data table, which is set up as follows:
row      colFam    colQual    value
rowUID   --        --         byte[] of data
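For concreteness, the first-phase index lookup might look like the following sketch. The real query runs against an Accumulo index table; here a sorted `TreeMap` stands in for it, and the `term\0rowUID` key layout is a hypothetical one, not necessarily the layout of my actual index tables:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch of the first-phase index lookup: a prefix scan over a sorted
// structure, collecting the rowUIDs indexed under a given term.
// (TreeMap is a stand-in for the Accumulo index table; the
// "term\0rowUID" key layout is assumed for illustration.)
public class IndexLookupSketch {
    // collect every rowUID indexed under the given term (prefix scan)
    static List<String> lookupRowIds(NavigableMap<String, String> index, String term) {
        List<String> rowIds = new ArrayList<String>();
        String lo = term + '\0';   // first possible key for this term
        String hi = term + '\1';   // just past the last key for this term
        for (String rowId : index.subMap(lo, hi).values()) {
            rowIds.add(rowId);
        }
        return rowIds;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> index = new TreeMap<String, String>();
        index.put("apple\0uid42", "uid42");
        index.put("apple\0uid7", "uid7");
        index.put("banana\0uid3", "uid3");
        System.out.println(lookupRowIds(index, "apple")); // prints [uid42, uid7]
    }
}
```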

These records are fairly small (100s of bytes) but numerous (a query may return 50000 or more).
The method I use to retrieve them follows. Essentially, I turn the rows returned by the first
query into a set of ranges, feed those to the batchscanner, and collect the value from each
entry it returns.

// returns the data associated with the given collection of rows
public Collection<byte[]> getRowData(Collection<Text> rows, Text dataType,
        String tablename, int queryThreads) throws TableNotFoundException {
    List<byte[]> values = new ArrayList<byte[]>(rows.size());
    if (!rows.isEmpty()) {
        BatchScanner scanner = conn.createBatchScanner(tablename,
                new Authorizations(), queryThreads);
        List<Range> ranges = new ArrayList<Range>(rows.size());
        for (Text row : rows) {
            ranges.add(new Range(row));
        }
        scanner.setRanges(ranges);
        for (Map.Entry<Key, Value> entry : scanner) {
            values.add(entry.getValue().get());
        }
        scanner.close();
    }
    return values;
}

Is there a more efficient way to do this? I have index caches and bloom filters enabled (data
caches are not), but I still seem to have a long query lag. Any thoughts on how I can improve
this?
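One thing that might help narrow down where the lag comes from is timing the two phases separately. This harness is not from my actual code; it is a minimal sketch, and the commented-out `scanIndex`/`getRowData` calls are hypothetical stand-ins for the two query phases:

```java
import java.util.function.Supplier;

// Sketch of a small harness for separating index-scan time from
// data batchscan time, by wrapping each phase in a timed call.
public class PhaseTimer {
    // run the phase, print its elapsed time, and pass its result through
    static <T> T timed(String label, Supplier<T> phase) {
        long start = System.nanoTime();
        T result = phase.get();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(label + " took " + elapsedMs + " ms");
        return result;
    }

    public static void main(String[] args) {
        // hypothetical usage, wrapping each phase of the two-step query:
        // Collection<Text> rows = timed("index scan", () -> scanIndex(term));
        // Collection<byte[]> data = timed("batch scan",
        //         () -> getRowData(rows, dataType, tablename, queryThreads));
        int n = timed("demo phase", () -> 1 + 1);
        System.out.println("result = " + n);
    }
}
```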

