accumulo-user mailing list archives

From "Slater, David M." <David.Sla...@jhuapl.edu>
Subject BatchScanning with a very large collection of ranges
Date Wed, 23 Jan 2013 18:24:36 GMT
First, thanks to everyone for their responses to my previous questions. (Mike, I'll definitely
take a look at Brian's materials for iterator behavior.)

Now I'm doing some sharded document querying (where the documents are small but numerous), and I'm trying to retrieve not just the list of matching document IDs but also the documents themselves (they are also stored in Accumulo). However, I'm running into a bottleneck in the retrieval process. The BatchScanner seems quite slow at retrieving data when it is given a very large number of small ranges (one per document), and increasing the thread count doesn't seem to help.

Basically, I'm taking all of the docIDs returned from the index query, making a new Range(docID) for each, adding those to a Collection<Range>, and then setting that collection of ranges on a new BatchScanner to retrieve the documents:

...
Collection<Range> docRanges = new LinkedList<Range>();
for (Map.Entry<Key, Value> entry : indexScanner) { // go through the index table here
    Text docID = entry.getKey().getColumnQualifier();
    docRanges.add(new Range(docID));
}

int threadCount = 20;
String docTableName = "docTable";
BatchScanner docScanner =
    connector.createBatchScanner(docTableName, new Authorizations(), threadCount);
docScanner.setRanges(docRanges); // large collection of single-doc ranges

for (Map.Entry<Key, Value> doc : docScanner) { // retrieve doc data
    ...
}
...

Is this a naïve way of doing this? Would trying to group documents into larger ranges (when
adjacent) be a more viable approach?
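To make that second question concrete, here is roughly what I mean by grouping, as an untested sketch: take the docIDs in sorted order (the index scanner already returns them sorted) in fixed-size batches, and cover each batch with a single Range from its first to its last docID. The batch size of 1000 and the sortedDocIDs list are just made up for illustration, and a range like this would also pull back any extra rows that happen to fall between the batch endpoints.

import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

...
// Collect the docIDs from the index scan instead of building one Range per doc;
// the index scanner returns keys in sorted order, so this list ends up sorted.
List<Text> sortedDocIDs = new ArrayList<Text>();
for (Map.Entry<Key, Value> entry : indexScanner) {
    sortedDocIDs.add(entry.getKey().getColumnQualifier());
}

// Cover each batch of docIDs with one Range spanning first..last (both inclusive).
Collection<Range> groupedRanges = new LinkedList<Range>();
int batchSize = 1000; // arbitrary
for (int i = 0; i < sortedDocIDs.size(); i += batchSize) {
    Text first = sortedDocIDs.get(i);
    Text last = sortedDocIDs.get(Math.min(i + batchSize, sortedDocIDs.size()) - 1);
    groupedRanges.add(new Range(first, true, last, true));
}
docScanner.setRanges(groupedRanges);
...

Whether scanning a handful of wider ranges (and skipping over the unwanted rows that sit between the docIDs I actually want) beats many tiny ranges presumably depends on how densely the matching documents sit in the table, which is part of what I'm asking.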

Thanks,
David
