accumulo-user mailing list archives

From Max Thomas <max.tho...@jhu.edu>
Subject map-reduce workflows given index tables
Date Wed, 12 Aug 2015 14:04:09 GMT
Suppose you have a workflow like the following (hopefully not too 
uncommon):

* "main table" - billions of rows, each with on the order of 100 columns
* Ternary classifier that produces an annotation on each row in the main 
table. Suppose the labels are A, B, and C. Additionally, this analytic 
adds all labels to an index table, which is of the form:

label : rowId

to facilitate lookups of a particular type.
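
To make the layout concrete, here is a minimal in-memory sketch in Java. It is not the Accumulo API: a TreeSet stands in for Accumulo's sorted key space, and the key encoding (label, a NUL separator, then the main-table row ID) as well as the names IndexSketch, add, and rowIdsFor are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

class IndexSketch {
    // Hypothetical separator between the label and the main-table row ID.
    static final char SEP = '\u0000';

    // Sorted set of "label<SEP>rowId" keys, standing in for a sorted index table.
    private final NavigableSet<String> index = new TreeSet<>();

    public void add(String label, String rowId) {
        index.add(label + SEP + rowId);
    }

    // Returns all row IDs stored under exactly this label, in sorted order,
    // by scanning the contiguous key range that starts with "label<SEP>".
    public List<String> rowIdsFor(String label) {
        String start = label + SEP;
        String end = label + (char) (SEP + 1); // first key past the label's range
        List<String> out = new ArrayList<>();
        for (String key : index.subSet(start, true, end, false)) {
            out.add(key.substring(start.length()));
        }
        return out;
    }

    public static void main(String[] args) {
        IndexSketch idx = new IndexSketch();
        idx.add("A", "row003");
        idx.add("B", "row002");
        idx.add("A", "row001");
        System.out.println(idx.rowIdsFor("A")); // prints [row001, row003]
    }
}
```

Because the keys are sorted, every entry for one label sits in a single contiguous range, which is what makes "give me all As" a cheap scan.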

Now, suppose you want to run another analytic over all rows with label 
A, preferably using MapReduce. It seems the options are:

1. Create a scanner which retrieves all As from the index table; add 
these row IDs as ranges to an AccumuloInputFormat job; launch a MapReduce 
job with a single map phase. Con: the driver program will need a large 
amount of memory to hold all of the row IDs for the range list.

2. A MapReduce job over the index table, with a Reduce phase where each 
reducer has a collection of row IDs to iterate over. Each reducer then 
retrieves its assigned rows and runs over them.

3. Run over the entire main table with a naive filter to check 
classification type. Cons: hits every row, many of which aren't going to 
match.

4. AccumuloMultiTableInputFormat, Filters/Iterators - don't seem appropriate here

It seems option #2 is ideal, with option #1 possibly working out too. 
But, I want to make sure I'm not missing something, as it doesn't seem 
possible to set up a workflow where the index table is hit, row IDs are 
retrieved, and these are then passed to another MapReduce job capable of 
hitting a different table via MapReduce (obviously one could create a 
BatchScanner given the inputs anywhere). Are there any examples that 
cover this? Or does anyone have a few suggestions about how to set up 
such a workflow?
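
For concreteness, the reduce side of option #2 might look roughly like the sketch below. It is plain Java, not Accumulo or Hadoop code: the processBatch callback is a hypothetical stand-in for whatever does the actual lookup against the main table (for example a BatchScanner given one range per row ID), and the names BatchedLookup and forEachBatch are made up. The only point is that each reducer holds at most batchSize row IDs in memory at a time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

class BatchedLookup {
    // Walks the row IDs handed to one reducer and passes them to processBatch
    // in groups of at most batchSize. Returns the number of batches issued.
    public static int forEachBatch(Iterable<String> rowIds, int batchSize,
                                   Consumer<List<String>> processBatch) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        List<String> batch = new ArrayList<>(batchSize);
        for (String id : rowIds) {
            batch.add(id);
            if (batch.size() == batchSize) {
                // Hand over a copy so the callback may keep the list.
                processBatch.accept(new ArrayList<>(batch));
                batch.clear();
                batches++;
            }
        }
        if (!batch.isEmpty()) { // flush the final partial batch
            processBatch.accept(new ArrayList<>(batch));
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("row001", "row003", "row007", "row010", "row042");
        // Placeholder for the real lookup and analytic: just print each batch.
        forEachBatch(ids, 2, b -> System.out.println("lookup " + b));
    }
}
```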

Another answer might very well be: this is a wacky table/indexing setup, 
which I am very amenable to hearing. But to a naive Accumulo user, 
having an index table seems OK - I think it is also covered in the manual.
