accumulo-user mailing list archives

From Yamini Joshi
Subject Re: Accumulo Equivalent of Mongo Aggr Query
Date Thu, 20 Oct 2016 15:23:59 GMT
Thank you for the reply Josh!

The data is of the form:
studentID course|courseID [ ]  count
studentID np2| [ ]  count

So a student is registered in multiple courses. The query has the following form:
Input: List of course IDs
Output: Computation on records that contain a course from the input
Step1: Select rows that contain a course matching courses in the list
Step2: Count the number of such courses for each student
Step3: Do some computation
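
The first two steps can be modeled on plain in-memory data to pin down the expected result. The sketch below is plain Python, not Accumulo API code; the tuple layout `(row, family, qualifier, value)`, the function name, and the sample IDs are assumptions made for illustration. Step 3 is left out because the computation is not specified here.

```python
from collections import Counter

# Hypothetical sample entries in the layout described above:
# (studentID, column family, column qualifier, count)
ENTRIES = [
    ("s1", "course", "c1", 3),
    ("s1", "course", "c2", 1),
    ("s1", "np2", "", 4),
    ("s2", "course", "c2", 2),
    ("s2", "np2", "", 2),
    ("s3", "course", "c3", 5),
    ("s3", "np2", "", 5),
]


def count_matching_courses(entries, course_ids):
    """Steps 1 and 2: per student, count the courses that appear in course_ids."""
    wanted = set(course_ids)
    counts = Counter()
    for student_id, family, qualifier, _value in entries:
        # Step 1: keep only course entries whose courseID is in the input list.
        if family == "course" and qualifier in wanted:
            # Step 2: one more matching course for this student.
            counts[student_id] += 1
    return dict(counts)


result = count_matching_courses(ENTRIES, ["c1", "c2"])
```

With the sample data, `s3` drops out entirely because none of its courses are in the input list.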

Approach1:
1. Designed a RowFilter that checks every row ID in the DB to see whether
the course is in the course list
2. Designed an iterator to count the number of such courses within each
student record
3. Designed an iterator to do the computation

Problem: Complexity = O(n) where n= number of records in the DB which is

Approach2(Better Lookup):
1. Created an inverted Index with:
courseID student|studentID [ ] count
2. Looked up students for courses in the list
3. Accessed records with the studentIDs and courseIDs generated from step1,
using Range objects in a batch scan
4. Designed an iterator to count courseIds within a student record
5. Designed an iterator to do the computation

Problem: A batch scan does not return records in sorted order, hence step 4
does not give me the required results :\
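
One way to see the problem, and a possible way around it, is sketched below in plain Python (not Accumulo API code). It assumes the inverted-index lookup yields `(courseID, studentID)` pairs in arbitrary order; the function names and sample data are hypothetical. A counter that relies on a student's entries being adjacent breaks on unsorted input, while a hash-based count keyed by studentID does not depend on arrival order.

```python
from collections import Counter
from itertools import groupby

# Hypothetical (courseID, studentID) pairs as an unordered lookup might return them.
UNSORTED_HITS = [("c1", "s1"), ("c2", "s2"), ("c2", "s1"), ("c1", "s3")]


def count_assuming_sorted(hits):
    """Counts runs of adjacent studentIDs; only correct if hits are grouped by student."""
    return [(student, sum(1 for _ in run))
            for student, run in groupby(hits, key=lambda hit: hit[1])]


def count_any_order(hits):
    """Counts matching courses per student without relying on arrival order."""
    return dict(Counter(student for _course, student in hits))


adjacent_counts = count_assuming_sorted(UNSORTED_HITS)
robust_counts = count_any_order(UNSORTED_HITS)
```

Here the adjacency-based counter splits `s1` into two runs of one entry each, whereas the hash-based count reports 2 for `s1`. The trade-off is that the client has to hold one counter per student in memory.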

I am not sure how to proceed now.

Best regards,
Yamini Joshi

On Mon, Sep 26, 2016 at 8:28 AM, Josh Elser wrote:

> I think I can understand what your query is doing, but, I'm just guessing
> too.
> What does your data in Accumulo look like? The only way I'm seeing that
> you would be able to implement this fully in Accumulo would be if your
> student_id is the leading component in the Accumulo rowId. The student_id
> anywhere else would require some multi-level computation (involving an
> additional aggregation client-side).
> Hoping that your data is in this form, a first implementation could be:
> 1. WholeRowIterator (collapse an entire row into one key-value pair)
> 2. Custom Filter (remove rows which do not match your criteria)
> 3. Custom transformation (permute the row into your 'np2' and 'shared'
> columns)
> Once you get the above working, there are a number of optimizations which
> you could do further (avoid serializing rows you're going to filter out or
> avoid the intermediate serialization entirely).
> Yamini Joshi wrote:
>> Hi Dylan
>> This is what I'm trying to do:
>> #groupby id and create 2 new columns: np2 and shared
>>   query = {'$group': {'_id': '$student_id', 'np2': {'$first': '$count'},
>> 'shared': {'$sum': 1}}}
>> The statement written above is one of the stages in a mongo aggregate
>> query. The results of all the stages are computed on the server side and
>> the final result returned to the user.
>> My problem is: I can't figure out 2 things:
>> 1. How to add new columns while writing a Combiner/iterator
>> 2. How to do group by (based on a condition since data in accumulo is
>> always stored in a group).
>> Best regards,
>> Yamini Joshi
>> On Sun, Sep 25, 2016 at 5:18 PM, Dylan Hutchison wrote:
>>     Hi Yamini,
>>     Could you further describe the computation you have in mind, for
>>     those of us not familiar with MongoDB's "Aggr" function?  You may
>>     want to look at Accumulo's built-in Combiner iterators.
>>     They seem more relevant than Filters.
>>     I don't know what you mean when you write that your output is not
>>     visible to "the complete Database".
>>     Regards, Dylan
>>     On Sun, Sep 25, 2016 at 11:34 AM, Yamini Joshi wrote:
>>         Hello everyone
>>         I wanted to know if there is any equivalent of Mongo Aggr
>>         queries in Accumulo. I have a complex query in the form of a Mongo
>>         aggregate (multi-staged) query. I'm trying to model the same in
>>         Accumulo. As of now, with the limited knowledge that I have, I
>>         have created a class extending the Filter class. My question is:
>>         since my queries depend on an input, is there any other way of
>>         using the iterators/filters only for one query or change their
>>         input with every single query? As of now, my filter is getting
>>         attached to the table on 'SCAN' that means the output will be
>>         visible to the subsequent queries and not the complete Database.
>>         Best regards,
>>         Yamini Joshi
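
For reference, the `$group` stage quoted above (`_id` = `student_id`, `np2` = first `count`, `shared` = number of grouped documents) can be modeled in plain Python over records sorted by `student_id`, which is roughly how a row-collapsing iterator would see them. This is a sketch of the semantics only, not Accumulo or MongoDB API code; the record layout, function name, and sample values are assumptions.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical documents, already sorted by student_id (as rows in a sorted table would be).
DOCS = [
    {"student_id": "s1", "count": 4},
    {"student_id": "s1", "count": 7},
    {"student_id": "s2", "count": 2},
]


def group_stage(docs):
    """Emulates a $group on student_id with np2 = $first of count and shared = $sum of 1."""
    grouped = []
    for student_id, members in groupby(docs, key=itemgetter("student_id")):
        members = list(members)  # collapse one student's records into a single unit
        grouped.append({
            "_id": student_id,
            "np2": members[0]["count"],  # $first: value from the first record in the group
            "shared": len(members),      # $sum: 1 counts the records in the group
        })
    return grouped


groups = group_stage(DOCS)
```

Because `groupby` only merges adjacent records, this model depends on the input being sorted by `student_id`, which is the same reason the student ID needs to lead the row ID in the suggestion quoted above.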
