accumulo-user mailing list archives

From Yamini Joshi <yamini.1...@gmail.com>
Subject Re: Iterator as a Filter
Date Fri, 21 Oct 2016 01:53:34 GMT
I have an input C, which is the list of courses a student x is enrolled in.
I am trying to do some computation which requires 2 things
for each student enrolled in at least one of the courses in C:
1. The total number of courses the student is enrolled in, i.e.
cardinality(Y), where Y is the set of that student's courses
2. The number of courses the student is enrolled in which belong to the
list, i.e. cardinality(Y intersection C)
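
A minimal sketch of this computation, assuming the enrollment entries have
already been fetched as (studentID, courseID) pairs in arbitrary order, the
way a batch scan would return them. The function name `enrollment_stats` and
the sample data are hypothetical, and this is plain Python working on
in-memory pairs, not Accumulo API code. Because it groups by student in a
dictionary, it does not depend on the entries arriving sorted.

```python
from collections import defaultdict


def enrollment_stats(entries, course_list):
    """Return {student_id: (|Y|, |Y intersection C|)} for matching students.

    entries     -- iterable of (student_id, course_id) pairs, in any order
    course_list -- the input list C of course IDs
    """
    wanted = set(course_list)

    # Group courses per student; arrival order of the entries is irrelevant.
    courses_by_student = defaultdict(set)
    for student_id, course_id in entries:
        courses_by_student[student_id].add(course_id)

    stats = {}
    for student_id, courses in courses_by_student.items():
        overlap = len(courses & wanted)
        # Keep only students enrolled in at least one course from C.
        if overlap > 0:
            stats[student_id] = (len(courses), overlap)
    return stats


# Hypothetical sample data, deliberately unsorted.
entries = [
    ("s2", "c3"), ("s1", "c1"), ("s3", "c4"),
    ("s1", "c2"), ("s2", "c1"), ("s1", "c5"),
]
result = enrollment_stats(entries, ["c1", "c2"])
print(result)
```

The trade-off is that the grouping happens on the client and holds one set
of course IDs per student in memory, so it only fits when the number of
matching students is modest.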


Best regards,
Yamini Joshi

On Thu, Oct 20, 2016 at 7:16 PM, Dave <dlmarion@comcast.net> wrote:

> I'm a little confused as to the use case here. Are you trying to find courses
> that students are taking where the students are in a particular class? The
> table design is going to depend on the set of questions that you want to
> answer.
>
> On Oct 20, 2016 7:19 PM, Yamini Joshi <yamini.1691@gmail.com> wrote:
>
> I did use the inverted index, but I ran into trouble because I used a
> batch scan and it returns unsorted data. Also, I need to do some
> computation afterwards. Here is my problem definition:
>
> The data is of the form:
> studentID course|courseID [ ]  count
> .
> .
> .
> .
> .
> studentID np2| [ ]  count
>
> So a student is registered in multiple courses. The query has the
> following parameters:
> Input: List of course Ids
> Output: Computation on records that contain a course from the input
> Algo:
> Step1: Select rows that contain a course matching courses in the list
> Step2: Count the number of such courses for each student
> Step3: Do some computation
>
> Approach1(Naive):
> 1. Designed a RowFilter that checks every row in the DB to see whether it
> contains a course from the course list
> 2. Designed an iterator to count the number of such courses within each
> student
> 3. Designed an iterator to do the computation
>
> Problem: Complexity = O(n), where n = number of records in the DB, which is
> BAD.
>
> Approach2(Better Lookup):
> 1. Created an inverted Index with:
> courseID student|studentID [ ] count
> .
> .
> .
> .
> 2. Looked up students for courses in the list
> 3. Accessed records with studentIDs, courseID generated from step1 using
> Range Object in batch scan
> 4. Designed an iterator to count courseIds within a student record
> 5. Designed an iterator to do the computation
>
> Problem: Batch scan does not return records in sorted order, hence step
> 4 does not give me the required results :\
>
> I am not sure how to proceed now.
>
>
>
> Best regards,
> Yamini Joshi
>
> On Thu, Oct 20, 2016 at 6:04 PM, Dylan Hutchison <
> dhutchis@cs.washington.edu> wrote:
>
> Hi Yamini,
>
> If you have a finite, known list of column families, you can use locality
> groups
> <https://accumulo.apache.org/1.8/accumulo_user_manual#_locality_groups> to
> store them in separate files in Hadoop.   Scans that only reference the
> column families within a locality group need not open data in other
> locality groups' files.
>
> Apart from locality groups, setting "fetch column families and/or
> qualifiers" on the scanner sets up a standard Filter iterator on the scan.
> If you need to obtain these columns from every row, then the whole table is
> scanned and filtered server-side.  (Seeking will occur during the scan if
> the selected columns are far apart in the table.)  I guess that is too
> inefficient for your use case.  For reference, these iterators are here
> for families
> <https://github.com/apache/accumulo/blob/master/core/src/main/java/org/apache/accumulo/core/iterators/system/ColumnFamilySkippingIterator.java>
> and here for qualifiers
> <https://github.com/apache/accumulo/blob/master/core/src/main/java/org/apache/accumulo/core/iterators/system/ColumnQualifierFilter.java>
> .
>
> If locality groups are not an option and you must filter on families and
> columns, then you may want to consider maintaining an index table, in which
> the columns are stored as rows, or otherwise moving the columns into the
> rows.
>
> Regards, Dylan
>
> On Thu, Oct 20, 2016 at 3:45 PM, Yamini Joshi <yamini.1691@gmail.com>
> wrote:
>
> Hello all
>
> Is it possible to configure an iterator that works as a filter? As per
> Accumulo docs:
> As such, the `Filter` class functions well for filtering small amounts of
> data, but is
> inefficient for filtering large amounts of data. The decision to use a
> `Filter` strongly
> depends on the use case and distribution of data being filtered.
>
> I have a huge corpus to be filtered with a small amount of data selected.
> I want to select column families from a list of col families. I have a
> rough idea of using 'seek' to bypass cfs that don't exist in the list. I
> was hoping I could exploit the 'seek'ing in iterator and go to the range in
> the list of cf and check if it exists. I am not sure if this will work or
> if it is a good approach. Any feedback is much appreciated.
>
> Best regards,
> Yamini Joshi
>
>
>
>
>
