accumulo-user mailing list archives

From Dave <dlmar...@comcast.net>
Subject Re: Iterator as a Filter
Date Fri, 21 Oct 2016 00:16:46 GMT
<p dir="ltr">I'm a little confused about the use case here. Are you trying to find courses
that students are taking where the students are in a particular class? The table design is
going to depend on the set of questions that you want to answer. </p>
<div class="gmail_extra"><br><div class="gmail_quote">On Oct 20, 2016 7:19
PM, Yamini Joshi &lt;yamini.1691@gmail.com&gt; wrote:<br type="attribution"><blockquote
class="quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div
dir="ltr">I did use the inverted index, but I ran into trouble because I used a batch scan,
which returns unsorted data. Also, I need to do some computation afterwards.  Here is my problem
definition:<br /><div><br />The data is of the form:<br /></div><div>studentID
course|courseID [ ]  count<br />.<br />.<br />.<br />.<br />.<br
/>studentID np2| [ ]  count<br /><br /></div><div>So a student
is registered in multiple courses. The query has the following parameters:<br /></div><div>Input:
List of course IDs<br /></div><div>Output: Computation on records that contain
a course from the input list<br /></div><div>Algorithm: <br /></div><div>Step1:
Select rows that contain a course matching courses in the list<br /></div><div>Step2:
Count the number of such courses for each student<br /></div><div>Step3:
Do some computation <br /></div><div><br /></div><div>Approach1(Naive):<br
/></div><div>1. Designed a RowFilter that checks all the rowIds in the DB to
check if the course is in the course List<br /></div><div>2. Designed an
iterator to count the number of such courses within each student<br /></div><div>3.
Designed an iterator to do the computation<br /><br /></div><div>Problem:
Complexity &#61; O(n) where n&#61; number of records in the DB which is BAD.<br
/><br /></div><div>Approach2(Better Lookup):<br /></div><div>1.
Created an inverted Index with:<br /></div><div>courseID student|studentID
[ ] count<br />.<br />.<br />.<br />.<br /></div><div>2.
Looked up students for courses in the list<br /></div><div>3. Accessed records
with studentIDs, courseID generated from step1 using Range Object in batch scan<br /></div><div>4.
Designed an iterator to count courseIds within a student record<br /></div><div>5.
Designed an iterator to do the computation<br /><br /></div><div>Problem:
Batch scan does not return records in sorted order, hence step 4 does not give me the required
results :\<br /><br /></div>I am not sure how to proceed now.<br /><br
/><br /></div><div><br clear="all" /><div><div><div
dir="ltr"><div>Best regards,<br />Yamini Joshi</div></div></div></div>
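<div>One possible way around the ordering problem in step 4 is to count on the client in a map keyed by studentID, so the order in which the batch scan returns entries no longer matters. The following is only a minimal sketch under that assumption and does not use the Accumulo API: the <code>Enrollment</code> and <code>CourseCounter</code> classes are hypothetical stand-ins, and in real code the entries would come from iterating the batch scanner.</div>

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for one scanned entry: row = studentID, column = courseID.
class Enrollment {
    final String studentId;
    final String courseId;

    Enrollment(String studentId, String courseId) {
        this.studentId = studentId;
        this.courseId = courseId;
    }
}

class CourseCounter {
    // Counts, per student, how many of the wanted courses appear (Step1 + Step2).
    // The map is keyed by studentID, so the input may arrive in any order.
    static Map<String, Integer> countMatches(List<Enrollment> entries, Set<String> wanted) {
        Map<String, Integer> perStudent = new HashMap<>();
        for (Enrollment e : entries) {
            if (wanted.contains(e.courseId)) {
                perStudent.merge(e.studentId, 1, Integer::sum);
            }
        }
        return perStudent;
    }

    public static void main(String[] args) {
        // Deliberately unsorted, as a batch scan might return it.
        List<Enrollment> unsorted = Arrays.asList(
            new Enrollment("s2", "c1"),
            new Enrollment("s1", "c3"),
            new Enrollment("s1", "c1"),
            new Enrollment("s2", "c9"));
        Set<String> wanted = new HashSet<>(Arrays.asList("c1", "c3"));
        System.out.println(countMatches(unsorted, wanted));
    }
}
```

<div>The trade-off is that the per-student counts are held in client memory, which may or may not be acceptable depending on how many students match.</div>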
<br /><div class="elided-text">On Thu, Oct 20, 2016 at 6:04 PM, Dylan Hutchison
<span dir="ltr">&lt;<a href="mailto:dhutchis&#64;cs.washington.edu">dhutchis&#64;cs.washington.edu</a>&gt;</span>
wrote:<br /><blockquote style="margin:0 0 0 0.8ex;border-left:1px #ccc solid;padding-left:1ex"><div
dir="ltr">Hi Yamini,<div><br /></div><div>If you have a finite,
known list of column families, you can use <a href="https://accumulo.apache.org/1.8/accumulo_user_manual#_locality_groups">locality
groups</a> to store them in separate files in Hadoop.   Scans that only reference
the column families within a locality group need not open data in other locality groups&#39;
files.</div><div><br /></div><div>Apart from locality groups,
setting &#34;fetch column families and/or qualifiers&#34; on the scanner sets up a
standard Filter iterator on the scan.  If you need to obtain these columns from every row,
then the whole table is scanned and filtered server-side.  (Seeking will occur during the
scan if the selected columns are far apart in the table.)  I guess that is too inefficient
for your use case.  For reference, these iterators are <a href="https://github.com/apache/accumulo/blob/master/core/src/main/java/org/apache/accumulo/core/iterators/system/ColumnFamilySkippingIterator.java">here
for families</a> and <a href="https://github.com/apache/accumulo/blob/master/core/src/main/java/org/apache/accumulo/core/iterators/system/ColumnQualifierFilter.java">here
for qualifiers</a>.</div><div><br /></div><div>If locality
groups are not an option and you must filter on families and columns, then you may want to
consider maintaining an index table, in which the columns are stored as rows, or otherwise
moving the columns into the rows.</div><div><br /></div><div>Regards,
Dylan</div></div><div><div><div><br /><div class="elided-text">On
Thu, Oct 20, 2016 at 3:45 PM, Yamini Joshi <span dir="ltr">&lt;<a href="mailto:yamini.1691&#64;gmail.com">yamini.1691&#64;gmail.com</a>&gt;</span>
wrote:<br /><blockquote style="margin:0 0 0 0.8ex;border-left:1px #ccc solid;padding-left:1ex"><div
dir="ltr"><div><div>Hello all<br /><br /></div>Is it possible
to configure an iterator that works as a filter? As per Accumulo docs:<br />As such,
the &#96;Filter&#96; class functions well for filtering small amounts of data, but
is inefficient for filtering large amounts
of data. The decision to use a &#96;Filter&#96; strongly depends
on the use case and distribution of data being filtered.<br /><br /></div>I
have a huge corpus to filter, from which only a small amount of data is selected. I want to select column
families from a list of column families. I have a rough idea of using &#39;seek&#39;
to bypass column families that are not in the list. I was hoping I could exploit seeking
in the iterator to jump to the range of each column family in the list and check whether it exists. I am not sure if
this will work or if it is a good approach. Any feedback is much appreciated.  <br /><div><div><br
clear="all" /><div><div><div><div dir="ltr"><div>Best regards,<br
/>Yamini Joshi</div></div></div></div>
</div></div></div></div>
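<div>The seek idea described above can be illustrated on plain sorted data. This is a sketch only, not an Accumulo iterator: <code>TreeSet.ceiling</code> stands in for a seek on the underlying sorted source, and the class name, method name, and sample column families are made up for illustration.</div>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class SeekingFilterSketch {
    // Returns the wanted column families that are present in the sorted data.
    // ceiling(x) jumps to the first stored entry >= x without visiting the
    // entries in between, which is the role a seek would play in an iterator.
    static List<String> present(TreeSet<String> sortedFamilies, List<String> wanted) {
        List<String> found = new ArrayList<>();
        for (String cf : wanted) {
            String hit = sortedFamilies.ceiling(cf); // "seek" to cf
            if (cf.equals(hit)) {
                found.add(cf);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        TreeSet<String> families =
            new TreeSet<>(Arrays.asList("cf01", "cf02", "cf07", "cf42", "cf99"));
        List<String> wanted = Arrays.asList("cf07", "cf50", "cf99");
        System.out.println(present(families, wanted)); // prints [cf07, cf99]
    }
}
```

<div>The sketch performs one seek per wanted column family instead of one comparison per stored entry. Whether that is faster in practice depends on how far apart the wanted families are in the table.</div>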
</blockquote></div><br /></div>
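<div>As a rough illustration of the index-table idea (columns stored as rows), the sketch below builds a courseID to studentID mapping in memory and looks up only the requested courses. It is a simplified model under stated assumptions, not Accumulo code: the class name, method names, and sample IDs are hypothetical.</div>

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {
    // Builds courseID -> sorted studentIDs from {studentID, courseID} pairs,
    // so the column of the main table becomes the row of the index.
    static Map<String, TreeSet<String>> build(List<String[]> pairs) {
        Map<String, TreeSet<String>> index = new TreeMap<>();
        for (String[] p : pairs) {
            index.computeIfAbsent(p[1], k -> new TreeSet<>()).add(p[0]);
        }
        return index;
    }

    // Reads only the index rows for the requested courses, never the whole table.
    static TreeSet<String> studentsFor(Map<String, TreeSet<String>> index, List<String> courses) {
        TreeSet<String> students = new TreeSet<>();
        for (String c : courses) {
            students.addAll(index.getOrDefault(c, new TreeSet<>()));
        }
        return students;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
            new String[] {"s1", "c1"},
            new String[] {"s1", "c3"},
            new String[] {"s2", "c1"},
            new String[] {"s3", "c9"});
        Map<String, TreeSet<String>> index = build(pairs);
        System.out.println(studentsFor(index, Arrays.asList("c1", "c3"))); // prints [s1, s2]
    }
}
```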
</div></div></blockquote></div><br /></div>
</blockquote></div><br></div>