accumulo-user mailing list archives

From Dylan Hutchison <dhutc...@cs.washington.edu>
Subject Re: Iterator as a Filter
Date Thu, 20 Oct 2016 23:04:11 GMT
Hi Yamini,

If you have a finite, known list of column families, you can use locality
groups
<https://accumulo.apache.org/1.8/accumulo_user_manual#_locality_groups> to
store them separately on disk.  Scans that only reference the
column families within a locality group need not read the data of other
locality groups.

Apart from locality groups, setting "fetch column families and/or
qualifiers" on the scanner sets up a standard Filter iterator on the scan.
If you need to obtain these columns from every row, then the whole table is
scanned and filtered server-side.  (Seeking will occur during the scan if
the selected columns are far apart in the table.)  I guess that is too
inefficient for your use case.  For reference, these iterators are here for
families
<https://github.com/apache/accumulo/blob/master/core/src/main/java/org/apache/accumulo/core/iterators/system/ColumnFamilySkippingIterator.java>
and here for qualifiers
<https://github.com/apache/accumulo/blob/master/core/src/main/java/org/apache/accumulo/core/iterators/system/ColumnQualifierFilter.java>
.
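
To make the seeking idea concrete, here is a minimal sketch in plain Java. It
is a toy model and not Accumulo's implementation: a sorted set of
"row\0family" strings stands in for the table, and every name in it
(SeekingFamilyScan, scan, key, Result) is hypothetical. It assumes one entry
per (row, family) and that rows and families contain no control characters.
When the current family is not wanted, the scan jumps to the next wanted
family in the same row, or to the next row, instead of stepping entry by entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Toy model only: a sorted set of "row\0family" strings stands in for a table.
final class SeekingFamilyScan {
    static final char SEP = '\u0000';

    // Matching entries plus the number of entries the scan actually looked at.
    static final class Result {
        final List<String> hits = new ArrayList<>();
        int examined = 0;
    }

    static String key(String row, String family) {
        return row + SEP + family;
    }

    static Result scan(NavigableSet<String> table, NavigableSet<String> wanted) {
        Result result = new Result();
        String pos = table.isEmpty() ? null : table.first();
        while (pos != null) {
            result.examined++;
            int i = pos.indexOf(SEP);
            String row = pos.substring(0, i);
            String family = pos.substring(i + 1);
            if (wanted.contains(family)) {
                result.hits.add(row + ":" + family);
                pos = table.higher(pos); // plain step to the next entry
            } else {
                // "Seek": jump to the next wanted family in this row, or past
                // the end of this row when no wanted family sorts after this one.
                String nextFamily = wanted.higher(family);
                String target = (nextFamily != null)
                        ? key(row, nextFamily)
                        : row + (char) (SEP + 1);
                pos = table.ceiling(target);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        NavigableSet<String> table = new TreeSet<>();
        for (String f : new String[] {"a1", "a2", "a3", "b", "c1", "c2", "d"}) {
            table.add(key("r1", f));
        }
        for (String f : new String[] {"a1", "a2", "c1"}) {
            table.add(key("r2", f));
        }
        table.add(key("r3", "d"));

        Result r = scan(table, new TreeSet<>(Arrays.asList("b", "d")));
        System.out.println(r.hits + " after examining " + r.examined
                + " of " + table.size() + " entries");
    }
}
```

The saving grows with the number of unwanted entries between wanted families.
If nearly every row contains a wanted family, the scan still has to visit
every row.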

If locality groups are not an option and you must filter on families and
qualifiers, then you may want to consider maintaining an index table, in which
the columns are stored as rows, or otherwise moving the columns into the
rows.
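
As a rough illustration of the index-table layout, the sketch below uses plain
Java collections rather than Accumulo. The names (FamilyIndexSketch,
buildIndex, rowsWithFamily) are hypothetical, and it assumes families and rows
contain no control characters. Index keys put the family first, so all rows
that contain a given family sit in one contiguous sorted range and can be read
without touching the rest of the table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Toy model only: index keys are "family\0row", i.e. the column moved into the row.
final class FamilyIndexSketch {
    static final char SEP = '\u0000';

    // Build the index from {row, family} pairs taken from the data table.
    static NavigableSet<String> buildIndex(List<String[]> rowFamilyPairs) {
        NavigableSet<String> index = new TreeSet<>();
        for (String[] rf : rowFamilyPairs) {
            index.add(rf[1] + SEP + rf[0]);
        }
        return index;
    }

    // One range lookup: every key that starts with "family\0".
    static List<String> rowsWithFamily(NavigableSet<String> index, String family) {
        String start = family + SEP;
        String end = family + (char) (SEP + 1);
        List<String> rows = new ArrayList<>();
        for (String k : index.subSet(start, true, end, false)) {
            rows.add(k.substring(start.length()));
        }
        return rows;
    }

    public static void main(String[] args) {
        NavigableSet<String> index = buildIndex(Arrays.asList(
                new String[] {"r1", "b"},
                new String[] {"r1", "d"},
                new String[] {"r2", "a1"},
                new String[] {"r2", "bb"},
                new String[] {"r3", "d"}));
        System.out.println(rowsWithFamily(index, "d"));
    }
}
```

The trade-off is that the index has to be written alongside the data table and
kept consistent with it.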

Regards, Dylan

On Thu, Oct 20, 2016 at 3:45 PM, Yamini Joshi <yamini.1691@gmail.com> wrote:

> Hello all
>
> Is it possible to configure an iterator that works as a filter? As per
> Accumulo docs:
> As such, the `Filter` class functions well for filtering small amounts of
> data, but is
> inefficient for filtering large amounts of data. The decision to use a
> `Filter` strongly
> depends on the use case and distribution of data being filtered.
>
> I have a huge corpus to be filtered with a small amount of data selected.
> I want to select column families from a list of col families. I have a
> rough idea of using 'seek' to bypass cfs that don't exist in the list. I
> was hoping I could exploit the 'seek'ing in iterator and go to the range in
> the list of cf and check if it exists. I am not sure if this will work or
> if it is a good approach. Any feedback is much appreciated.
>
> Best regards,
> Yamini Joshi
>
