accumulo-dev mailing list archives

From Jonathan LASKO <jonathan.la...@raytheon.com>
Subject using stateful iterators to do filtering during major compaction?
Date Mon, 10 Apr 2017 19:14:49 GMT
Hi Accumulo wizards,

TL;DR - this is a question about custom iterators and saving state (or seeking backwards)
in order to filter / mask data during major compaction.

For a project I'm working on, we would like to be able to use one entry to filter other entries
in the same row. (I will call the first entry the 'filtering key.') To do this, we would ensure
that this 'filtering key' lexicographically precedes the other entries it would be used on.

There is, of course, a snag with this idea. The iterator could simply read the 'filtering key'
entry, keep it in memory, and use it for subsequent filtering, were it not for the fact that the
iterator stack can be dropped and re-initialized at any point in the row, including at cf's/cq's
that are already past the 'filtering key.' Our understanding is that the tserver processes can
(and do!) tear down and re-initialize the iterator stack at any point. When this happens, the
tserver will seek(...) the newly re-initialized iterator stack back to the same row/cf/cq
that the previous incarnation of the stack was on when it was torn down.
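
A minimal sketch of this failure mode, using only the Java standard library rather than the
Accumulo API. Everything in it is an assumption made for illustration: the Entry and
StatefulFilter classes, the "!filter" qualifier for the 'filtering key,' and the rule that the
key's value masks every qualifier starting with it. Replacing the filter object mid-row stands
in for the tserver calling init(...) and then seek(...).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stdlib-only simulation; none of these types are Accumulo classes.
class TeardownDemo {
    // Hypothetical entry within one row: column qualifier plus value.
    static final class Entry {
        final String cq;
        final String value;
        Entry(String cq, String value) { this.cq = cq; this.value = value; }
    }

    // Naive stateful filter: keeps the filtering key's value in a field.
    static final class StatefulFilter {
        private String mask; // null until the filtering key has been seen

        boolean accept(Entry e) {
            if (e.cq.equals("!filter")) { // assumed name of the filtering key
                mask = e.value;
                return true;
            }
            // Assumed rule: drop entries whose qualifier starts with the mask.
            return mask == null || !e.cq.startsWith(mask);
        }
    }

    // Filters the row, swapping in a fresh filter just before index
    // teardownAt (pass -1 for no teardown). The fresh instance stands in
    // for init(...) followed by seek(...) to the current position.
    static List<String> run(List<Entry> row, int teardownAt) {
        StatefulFilter filter = new StatefulFilter();
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < row.size(); i++) {
            if (i == teardownAt) {
                filter = new StatefulFilter(); // in-memory state is gone
            }
            if (filter.accept(row.get(i))) {
                kept.add(row.get(i).cq);
            }
        }
        return kept;
    }

    static List<Entry> sampleRow() {
        return Arrays.asList(
            new Entry("!filter", "secret"),
            new Entry("public.a", "1"),
            new Entry("secret.b", "2"),
            new Entry("secret.c", "3"));
    }

    public static void main(String[] args) {
        System.out.println(run(sampleRow(), -1)); // [!filter, public.a]
        System.out.println(run(sampleRow(), 2));  // [!filter, public.a, secret.b, secret.c]
    }
}
```

With no teardown the masked entries are dropped; with a teardown just before secret.b, the fresh
filter has never seen the 'filtering key' and lets the masked entries through.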

When this teardown/re-init happens, the tserver doesn't call deepCopy(...) on the iterator
stack; it just calls init(...). (At least, this is our experience in Accumulo 1.6.2.) For
this reason, trying to keep state in the iterators is a risky proposition. (Josh
Elser acknowledges this in his presentation on designing and testing custom iterators for
Accumulo, https://www.slideshare.net/je2451/designing-and-testing-accumulo-iterators).

Nevertheless, for the scantime scope, I believe we can use WholeRowIterator to ensure that
we don't ever return data for a row until we've read the entire row, thus avoiding the need
to keep state in the iterators. (If the iterator stack gets re-initialized, we should start
over from the beginning of the row.)
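
The whole-row idea can be sketched in plain Java, again without the Accumulo API. WholeRowIterator
itself bundles a row into a single key/value pair; the code below only imitates the effect by
grouping {row, cq, value} triples per row and filtering each complete row as one unit. The
triple layout, the "!filter" qualifier, and the prefix-masking rule are assumptions made for
illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stdlib-only imitation of row-at-a-time filtering; not Accumulo code.
class WholeRowDemo {
    // Entries are {row, cq, value} triples, already sorted by row, then cq.
    // Returns, per row, the qualifiers that survive filtering.
    static Map<String, List<String>> filterByWholeRow(List<String[]> entries) {
        // Group the entries so that each row is handled as one unit.
        Map<String, List<String[]>> rows = new LinkedHashMap<>();
        for (String[] e : entries) {
            rows.computeIfAbsent(e[0], r -> new ArrayList<>()).add(e);
        }
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String[]>> row : rows.entrySet()) {
            // The mask is recomputed from the row itself, so nothing has to
            // survive between rows or across a re-initialization.
            String mask = null;
            for (String[] e : row.getValue()) {
                if (e[1].equals("!filter")) { // assumed filtering-key name
                    mask = e[2];
                }
            }
            List<String> kept = new ArrayList<>();
            for (String[] e : row.getValue()) {
                boolean isFilterKey = e[1].equals("!filter");
                if (isFilterKey || mask == null || !e[1].startsWith(mask)) {
                    kept.add(e[1]);
                }
            }
            out.put(row.getKey(), kept);
        }
        return out;
    }

    static List<String[]> sample() {
        return Arrays.asList(
            new String[] {"r1", "!filter", "secret"},
            new String[] {"r1", "public.a", "1"},
            new String[] {"r1", "secret.b", "2"},
            new String[] {"r2", "secret.x", "3"});
    }

    public static void main(String[] args) {
        System.out.println(filterByWholeRow(sample()));
    }
}
```

Because each row is processed in full before anything is emitted, the only cost of a restart
is redoing the current row, at the price of holding a whole row in memory.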

Our problem comes when we want to use this filter in the majc (major compaction) scope to
actually filter the masked data out of the system entirely. In this case, the WholeRowIterator
approach wouldn't seem to be usable (because Accumulo only allows us to set filters, but not
other iterators, for compaction time).

Here are our questions:

(1) Has Accumulo's behavior when tearing down and re-initializing an iterator stack changed
between 1.6.2 and the latest version? (I.e. is deepCopy now called?)

(2) Are there any other ways in which storing state across iterator stack teardowns has been
made any easier?

(3) If not, are there any other tricks/hacks which we might consider using (albeit with caution)
to store state or otherwise accomplish this? (Options we've mused about include figuring out
another way for the iterators to store state beyond themselves -- can iterators write to the
IteratorEnvironment to influence future iterator instantiations? -- and/or allowing the iterators
to seek backwards to get the 'filtering key' they need.)

(4) Also: any downsides to using the WholeRowIterator we should keep in mind?
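
To make the 'seek backwards' option from question (3) concrete, here is a minimal stdlib-only
sketch. It assumes a freshly initialized iterator that is seeked to some position in the row
may first read the start of the row to recover the 'filtering key' and only then continue from
the requested position. The TreeMap stands in for the iterator's source, and scanFrom, the
"!filter" qualifier, and the prefix-masking rule are hypothetical. Whether a real iterator can
safely do this at compaction time is exactly what we are asking.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Stdlib-only simulation of re-reading the filtering key after a re-seek.
class SeekBackDemo {
    // row maps column qualifier -> value for a single row, in sorted order.
    // Returns the surviving qualifiers at or after startCq, as if a fresh
    // iterator instance had just been seeked to startCq.
    static List<String> scanFrom(TreeMap<String, String> row, String startCq) {
        // Step 1: look back at the start of the row for the filtering key.
        String mask = row.get("!filter"); // assumed filtering-key name

        // Step 2: continue from the requested position and apply the mask.
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, String> e : row.tailMap(startCq, true).entrySet()) {
            String cq = e.getKey();
            if (cq.equals("!filter") || mask == null || !cq.startsWith(mask)) {
                kept.add(cq);
            }
        }
        return kept;
    }

    static TreeMap<String, String> sampleRow() {
        TreeMap<String, String> row = new TreeMap<>();
        row.put("!filter", "secret");
        row.put("public.a", "1");
        row.put("secret.b", "2");
        row.put("secret.c", "3");
        return row;
    }

    public static void main(String[] args) {
        System.out.println(scanFrom(sampleRow(), "!filter"));  // [!filter, public.a]
        System.out.println(scanFrom(sampleRow(), "secret.b")); // []
    }
}
```

Here no state has to survive a teardown, because every seek re-derives the mask from the row
itself; the extra read at the start of the row is the cost.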

Thanks in advance,

Jonathan
