accumulo-user mailing list archives

From Russ Weeks <rwe...@newbrightidea.com>
Subject Distinguishing between processed and unprocessed data in an Iterator
Date Tue, 30 Sep 2014 22:27:41 GMT
Hi, folks,

The StatsCombiner[1] shows one way for an Iterator to distinguish between
processed and unprocessed data. In this case, the StatsCombiner treats
string representations of integers as unprocessed data and comma-separated
string representations of integers as processed data.
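
A minimal sketch of that format-based distinction, written as plain Java
with no Accumulo dependency. It assumes the processed form is
"min,max,sum,count" and the unprocessed form is a single base-10 integer;
the class and method names are made up for illustration and this is not the
StatsCombiner source itself.

```java
import java.util.Arrays;
import java.util.List;

public class StatsFormatSketch {

    /**
     * Folds a mix of unprocessed values ("7") and processed values
     * ("min,max,sum,count") into one processed value.
     * The number of comma-separated fields is the only thing that
     * tells the two kinds apart.
     */
    static String combine(List<String> values) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        long sum = 0;
        long count = 0;

        for (String v : values) {
            String[] parts = v.split(",");
            if (parts.length == 1) {
                // Unprocessed: a single integer observation.
                long n = Long.parseLong(parts[0]);
                min = Math.min(min, n);
                max = Math.max(max, n);
                sum += n;
                count += 1;
            } else if (parts.length == 4) {
                // Processed: a summary emitted by an earlier pass.
                min = Math.min(min, Long.parseLong(parts[0]));
                max = Math.max(max, Long.parseLong(parts[1]));
                sum += Long.parseLong(parts[2]);
                count += Long.parseLong(parts[3]);
            } else {
                throw new IllegalArgumentException("Unrecognized value: " + v);
            }
        }
        return min + "," + max + "," + sum + "," + count;
    }

    public static void main(String[] args) {
        // Two raw integers and one summary produced earlier.
        System.out.println(combine(Arrays.asList("3", "1,9,15,4", "12")));
        // prints 1,12,30,6
    }
}
```

The distinction only works here because the two encodings can never be
confused: a bare integer never contains a comma.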

Two questions: First, is it possible to do this in an arbitrary fashion?
For example, let's say my Iterator adds Values to a bloom filter which it
maintains internally, like a combiner, but potentially across multiple
CFs. If the iterator encounters unprocessed data, it should offer it to
the bloom filter. If it encounters processed data (i.e., a bloom filter), it
should merge it with its own bloom filter.

The only way that I can think of to do this is to have a higher-priority
iterator that "escapes" Values, and have my Iterator emit unescaped Values.
Then my iterator can make decisions based on whether a current Value is or
isn't escaped. I find this approach pretty kludgy though, and any advice is
welcome.
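
To make the escaping idea concrete, here is a rough sketch in plain Java
with no Accumulo dependency. Everything in it is an assumption for
illustration: the one-byte markers RAW and FILTER, the 1024-bit size, the
toy two-hash scheme, and the use of java.util.BitSet as a stand-in for a
real bloom filter.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

public class TaggedValueSketch {
    static final byte RAW = 0;     // hypothetical marker: escaped, unprocessed payload
    static final byte FILTER = 1;  // hypothetical marker: serialized filter bits
    static final int BITS = 1024;  // arbitrary filter size for the sketch

    /** What the higher-priority "escaping" iterator would do to a raw value. */
    static byte[] escapeRaw(byte[] payload) {
        byte[] out = new byte[payload.length + 1];
        out[0] = RAW;
        System.arraycopy(payload, 0, out, 1, payload.length);
        return out;
    }

    /** What the filtering iterator would emit as its processed value. */
    static byte[] encodeFilter(BitSet bits) {
        byte[] body = bits.toByteArray();
        byte[] out = new byte[body.length + 1];
        out[0] = FILTER;
        System.arraycopy(body, 0, out, 1, body.length);
        return out;
    }

    /** Offer a raw value to the filter, or merge an already-built filter into it. */
    static void absorb(BitSet filter, byte[] value) {
        if (value.length == 0) {
            throw new IllegalArgumentException("Value has no marker byte");
        }
        byte[] body = Arrays.copyOfRange(value, 1, value.length);
        if (value[0] == FILTER) {
            filter.or(BitSet.valueOf(body)); // union of two filters
        } else if (value[0] == RAW) {
            add(filter, body);
        } else {
            throw new IllegalArgumentException("Unknown marker: " + value[0]);
        }
    }

    static int[] positions(byte[] item) {
        int h = Arrays.hashCode(item);
        return new int[] { Math.floorMod(h, BITS), Math.floorMod(h * 31 + 17, BITS) };
    }

    static void add(BitSet filter, byte[] item) {
        for (int p : positions(item)) filter.set(p);
    }

    static boolean mightContain(BitSet filter, byte[] item) {
        for (int p : positions(item)) if (!filter.get(p)) return false;
        return true;
    }

    public static void main(String[] args) {
        byte[] a = "a".getBytes(StandardCharsets.UTF_8);
        byte[] b = "b".getBytes(StandardCharsets.UTF_8);

        BitSet earlier = new BitSet(BITS);
        absorb(earlier, escapeRaw(a));              // an earlier pass saw "a"

        BitSet current = new BitSet(BITS);
        absorb(current, escapeRaw(b));              // this pass sees raw "b"
        absorb(current, encodeFilter(earlier));     // and the earlier filter

        System.out.println(mightContain(current, a) && mightContain(current, b));
    }
}
```

The marker byte does the same job as the comma in the StatsCombiner format:
it makes the two kinds of Value impossible to confuse, at the cost of
needing every raw Value to be escaped before this iterator sees it.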

Second question: is the need to distinguish between processed and
unprocessed data due to the Iterator running in all three scopes? Would a
per-scanner Iterator or an Iterator running in scan scope be guaranteed to
only see unprocessed data?

Thanks,
-Russ

[1] https://github.com/apache/accumulo/blob/master/examples/simple/src/main/java/org/apache/accumulo/examples/simple/combiner/StatsCombiner.java
