accumulo-user mailing list archives

From Christopher <ctubb...@apache.org>
Subject Re: Distinguishing between processed and unprocessed data in an Iterator
Date Tue, 30 Sep 2014 22:46:19 GMT
On Tue, Sep 30, 2014 at 6:27 PM, Russ Weeks <rweeks@newbrightidea.com>
wrote:

> Hi, folks,
>
> The StatsCombiner[1] shows one way for an Iterator to distinguish between
> processed and unprocessed data. In this case, the StatsCombiner treats
> string representations of integers as unprocessed data and comma-separated
> string representations of integers as processed data.
>
> Two questions: First, is it possible to do this in an arbitrary fashion?
> For example, let's say my Iterator adds Values to a bloom filter which it
> maintains internally - like a combiner, but potentially across multiple
> CFs. If the iterator encounters unprocessed data, it should offer it to
> the bloom filter. If it encounters processed data (i.e. a bloom filter), it
> should merge it with its own bloom filter.
>
> The only way that I can think of to do this is to have a higher-priority
> iterator that "escapes" Values, and have my Iterator emit unescaped Values.
> Then my iterator can make decisions based on whether a current Value is or
> isn't escaped. I find this approach pretty kludgy though, and any advice is
> welcome.
>
>
Sure, you could generalize this, for example by standardizing the way you
flag data as evaluated. However, I think most people would interpret
"evaluated" to mean "evaluated by this specific iterator", which would imply
that the flagging is iterator-specific.
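
As a rough illustration of iterator-specific flagging, here is a minimal
sketch in plain Java with no Accumulo dependency. The class name
ProcessedFlag, its methods, and the marker bytes are all hypothetical; the
sketch also assumes that raw input never begins with the marker, which is
exactly the escaping concern raised in the question.

```java
import java.util.Arrays;

/** Hypothetical helper: tags values that one specific iterator has already folded. */
final class ProcessedFlag {
    // Assumed marker, chosen for this sketch only. Raw input must never start
    // with these bytes; otherwise it has to be escaped before it is written.
    private static final byte[] MARKER = {0x00, 'B', 'F', 0x01};

    private ProcessedFlag() {}

    /** Prepends the marker so a later pass can recognize the value as processed. */
    static byte[] markProcessed(byte[] payload) {
        byte[] out = new byte[MARKER.length + payload.length];
        System.arraycopy(MARKER, 0, out, 0, MARKER.length);
        System.arraycopy(payload, 0, out, MARKER.length, payload.length);
        return out;
    }

    /** Returns true only if the value starts with the marker. */
    static boolean isProcessed(byte[] value) {
        if (value.length < MARKER.length) {
            return false;
        }
        for (int i = 0; i < MARKER.length; i++) {
            if (value[i] != MARKER[i]) {
                return false;
            }
        }
        return true;
    }

    /** Strips the marker from processed values; raw values are returned unchanged. */
    static byte[] payload(byte[] value) {
        return isProcessed(value)
            ? Arrays.copyOfRange(value, MARKER.length, value.length)
            : value;
    }
}
```

An iterator built this way would check isProcessed on each Value it reads,
merge the payload if it is flagged, offer it to its internal state if it is
not, and emit only flagged values. Because the marker is private to one
iterator, two different iterators would each need their own marker.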


> Second question: the need to distinguish between processed and unprocessed
> data, is this due to the Iterator running in all three scopes? Would a
> per-scanner Iterator or an Iterator running in scan scope be guaranteed to
> only see unprocessed data?
>
>
It's more that the iterator may run over the same data multiple times, not
that it runs in different scopes (although running in different scopes
increases the number of times the data is iterated over). This could happen,
for instance, if a tablet is compacted multiple times and the only scope the
iterator is configured for is major compaction.
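
The repeated-compaction case can be simulated outside Accumulo. The sketch
below is a plain-Java stand-in written in the spirit of the format described
in the question (a single integer is unprocessed, a comma-separated list is
processed). The class name RecompactionSketch, the fold method, and the
"min,max,sum,count" layout are assumptions made for this illustration, not
the StatsCombiner source.

```java
import java.util.List;

/** Hypothetical stand-in for a combiner that may see its own output again. */
final class RecompactionSketch {
    private RecompactionSketch() {}

    /**
     * Folds all values for one key into a single "min,max,sum,count" string.
     * Accepts both raw values ("7") and output of an earlier pass ("3,9,12,2").
     */
    static String fold(List<String> values) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        long sum = 0;
        long count = 0;
        for (String v : values) {
            String[] parts = v.split(",");
            if (parts.length == 1) {
                // Unprocessed: a single integer.
                long n = Long.parseLong(parts[0]);
                min = Math.min(min, n);
                max = Math.max(max, n);
                sum += n;
                count += 1;
            } else if (parts.length == 4) {
                // Processed: merge the statistics from an earlier pass.
                min = Math.min(min, Long.parseLong(parts[0]));
                max = Math.max(max, Long.parseLong(parts[1]));
                sum += Long.parseLong(parts[2]);
                count += Long.parseLong(parts[3]);
            } else {
                throw new IllegalArgumentException("unexpected value: " + v);
            }
        }
        return min + "," + max + "," + sum + "," + count;
    }
}
```

Folding "3" and "9" gives "3,9,12,2". A later compaction that sees that
result alongside a new raw "1" gives "1,9,13,3", and compacting that single
value again leaves it unchanged. That stability under re-processing is the
property an iterator configured for a compaction scope needs.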

So, in response to the second part of this question, an iterator in the
scan scope would be guaranteed to only see unprocessed data if the iterator
has not been configured for minor compaction or major compaction scopes at
all.


> Thanks,
> -Russ
>
> 1:
> https://github.com/apache/accumulo/blob/master/examples/simple/src/main/java/org/apache/accumulo/examples/simple/combiner/StatsCombiner.java
>



--
Christopher L Tubbs II
http://gravatar.com/ctubbsii


--
Christopher L Tubbs II
http://gravatar.com/ctubbsii

On Tue, Sep 30, 2014 at 6:27 PM, Russ Weeks <rweeks@newbrightidea.com>
wrote:

> Hi, folks,
>
> The StatsCombiner[1] shows one way for an Iterator to distinguish between
> processed and unprocessed data. In this case, the StatsCombiner treats
> string representations of integers as unprocessed data and comma-separated
> string representations of integers as processed data.
>
> Two questions: First, is it possible to do this in an arbitrary fashion?
> For example, let's say my Iterator adds Values to a bloom filter which it
> maintains internally - like a combiner, but potentially across multiple
> CF's. If the iterator encounters unprocessed data, it should offer it to
> the bloom filter. If it encounters processed data (ie. a bloom filter), it
> should merge it with its own bloom filter.
>
> The only way that I can think of to do this is to have a higher-priority
> iterator that "escapes" Values, and have my Iterator emit unescaped Values.
> Then my iterator can make decisions based on whether a current Value is or
> isn't escaped. I find this approach pretty kludgy though, and any advice is
> welcome.
>
> Second question: the need to distinguish between processed and unprocessed
> data, is this due to the Iterator running in all three scopes? Would a
> per-scanner Iterator or an Iterator running in scan scope be guaranteed to
> only see unprocessed data?
>
> Thanks,
> -Russ
>
> 1:
> https://github.com/apache/accumulo/blob/master/examples/simple/src/main/java/org/apache/accumulo/examples/simple/combiner/StatsCombiner.java
>

Mime
View raw message