accumulo-user mailing list archives

From Russ Weeks <rwe...@newbrightidea.com>
Subject Re: Distinguishing between processed and unprocessed data in an Iterator
Date Wed, 01 Oct 2014 01:34:29 GMT
> an iterator in the scan scope would be guaranteed to only see unprocessed
> data if the iterator has not been configured for minor compaction or major
> compaction scopes at all

Excellent, thanks Christopher. That simplifies things. One more question: I
understand that an iterator may be re-seeked at any point in its lifetime,
which could cause it to see unprocessed data a second time. I assume this
is true for scan-scope iterators as well?

-Russ
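
For anyone reading this in the archive, here is a minimal sketch of the flagging idea from the quoted reply below: each Value carries a leading marker byte saying whether this particular iterator has already evaluated it. This is plain Java over byte arrays, not Accumulo API. The class name, the marker values and the one-byte layout are all assumptions made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Accumulo: prefixes each value with one
// marker byte so an iterator can tell raw input from its own earlier output.
final class ProcessedFlagSketch {
    // Assumed marker values; a real deployment would pick markers specific to the iterator.
    static final byte RAW = 0x00;
    static final byte PROCESSED = 0x01;

    // Returns a new array: the marker byte followed by the payload bytes.
    static byte[] tag(byte marker, byte[] payload) {
        byte[] out = new byte[payload.length + 1];
        out[0] = marker;
        System.arraycopy(payload, 0, out, 1, payload.length);
        return out;
    }

    // True if the value was emitted by this iterator, false if it is raw input.
    // Rejects values that carry neither marker instead of guessing.
    static boolean isProcessed(byte[] value) {
        if (value.length == 0 || (value[0] != RAW && value[0] != PROCESSED)) {
            throw new IllegalArgumentException("value carries no known marker byte");
        }
        return value[0] == PROCESSED;
    }

    // Strips the marker byte and returns the original payload.
    static byte[] payload(byte[] value) {
        return Arrays.copyOfRange(value, 1, value.length);
    }

    public static void main(String[] args) {
        byte[] raw = tag(RAW, "hello".getBytes(StandardCharsets.UTF_8));
        byte[] done = tag(PROCESSED, new byte[] {1, 2, 3});
        System.out.println(isProcessed(raw));   // false
        System.out.println(isProcessed(done));  // true
        System.out.println(new String(payload(raw), StandardCharsets.UTF_8)); // hello
    }
}
```

In this sketch an iterator would offer the payload of a RAW value to its bloom filter, merge the payload of a PROCESSED value as a serialized filter, and tag whatever it emits as PROCESSED. The marker is only meaningful to the iterator that defines it, which matches the point that the flagging is iterator-specific.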

On Tue, Sep 30, 2014 at 3:46 PM, Christopher <ctubbsii@apache.org> wrote:

> On Tue, Sep 30, 2014 at 6:27 PM, Russ Weeks <rweeks@newbrightidea.com>
> wrote:
>
>> Hi, folks,
>>
>> The StatsCombiner[1] shows one way for an Iterator to distinguish between
>> processed and unprocessed data. In this case, the StatsCombiner treats
>> string representations of integers as unprocessed data and comma-separated
>> string representations of integers as processed data.
>>
>> Two questions: First, is it possible to do this in an arbitrary fashion?
>> For example, let's say my Iterator adds Values to a bloom filter which it
>> maintains internally - like a combiner, but potentially across multiple
>> CFs. If the iterator encounters unprocessed data, it should offer it to
>> the bloom filter. If it encounters processed data (i.e. a bloom filter), it
>> should merge it with its own bloom filter.
>>
>> The only way that I can think of to do this is to have a higher-priority
>> iterator that "escapes" Values, and have my Iterator emit unescaped Values.
>> Then my iterator can make decisions based on whether a current Value is or
>> isn't escaped. I find this approach pretty kludgy though, and any advice is
>> welcome.
>>
>>
> Sure, you could generalize this, like standardize the way you flag data as
> evaluated. However, I think most people would interpret "evaluated" to mean
> "evaluated by this specific iterator", which would imply that the flagging
> is iterator-specific.
>
>
>> Second question: the need to distinguish between processed and
>> unprocessed data, is this due to the Iterator running in all three scopes?
>> Would a per-scanner Iterator or an Iterator running in scan scope be
>> guaranteed to only see unprocessed data?
>>
>>
> It's more that the iterator may run over the same data multiple times, not
> that it runs in different scopes (although running in different scopes
> increases the number of times the data is iterated over). This could happen,
> for instance, if a tablet is compacted multiple times and the only scope the
> iterator is configured for is major compaction.
>
> So, in response to the second part of this question, an iterator in the
> scan scope would be guaranteed to only see unprocessed data if the iterator
> has not been configured for minor compaction or major compaction scopes at
> all.
>
>
>> Thanks,
>> -Russ
>>
>> 1:
>> https://github.com/apache/accumulo/blob/master/examples/simple/src/main/java/org/apache/accumulo/examples/simple/combiner/StatsCombiner.java
>>
>
>
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
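
The distinction described at the start of the thread (a bare integer string is unprocessed, a comma-separated string is processed) can be sketched in plain Java as follows. This is only an illustration of the idea, not the StatsCombiner source: the class name is hypothetical, and the "min,max,sum,count" field order is an assumption about the summary format.

```java
import java.util.List;

// Hypothetical sketch: tells raw values from earlier summaries by their shape.
final class StatsSketch {
    // Folds values into one "min,max,sum,count" summary string.
    // One field  -> raw integer, counted once.
    // Four fields -> an earlier summary (assumed order: min,max,sum,count), merged.
    // An empty input list is not handled specially in this sketch.
    static String reduce(List<String> values) {
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE, sum = 0, count = 0;
        for (String v : values) {
            String[] parts = v.split(",");
            if (parts.length == 1) {
                long n = Long.parseLong(parts[0].trim());
                min = Math.min(min, n);
                max = Math.max(max, n);
                sum += n;
                count += 1;
            } else if (parts.length == 4) {
                min = Math.min(min, Long.parseLong(parts[0].trim()));
                max = Math.max(max, Long.parseLong(parts[1].trim()));
                sum += Long.parseLong(parts[2].trim());
                count += Long.parseLong(parts[3].trim());
            } else {
                throw new IllegalArgumentException("unexpected value: " + v);
            }
        }
        return min + "," + max + "," + sum + "," + count;
    }

    public static void main(String[] args) {
        String first = reduce(List.of("3", "7", "5"));
        System.out.println(first);                       // 3,7,15,3
        System.out.println(reduce(List.of(first, "10"))); // 3,10,25,4
    }
}
```

Because a summary fed back in on its own comes out unchanged, running this kind of reduction over the same data more than once (for example across repeated compactions) does not double-count, which is the property the thread is after.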
