accumulo-notifications mailing list archives

From "Dylan Hutchison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-4049) create a generic counting iterator
Date Tue, 10 Nov 2015 08:09:10 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998221#comment-14998221 ]

Dylan Hutchison commented on ACCUMULO-4049:
-------------------------------------------

How about:
1. Generalizing {{FirstKeyInRowIterator}} into a new class, {{FirstPartialKeyInRowIterator}}
(or perhaps {{StrippingIterator}}).
2. Modifying {{CountingIterator}} so that it can act as a user iterator, taking care to return
entries only within the range it was seeked to.  It still just counts the entries from its parent iterator.
3. Creating a user convenience class, {{CountPartialKeysIterator}}.  Internally it creates
instances of the two iterators above.  It adds no new functionality; all it does is save the user
from having to configure two iterators.  It accepts an option for the partial key: null
(count all entries, so iterator #1 is not needed); row, column family, column qualifier, or
column visibility (use both iterators); timestamp (same as count all); delete key (same as
count all, since we don't normally see deletes during scans).
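To make the plan concrete, here is a minimal stand-alone sketch of what the two iterators compute together. It is not Accumulo code: keys are modeled as sorted {{String[]}} arrays of (row, family, qualifier, visibility), and the class name, method name, and {{depth}} parameter are hypothetical stand-ins for the partial-key option.

```java
import java.util.Arrays;
import java.util.List;

/** Illustration only: models "strip to a partial key, then count" on plain arrays. */
final class PartialKeyCountSketch {

    /**
     * Counts distinct key prefixes of length depth in a list that is already sorted.
     * depth = 0 models the "null" option: every entry is counted.
     * depth = 1 is row, 2 is row+family, 3 adds qualifier, 4 adds visibility.
     */
    static long count(List<String[]> sortedKeys, int depth) {
        if (depth == 0) {
            return sortedKeys.size();
        }
        long n = 0;
        String[] prev = null;
        for (String[] key : sortedKeys) {
            String[] prefix = Arrays.copyOf(key, depth);
            // Sorted input means equal prefixes are adjacent, so comparing
            // against the previous prefix is enough to detect a new one.
            if (prev == null || !Arrays.equals(prev, prefix)) {
                n++;
                prev = prefix;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        List<String[]> keys = Arrays.asList(
            new String[] {"r1", "cf1", "a", ""},
            new String[] {"r1", "cf1", "b", ""},
            new String[] {"r1", "cf2", "a", ""},
            new String[] {"r2", "cf1", "a", ""});
        System.out.println(count(keys, 0)); // all entries
        System.out.println(count(keys, 1)); // rows
        System.out.println(count(keys, 2)); // row+family partial keys
    }
}
```

Timestamp and delete key are left out of the model because, per the list above, they behave the same as counting all entries.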

Client code would look like
{code}
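// Needs org.apache.accumulo.core.client.*, org.apache.accumulo.core.data.*, java.util.Map,
// java.nio.charset.StandardCharsets.  Assumes conn (Connector), table, and auths already exist.
BatchScanner bs = conn.createBatchScanner(table, auths, 4);
bs.setRanges(some_ranges);
IteratorSetting is = new IteratorSetting(25, "my_row_counter", CountPartialKeysIterator.class);
// Proposed static setter; it has to receive the IteratorSetting it configures.
CountPartialKeysIterator.setPartialKey(is, PartialKey.ROW);
bs.addScanIterator(is);
long cnt = 0L;
try {
  // Each returned entry carries a partial count; sum them on the client.
  for (Map.Entry<Key, Value> entry : bs) {
    cnt += Long.parseLong(new String(entry.getValue().get(), StandardCharsets.UTF_8));
  }
} finally {
  bs.close();
}
return cnt;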
{code}

I could look at this over the weekend if the plan looks good.

This iterator should not be used at compaction time because it is not idempotent.
The convenience class {{CountPartialKeysIterator}} is not a perfect solution.  It would be
better to create a generic iterator that can load multiple iterators at runtime inside
{{init}}, but that is outside this ticket's scope.
We can make {{CountPartialKeysIterator}} count all entries by default when no
iterator option is set.
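On the idempotence point, a toy model may help: a counting pass collapses its input to a single entry holding the count, so running it again over its own output (as a later compaction would) yields a different answer. The class and method names below are hypothetical and nothing here touches Accumulo.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustration only: a counting pass is not idempotent. */
final class CountNotIdempotentSketch {

    /** Models one counting pass: the output is a single value, the number of input entries. */
    static List<String> countPass(List<String> values) {
        return Collections.singletonList(Long.toString(values.size()));
    }

    public static void main(String[] args) {
        List<String> original = Arrays.asList("v1", "v2", "v3");
        List<String> once = countPass(original);  // one entry holding "3"
        List<String> twice = countPass(once);     // one entry holding "1"
        System.out.println(once + " vs " + twice);
    }
}
```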

I thought of naming iterator #3 {{CountPartsIterator}}, but that name could lead users to
believe it can count the number of unique column families.  The iterator can only count
the number of row+family partial keys; counting unique column families is more complex
and a poorer fit for Accumulo's layout.
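A small sketch of the distinction, under the assumption that keys are reduced to sorted (row, family) pairs; the names are hypothetical and this is not Accumulo code. Counting row+family partial keys only needs the previous key, while counting unique families across rows needs to remember every family seen so far.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustration only: row+family partial keys versus unique column families. */
final class FamilyCountSketch {

    /** Counts distinct (row, family) pairs in sorted input using constant memory. */
    static long rowFamilyPairs(List<String[]> sortedKeys) {
        long n = 0;
        String[] prev = null;
        for (String[] key : sortedKeys) {
            if (prev == null || !Arrays.equals(prev, key)) {
                n++;
            }
            prev = key;
        }
        return n;
    }

    /** Counts distinct families across all rows; memory grows with the number of families. */
    static long uniqueFamilies(List<String[]> sortedKeys) {
        Set<String> seen = new HashSet<>();
        for (String[] key : sortedKeys) {
            seen.add(key[1]);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<String[]> keys = Arrays.asList(
            new String[] {"r1", "cf1"},
            new String[] {"r1", "cf1"},
            new String[] {"r1", "cf2"},
            new String[] {"r2", "cf1"},
            new String[] {"r2", "cf2"});
        System.out.println(rowFamilyPairs(keys)); // row+family partial keys
        System.out.println(uniqueFamilies(keys)); // unique families across rows
    }
}
```

Because the same family reappears under different rows, the two numbers differ as soon as a family is shared between rows.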

> create a generic counting iterator
> ----------------------------------
>
>                 Key: ACCUMULO-4049
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4049
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: core
>            Reporter: Adam Fuchs
>
> As a user I want to be able to count the number of key/values, rows, row+column family,
> etc. that exist in a range. This could be done via a simple iterator like the CountingIterator
> added at scan time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
