accumulo-user mailing list archives

From Keith Turner <>
Subject Re: Iterators that alter key-values
Date Mon, 18 May 2015 22:31:02 GMT
On Sat, May 16, 2015 at 3:27 AM, Dave Hardcastle <> wrote:

> A couple of follow-up questions...
> So, is it true to say that a filtering iterator that is filtering out a
> high percentage of the key-values in a range, might have to redo a lot of
> work if a reseek happens? (It's reseeked to the last emitted key, but a lot
> of key-values past that may already have been rejected by the filter.)

This may happen; it depends on what the tserver is doing.  Let's assume a
call to next on the iterator advances to the next top key, and not past
it.  If the tserver calls next after the buffer is full, then what you
described could happen.

So if the tserver is doing something like the following, I think it would
redo work.  Need to investigate this.

  iter = ....

  while (iter.hasTop() && !buffer.isFull()) {
    buffer.add(iter.getTopKey(), iter.getTopValue());
    iter.next();  // if this call to next is made even when buffer is full, it could redo work
  }

  return buffer;  // will reseek with last key (non-inclusive) in buffer
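
To make the concern concrete, here is a minimal, self-contained sketch of the loop above. It is a toy model, not Accumulo code: the integer keys, the every-10th-key filter, the batch size, and all names (ReseekDemo, advance, nextWhenFull) are assumptions made up for illustration. It counts filter evaluations when next is called after the buffer fills versus when it is skipped.

```java
import java.util.ArrayList;
import java.util.List;

public class ReseekDemo {
    static final int N = 100;   // hypothetical key space: 0..99
    static int filterChecks;    // number of filter evaluations so far

    // Hypothetical filter: keeps every 10th key, rejects the rest.
    static boolean accept(int key) {
        filterChecks++;
        return key % 10 == 0;
    }

    // Models seek()/next(): first accepted key >= from, or -1 if none.
    static int advance(int from) {
        for (int k = from; k < N; k++) {
            if (accept(k)) return k;
        }
        return -1;
    }

    // Scans the whole key space in batches, rebuilding the "stack" after
    // each batch and reseeking just past the last key in the buffer.
    static List<Integer> scan(int batchSize, boolean nextWhenFull) {
        filterChecks = 0;
        List<Integer> results = new ArrayList<>();
        int resume = 0;
        while (true) {
            int top = advance(resume);                 // seek on a fresh stack
            List<Integer> buffer = new ArrayList<>();
            while (top != -1 && buffer.size() < batchSize) {
                buffer.add(top);
                // Lazy variant: stop without calling next once the buffer is full.
                if (buffer.size() == batchSize && !nextWhenFull) break;
                top = advance(top + 1);                // next()
            }
            results.addAll(buffer);
            if (top == -1) return results;             // source exhausted
            resume = buffer.get(buffer.size() - 1) + 1; // non-inclusive reseek
        }
    }

    public static void main(String[] args) {
        List<Integer> eager = scan(2, true);
        System.out.println("eager: " + eager + " checks=" + filterChecks);
        List<Integer> lazy = scan(2, false);
        System.out.println("lazy:  " + lazy + " checks=" + filterChecks);
    }
}
```

Both variants return the same keys; in this toy setup the variant that calls next on a full buffer re-evaluates the rejected keys between the last buffered key and the next top key after every reseek.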

> Would it be worth making the fact that the reseek happens to the last
> emitted key explicit in the documentation? It seems natural to me to assume
> that the reseek happens to one key past the last read key. I don't think
> the javadoc for the seek() method in SortedKeyValueIterator makes it quite
> clear enough.

When the reseek is done using the last key returned, that key becomes the
non-inclusive start of the range.  What are your thoughts on the following
paragraph?

> Thanks,
> Dave.
> On 15 May 2015 at 19:32, Eric Newton <> wrote:
>> is it the same instance of the iterator object
>> No, it is not.
>> On Fri, May 15, 2015 at 2:16 PM, Dave Hardcastle <> wrote:
>>> Jim,
>>> That explains a lot - I knew that the iterator stack could be resumed in
>>> the middle of a range, but didn't realise that it used the last emitted key
>>> to decide where to resume.
>>> Just so I'm clear, when iterators get stopped and later resumed, is it
>>> the same instance of the iterator object that's restarted (so that I could
>>> store state in there and use that to help the reseek) or is it a new
>>> instance of the iterator that has to be able to resume purely on the basis
>>> of the last emitted key?
>>> As you say though, it's probably best to stick to modifying values only.
>>> Thanks very much,
>>> Dave.
>>> On 15 May 2015 at 18:55, James Hughes <> wrote:
>>>> Hi Dave,
>>>> The big thing to note is that your iterator stack may get stopped and
>>>> torn down for various reasons.  As Accumulo recreates the stack, it will
>>>> call 'seek' with the last emitted key in order to resume.
>>>> If you are returning keys out of order in an iterator, the 'seek'
>>>> method needs to be able to undo the transformation and call 'seek'
>>>> appropriately.  That's not impossible, but it isn't trivial.
>>>> In GeoMesa, we did something like that at one point (without having a
>>>> smart 'seek').  I enjoyed two days of debugging trying to figure out why
>>>> medium sized requests would hang.  (There was an infinite loop....)  From
>>>> that experience, I'd suggest only modifying values.
>>>> Cheers,
>>>> Jim
>>>> On Fri, May 15, 2015 at 1:26 PM, Dave Hardcastle <> wrote:
>>>>> Hi,
>>>>> I've always assumed that the last iterator in the stack can make
>>>>> arbitrary changes to keys and values, including not returning the keys in
>>>>> sorted order. I know that SortedKeyValueIterator says that "anything
>>>>> implementing this interface should return keys in sorted order" - but
>>>>> don't see a good reason that has to be true for the final iterator. This
>>>>> assumption seems to be backed up by the manual which says that "the only
>>>>> safe way to generate additional data in an iterator is to alter the current
>>>>> key-value pair" - it doesn't say that making arbitrary modifications to the
>>>>> rowkey or key is forbidden.
>>>>> I have a situation where I am making a transformation of the rowkey
>>>>> that may not preserve the ordering of the keys. When I scan for individual
>>>>> ranges I get the correct results. When I scan for two ranges using a
>>>>> BatchScanner, I get lots of data back which is not in the ranges I queried
>>>>> for. I am not explicitly checking that I have not gone beyond the range,
>>>>> but that should not be necessary as I am not doing any seeking, only
>>>>> consuming the key-values I receive.
>>>>> So, my main question is whether the last iterator is allowed to not
>>>>> return keys in sorted order?
>>>>> Thanks,
>>>>> Dave.
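
The reseek problem with rewritten keys described earlier in the quoted thread can be sketched with a toy model. This is not Accumulo code: it assumes only that a rebuilt stack seeks the underlying sorted source to just past the last key the client received, and the string rows, the reversing transform, and all names (OutOfOrderReseekDemo, maxBatches) are hypothetical. The maxBatches cap exists only so the demonstration terminates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

public class OutOfOrderReseekDemo {
    // Emits transform(row) for each source row, in batches. After each
    // batch the "stack" is rebuilt and the source is sought to the first
    // row strictly greater than the last EMITTED (transformed) key.
    static List<String> scan(List<String> rows, UnaryOperator<String> transform,
                             int batchSize, int maxBatches) {
        TreeSet<String> source = new TreeSet<>(rows);
        List<String> received = new ArrayList<>();
        String lastEmitted = null;
        for (int batch = 0; batch < maxBatches; batch++) {
            String row;
            if (lastEmitted == null) {
                row = source.isEmpty() ? null : source.first();
            } else {
                row = source.higher(lastEmitted);   // non-inclusive reseek
            }
            int inBatch = 0;
            while (row != null && inBatch < batchSize) {
                lastEmitted = transform.apply(row);
                received.add(lastEmitted);
                inBatch++;
                row = source.higher(row);           // next()
            }
            if (row == null) break;                 // source exhausted
        }
        return received;
    }

    // Hypothetical key rewrite that does not preserve sort order.
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    public static void main(String[] args) {
        List<String> rows = List.of("xa", "yb");
        // Unmodified keys: each row is returned once.
        System.out.println(scan(rows, s -> s, 1, 5));
        // Reversed keys: the reseek lands before the row already read.
        System.out.println(scan(rows, OutOfOrderReseekDemo::reverse, 1, 5));
    }
}
```

With the reversing transform the emitted key sorts before the source row it came from, so every reseek lands on the same row again and the scan never progresses. With other inputs the emitted key can sort after unread rows, which are then skipped. Either way the seek would have to undo the transformation, as suggested in the thread.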
