accumulo-user mailing list archives

From Dave Hardcastle <>
Subject Re: Iterators that alter key-values
Date Sat, 16 May 2015 15:12:40 GMT
Thanks James. I asked about the filtering example just to check my
understanding was right, but I agree it's probably a corner case.

Re the documentation - I don't think the problem is a failure to conform to the
sorted key part. If you had row keys which were integers in increasing
order, and in the iterator added a million to each row key and emitted that
then you'd still get problems if there was a reseek (assuming that adding a
million took you out of the range). Admittedly I can't see why you'd do
that, but I'd read the javadoc, the manual and the Accumulo book carefully
and I hadn't picked up that the actual key that is emitted is relevant to
the reseek issue.
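
To make the integer example concrete, here is a toy Python model. It is not Accumulo code: `scan`, `batch_size`, and the rule "rebuild the stack after every batch and resume just past the last emitted key" are simplifications assumed for illustration, and a real tablet server may fail differently (for example with an error rather than silently dropped rows).

```python
from bisect import bisect_left, bisect_right

def scan(rows, start, end, transform, batch_size):
    """Toy scan over sorted integer row keys in [start, end].

    Assumption: the iterator stack is torn down after every `batch_size`
    emitted keys and re-seeked to just past the last key the client saw.
    """
    out = []
    last_emitted = None
    while True:
        if last_emitted is None:
            i = bisect_left(rows, start)          # initial seek
        else:
            i = bisect_right(rows, last_emitted)  # re-seek past last emitted key
        source = [r for r in rows[i:] if r <= end]
        batch = [transform(r) for r in source[:batch_size]]
        if not batch:
            return out
        out.extend(batch)
        last_emitted = batch[-1]

rows = list(range(10))

# Keys emitted unchanged: every re-seek lands on the next unread row.
identity = scan(rows, 0, 9, lambda r: r, batch_size=3)

# Keys shifted by a million: still sorted, but the first re-seek starts
# beyond the end of the range, so the remaining rows are never read.
shifted = scan(rows, 0, 9, lambda r: r + 1_000_000, batch_size=3)

print(len(identity), len(shifted))
```

In this model the shifted scan stops after its first batch even though the emitted keys are in sorted order, which is the point: what matters is that the emitted key is used as the resume position, not only whether the output is sorted.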

BTW, none of this is meant to reflect badly on the iterator stack - they're
really powerful and are one of Accumulo's main selling points.


On 16 May 2015 at 14:55, James Hughes <> wrote:

> Hi Dave,
> I can speak to the first question a little bit.  The one time I saw this,
> I traced the code and saw that after emitting a certain number of bytes,
> the iterator stack was recreated.  In that case, no further keys would have
> been filtered since the current key-value pair being emitted would trigger
> the reset and that key would be used for the re-seek.  I'll apply all
> caveats to that explanation: it was Accumulo 1.4 and I didn't learn why
> the stack was stopped and recreated, or at what other times that may happen.
> On the other hand, one could imagine a tablet server dying in the middle
> of returning entries.  I have no idea of the details of how Accumulo
> handles that.  Worst case, you may be right about some reprocessing, but
> all this sounds like a corner case.
> For the documentation, writing about implementation details directly may
> not be the best way.  I'd hope that the documentation would make it clear
> that all iterators (even presumed 'top' or 'final' iterators) should
> conform to the 'sorted key' part of the contract.
> Thanks,
> Jim
> On Sat, May 16, 2015 at 3:27 AM, Dave Hardcastle <> wrote:
>> A couple of follow-up questions...
>> So, is it true to say that a filtering iterator that is filtering out a
>> high percentage of the key-values in a range, might have to redo a lot of
>> work if a reseek happens? (It's reseeked to the last emitted key, but a lot
>> of key-values past that may already have been rejected by the filter.)
>> Would it be worth making the fact that the reseek happens to the last
>> emitted key explicit in the documentation? It seems natural to me to assume
>> that the reseek happens to one key past the last read key. I don't think
>> the javadoc for the seek() method in SortedKeyValueIterator makes it quite
>> clear enough.
>> Thanks,
>> Dave.
>> On 15 May 2015 at 19:32, Eric Newton <> wrote:
>>> is it the same instance of the iterator object
>>> No, it is not.
>>> On Fri, May 15, 2015 at 2:16 PM, Dave Hardcastle <> wrote:
>>>> Jim,
>>>> That explains a lot - I knew that the iterator stack could be resumed
>>>> in the middle of a range, but didn't realise that it used the last emitted
>>>> key to decide where to resume.
>>>> Just so I'm clear, when iterators get stopped and later resumed, is it
>>>> the same instance of the iterator object that's restarted (so that I could
>>>> store state in there and use that to help the reseek) or is it a new
>>>> instance of the iterator that has to be able to resume purely on the basis
>>>> of the last emitted key?
>>>> As you say though, it's probably best to stick to modifying values only.
>>>> Thanks very much,
>>>> Dave.
>>>> On 15 May 2015 at 18:55, James Hughes <> wrote:
>>>>> Hi Dave,
>>>>> The big thing to note is that your iterator stack may get stopped and
>>>>> torn down for various reasons.  As Accumulo recreates the stack, it will
>>>>> call 'seek' with the last emitted key in order to resume.
>>>>> If you are returning keys out of order in an iterator, the 'seek'
>>>>> method needs to be able to undo the transformation and call 'seek'
>>>>> appropriately.  That's not impossible, but it isn't trivial.
>>>>> In GeoMesa, we did something like that at one point (without having a
>>>>> smart 'seek').  I enjoyed two days of debugging trying to figure out why
>>>>> medium sized requests would hang.  (There was an infinite loop....)  After
>>>>> that experience, I'd suggest only modifying values.
>>>>> Cheers,
>>>>> Jim
>>>>> On Fri, May 15, 2015 at 1:26 PM, Dave Hardcastle <> wrote:
>>>>>> Hi,
>>>>>> I've always assumed that the last iterator in the stack can make
>>>>>> arbitrary changes to keys and values, including not returning the keys in
>>>>>> sorted order. I know that SortedKeyValueIterator says that "anything
>>>>>> implementing this interface should return keys in sorted order" - but I
>>>>>> don't see a good reason that has to be true for the final iterator. This
>>>>>> assumption seems to be backed up by the manual which says that "the
>>>>>> safe way to generate additional data in an iterator is to alter the
>>>>>> key-value pair" - it doesn't say that making arbitrary modifications to the
>>>>>> rowkey or key is forbidden.
>>>>>> I have a situation where I am making a transformation of the rowkey
>>>>>> that may not preserve the ordering of the keys. When I scan for individual
>>>>>> ranges I get the correct results. When I scan for two ranges using
>>>>>> BatchScanner, I get lots of data back which is not in the ranges I queried
>>>>>> for. I am not explicitly checking that I have not gone beyond the range,
>>>>>> but that should not be necessary as I am not doing any seeking, only
>>>>>> consuming the key-values I receive.
>>>>>> So, my main question is whether the last iterator is allowed to not
>>>>>> return keys in sorted order?
>>>>>> Thanks,
>>>>>> Dave.
