accumulo-user mailing list archives

From Keith Turner <ke...@deenlo.com>
Subject Re: Iterators that alter key-values
Date Mon, 18 May 2015 22:47:10 GMT
On Mon, May 18, 2015 at 6:31 PM, Keith Turner <keith@deenlo.com> wrote:

>
>
> On Sat, May 16, 2015 at 3:27 AM, Dave Hardcastle <
> hardcastle.dave@gmail.com> wrote:
>
>> A couple of follow-up questions...
>>
>> So, is it true to say that a filtering iterator that is filtering out a
>> high percentage of the key-values in a range might have to redo a lot of
>> work if a reseek happens? (It's reseeked to the last emitted key, but a lot
>> of key-values past that may already have been rejected by the filter.)
>>
>
> This may happen; it depends on what the tserver is doing.  Let's assume a
> call to next on the iterator advances to the next top key, and not past
> it.  If the tserver calls next after the buffer is full, then what you
> described could happen.
>
> So if the tserver is doing something like the following, I think it would
> redo work.  Need to investigate this.
>

I investigated.  Sorry for the spam, should have done this before
sending.

The following code services batch scans.  Seems like it checks if the
buffer is full before calling next.


https://github.com/apache/accumulo/blob/1.6.2/server/tserver/src/main/java/org/apache/accumulo/tserver/Tablet.java#L1538

The following code services scans; it also seems to check if the buffer is
full before calling next.


https://github.com/apache/accumulo/blob/1.6.2/server/tserver/src/main/java/org/apache/accumulo/tserver/Tablet.java#L1684
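
To make the difference concrete, here is a minimal, self-contained Java
sketch.  It does not use any Accumulo classes: ReseekSketch,
FilteringIterator, ScanResult and the "divisible by 100" filter are all
hypothetical stand-ins, and the batch loop is only a simulation, not the
Tablet.java code linked above.  It counts how many entries the filter
examines when the loop checks the buffer before calling next, versus
calling next first and checking afterwards.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

public class ReseekSketch {

    /** Hypothetical stand-in for a filtering iterator; not an Accumulo class. */
    static final class FilteringIterator {
        private final Iterator<Integer> source;
        private Integer top;
        private long examined;

        /** "seek": start strictly after lastKey (non-inclusive), or at the start if null. */
        FilteringIterator(NavigableSet<Integer> data, Integer lastKey) {
            NavigableSet<Integer> view = (lastKey == null) ? data : data.tailSet(lastKey, false);
            this.source = view.iterator();
            next();
        }

        boolean hasTop() { return top != null; }

        Integer getTopKey() { return top; }

        /** Advance to the next accepted key, counting every entry the filter looks at. */
        void next() {
            top = null;
            while (source.hasNext()) {
                Integer k = source.next();
                examined++;
                if (k % 100 == 0) { // assumed filter: keep 1 key in 100
                    top = k;
                    return;
                }
            }
        }
    }

    /** Keys returned by a simulated scan, plus the number of entries examined. */
    static final class ScanResult {
        final List<Integer> keys = new ArrayList<>();
        long examined;
    }

    static ScanResult scan(NavigableSet<Integer> data, int bufferSize, boolean checkBeforeNext) {
        ScanResult result = new ScanResult();
        Integer lastKey = null;
        while (true) {
            // A fresh iterator per batch, resumed only from the last returned key.
            FilteringIterator iter = new FilteringIterator(data, lastKey);
            List<Integer> buffer = new ArrayList<>();
            while (iter.hasTop()) {
                buffer.add(iter.getTopKey());
                if (checkBeforeNext && buffer.size() >= bufferSize) break;
                iter.next(); // when the buffer is already full, this work is thrown away
                if (buffer.size() >= bufferSize) break;
            }
            result.examined += iter.examined;
            if (buffer.isEmpty()) return result;
            result.keys.addAll(buffer);
            lastKey = buffer.get(buffer.size() - 1);
        }
    }

    public static void main(String[] args) {
        NavigableSet<Integer> data = new TreeSet<>();
        for (int i = 0; i < 1000; i++) data.add(i);
        System.out.println("check before next: " + scan(data, 2, true).examined);
        System.out.println("next before check: " + scan(data, 2, false).examined);
    }
}
```

In this simulation both variants return the same keys.  The first examines
each entry once.  The second re-examines, after every batch, the entries
between the last buffered key and the next accepted key.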


>
>   iter = ....
>   iter.seek(...)
>
>   while(iter.hasTop() && !buffer.isFull()){
>     buffer.add(iter.getTopKey(), iter.getTopValue())
>     // if this call to next is made even when the buffer is full,
>     // it could redo work
>     iter.next()
>   }
>
>   return buffer;  // will reseek with the last key (non-inclusive) in the
>                   // buffer later
>
>
>
>>
>> Would it be worth making the fact that the reseek happens to the last
>> emitted key explicit in the documentation? It seems natural to me to assume
>> that the reseek happens to one key past the last read key. I don't think
>> the javadoc for the seek() method in SortedKeyValueIterator makes it quite
>> clear enough.
>>
>
> When the reseek is done using the last key returned, that key is made
> non-inclusive.  What are your thoughts on the following paragraph?
>
>
> https://github.com/apache/accumulo/blob/1.6.2/core/src/main/java/org/apache/accumulo/core/iterators/SortedKeyValueIterator.java#L81
>
>
>>
>> Thanks,
>>
>> Dave.
>>
>> On 15 May 2015 at 19:32, Eric Newton <eric.newton@gmail.com> wrote:
>>
>>> is it the same instance of the iterator object
>>>
>>>
>>> No, it is not.
>>>
>>> On Fri, May 15, 2015 at 2:16 PM, Dave Hardcastle <
>>> hardcastle.dave@gmail.com> wrote:
>>>
>>>> Jim,
>>>>
>>>> That explains a lot - I knew that the iterator stack could be resumed
>>>> in the middle of a range, but didn't realise that it used the last emitted
>>>> key to decide where to resume.
>>>>
>>>> Just so I'm clear, when iterators get stopped and later resumed, is it
>>>> the same instance of the iterator object that's restarted (so that I could
>>>> store state in there and use that to help the reseek) or is it a new
>>>> instance of the iterator that has to be able to resume purely on the basis
>>>> of the last emitted key?
>>>>
>>>> As you say though, it's probably best to stick to modifying values only.
>>>>
>>>> Thanks very much,
>>>>
>>>> Dave.
>>>>
>>>> On 15 May 2015 at 18:55, James Hughes <jnh5y@virginia.edu> wrote:
>>>>
>>>>> Hi Dave,
>>>>>
>>>>> The big thing to note is that your iterator stack may get stopped and
>>>>> torn down for various reasons.  As Accumulo recreates the stack, it will
>>>>> call 'seek' with the last emitted key in order to resume.
>>>>>
>>>>> If you are returning keys out of order in an iterator, the 'seek'
>>>>> method needs to be able to undo the transformation and call 'seek'
>>>>> appropriately.  That's not impossible, but it isn't trivial.
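
A small sketch of why this gets tricky, assuming a hypothetical
transformation (string reversal) and a sorted set standing in for the
tablet's data.  None of these names come from Accumulo or GeoMesa.  Each
loop pass stands for one batch of size 1: the stack is rebuilt and
re-seeked, non-inclusively, with the last emitted (transformed) key, and
nothing undoes the transformation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

public class OutOfOrderReseekSketch {

    /** Hypothetical key transformation that does not preserve sort order. */
    static String transform(String sourceKey) {
        return new StringBuilder(sourceKey).reverse().toString();
    }

    /** Simulates batches of size 1, re-seeking with the emitted key each time. */
    static List<String> scan(NavigableSet<String> source, int maxBatches) {
        List<String> emitted = new ArrayList<>();
        String lastEmitted = null;
        // maxBatches is a safety cap so the sketch terminates.
        for (int batch = 0; batch < maxBatches; batch++) {
            // Reseek in *source* key space using a *transformed* key.
            NavigableSet<String> remaining =
                (lastEmitted == null) ? source : source.tailSet(lastEmitted, false);
            if (remaining.isEmpty()) break;
            lastEmitted = transform(remaining.first());
            emitted.add(lastEmitted);
        }
        return emitted;
    }

    public static void main(String[] args) {
        NavigableSet<String> source = new TreeSet<>(List.of("ab", "ba", "ca"));
        System.out.println(scan(source, 6));
    }
}
```

With the source keys ab, ba and ca, this emits ba, then ac, and then
keeps emitting ab until the cap is reached: the reseek position never
moves past the source key ba.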
>>>>>
>>>>> In GeoMesa, we did something like that at one point (without having a
>>>>> smart 'seek').  I enjoyed two days of debugging trying to figure out why
>>>>> medium sized requests would hang.  (There was an infinite loop....)  From
>>>>> that experience, I'd suggest only modifying values.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Jim
>>>>>
>>>>>
>>>>> On Fri, May 15, 2015 at 1:26 PM, Dave Hardcastle <
>>>>> hardcastle.dave@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I've always assumed that the last iterator in the stack can make
>>>>>> arbitrary changes to keys and values, including not returning the keys in
>>>>>> sorted order. I know that SortedKeyValueIterator says that "anything
>>>>>> implementing this interface should return keys in sorted order" - but I
>>>>>> don't see a good reason that has to be true for the final iterator. This
>>>>>> assumption seems to be backed up by the manual which says that "the only
>>>>>> safe way to generate additional data in an iterator is to alter the current
>>>>>> key-value pair" - it doesn't say that making arbitrary modifications to the
>>>>>> rowkey or key is forbidden.
>>>>>>
>>>>>> I have a situation where I am making a transformation of the rowkey
>>>>>> that may not preserve the ordering of the keys. When I scan for individual
>>>>>> ranges I get the correct results. When I scan for two ranges using a
>>>>>> BatchScanner, I get lots of data back which is not in the ranges I queried
>>>>>> for. I am not explicitly checking that I have not gone beyond the range,
>>>>>> but that should not be necessary as I am not doing any seeking, only
>>>>>> consuming the key-values I receive.
>>>>>>
>>>>>> So, my main question is whether the last iterator is allowed to not
>>>>>> return keys in sorted order?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Dave.
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
