accumulo-user mailing list archives

From James Hughes <jn...@virginia.edu>
Subject Re: Iterators returning keys out of scan range
Date Sat, 25 May 2013 18:36:39 GMT
Hi all,

Just to echo and expand on Adam's comment, I'd suggest not trying this at
work either!

This week, I was moving some code from Accumulo 1.3.4 to Accumulo 1.4.3,
and I was tracking down a problem with an iterator we had written which
worked for *most* cases.  When it broke, the scan would loop forever.  In
the end, we only rarely had data big enough to trip the reseek that Adam
mentioned, and we needed to update our code to deal with that correctly.

As for some kind of test, I am unsure if the Mock versions do reseeking,
etc.  If they do, then you'd just need to reason through having enough
data in a row to make the Tablet/scanner reseek.  (The size cut-off that I
saw the Tablet looking for was 1 megabyte.)  If Mock doesn't reseek, this
kind of test would need to run against a running Accumulo instance.

That said, I think the same behavior could be seen in a unit test by
reseeking after *each* call to next.  At least this would test an
iterator's ability to reseek to any arbitrary position.  I imagine writing
a ReseekingIterator wouldn't be hard, and then one could add it as the last
iterator in any existing unit tests...
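
A minimal sketch of that idea follows. It is a toy model, not Accumulo's
API: SeekableSource, SortedSetSource, PrefixStripper, ReseekingSource and
ReseekDemo are all hypothetical names, and keys are plain strings rather
than Accumulo Keys. The wrapper re-seeks just past the last returned key
on every next(), and drain() caps the number of results so that an
iterator that cannot resume shows up as repeated keys instead of a hung
test.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableSet;

/** Hypothetical, simplified stand-in for a seekable sorted iterator (not Accumulo's API). */
interface SeekableSource {
    void seek(String start, boolean inclusive); // position at first key >= (or >) start
    boolean hasTop();
    String top();
    void next();
}

/** Base source backed by a sorted in-memory set of keys. */
class SortedSetSource implements SeekableSource {
    private final NavigableSet<String> keys;
    private Iterator<String> it;
    private String current;
    SortedSetSource(NavigableSet<String> keys) { this.keys = keys; }
    public void seek(String start, boolean inclusive) {
        it = keys.tailSet(start, inclusive).iterator();
        next();
    }
    public boolean hasTop() { return current != null; }
    public String top() { return current; }
    public void next() { current = it.hasNext() ? it.next() : null; }
}

/** Example of an irreversible transform: drops the first character of each key. */
class PrefixStripper implements SeekableSource {
    private final SeekableSource inner;
    PrefixStripper(SeekableSource inner) { this.inner = inner; }
    public void seek(String start, boolean inclusive) { inner.seek(start, inclusive); }
    public boolean hasTop() { return inner.hasTop(); }
    public String top() { return inner.top().substring(1); }
    public void next() { inner.next(); }
}

/** Wrapper that re-seeks just past the last returned key on every next(). */
class ReseekingSource implements SeekableSource {
    private final SeekableSource inner;
    ReseekingSource(SeekableSource inner) { this.inner = inner; }
    public void seek(String start, boolean inclusive) { inner.seek(start, inclusive); }
    public boolean hasTop() { return inner.hasTop(); }
    public String top() { return inner.top(); }
    public void next() { inner.seek(inner.top(), false); } // resume after the last key seen
}

class ReseekDemo {
    /** Collects at most `limit` keys so a non-resumable iterator cannot loop forever. */
    static List<String> drain(SeekableSource src, int limit) {
        List<String> out = new ArrayList<>();
        src.seek("", true);
        while (src.hasTop() && out.size() < limit) {
            out.add(src.top());
            src.next();
        }
        return out;
    }
}
```

With underlying keys "k1" and "k2", wrapping the plain source changes
nothing, but wrapping PrefixStripper returns "1" over and over: re-seeking
past "1" lands back on "k1", which is the looping symptom described above.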

I think there are two principles to test here.  First, all iterators should
provide a sorted "view" of their underlying input.  And second, an iterator
should be able to resume (i.e., be re-seeked) from the last key it
returned.  I say "view" since an iterator could be combining multiple rows
into something else to be returned to the client.
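
Those two principles can be written down as checks. The sketch below is
only an illustration under simplifying assumptions: a scan is modeled as a
function from an exclusive start key to the list of (string) keys returned
after it, and IteratorContract, scanOver, isSorted and isResumable are
hypothetical names, not part of Accumulo.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.function.Function;
import java.util.function.UnaryOperator;

class IteratorContract {
    /** Models a scan: given an exclusive start key, return the transformed keys after it. */
    static Function<String, List<String>> scanOver(NavigableSet<String> keys,
                                                   UnaryOperator<String> transform) {
        return start -> {
            List<String> out = new ArrayList<>();
            for (String k : keys.tailSet(start, false)) {
                out.add(transform.apply(k));
            }
            return out;
        };
    }

    /** Principle 1: returned keys are in strictly increasing order. */
    static boolean isSorted(List<String> returned) {
        for (int i = 1; i < returned.size(); i++) {
            if (returned.get(i - 1).compareTo(returned.get(i)) >= 0) {
                return false;
            }
        }
        return true;
    }

    /** Principle 2: resuming after any returned key yields exactly the rest of the scan. */
    static boolean isResumable(Function<String, List<String>> scan, String start) {
        List<String> full = scan.apply(start);
        for (int i = 0; i < full.size(); i++) {
            List<String> resumed = scan.apply(full.get(i));
            if (!resumed.equals(full.subList(i + 1, full.size()))) {
                return false;
            }
        }
        return true;
    }
}
```

For example, an identity transform over the keys "ab" and "ba" passes both
checks, while a transform that reverses each key string returns "ba" before
"ab" and so fails the sorted check.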

Jim


On Sat, May 25, 2013 at 1:09 PM, Christopher <ctubbsii@apache.org> wrote:

> He's talking about using iterators that transform keys (we don't have
> any built-in, IIRC), like those that extend the new
> TransformingIterator. Scanner logic is written, such that it will
> resume scanning from the last key it received. This is important for
> handling failures and splits/migrations during a scan. So, in this
> context, a "reversible transformation" simply means that when the
> client tells the tserver's iterator stack to scan, it can transform what
> the client thinks is the starting point for the scan, back to what it
> actually should have been prior to transformation, so it can resume
> from the correct place. This is necessary, because the client will not
> know what the data looked like prior to transformation, as it only
> sees data returned from the iterator stack.
>
> Now, the assumption here is that the key that the client *thinks* is
> the starting point is in the same tablet that the real starting point *is*.
> Otherwise, it doesn't matter if the transformation is reversible,
> because the real starting point could be on a different tablet
> entirely (due to splits). To ensure this doesn't happen, it's
> important to make sure that transforming iterators that you implement
> do not transform the RowID portion of the key... or else, if they do,
> they can send a special key back, that is understood by client code
> that can inform the client to query a different tablet server... the
> one the client needs to resume scanning from.
>
> Yes, there should be unit tests, but the unit tests would be against
> iterators that actually transform keys in this way... and I don't
> think we provide any. That'd be user code.
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
> On Sat, May 25, 2013 at 9:36 AM, David Medinets
> <david.medinets@gmail.com> wrote:
> > Is there a unit test exposing this behavior? And what does "reversible
> > transformation" mean?
> >
> >
> > On Wed, May 1, 2013 at 8:36 PM, Adam Fuchs <afuchs@apache.org> wrote:
> >>
> >> For all the rest of you on this thread, the big problem you'll run
> >> into when returning keys out of range is that the reseeking behavior
> >> will skip a bunch of underlying keys (i.e. don't try this at home).
> >> For example, say you have tablets ["A","D"], ("D","M"], and
> >> ("M","ZZZZ..."]. If you do a query on ["A","M"] and return "N" after
> >> seeing the underlying key "A", you may never see keys from the
> >> ("D","M"] tablet. A good rule of thumb is to return keys in the same
> >> row as the underlying keys that were used to generate them and use a
> >> reversible transformation of columns within each row.
> >>
> >
>
