accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: Iterators that alter key-values
Date Sun, 17 May 2015 03:14:52 GMT
(and let us know via JIRA if there's more that can be expanded on or
clarified. The HTML version of these docs should be up soon too, btw)

Dylan Hutchison wrote:
> Dave,
>
> Check out the new chapter on iterator design
> <https://github.com/apache/accumulo/blob/master/docs/src/main/asciidoc/chapters/iterator_design.txt>
> going into 1.7 (applicable to all versions).
>
> Emitting entries in unsorted order should be ok for scan iterators but
> definitely not for compaction iterators.  Compaction iterators will fail
> when the FileSKVWriter sees an entry out of order
> <https://github.com/apache/accumulo/blob/master/core/src/main/java/org/apache/accumulo/core/file/rfile/RFile.java#L370>.
>
>
> Use cases like these are exciting.  Be prepared to debug the tablet
> server if your idea doesn't work.
>
> Cheers, Dylan
>
>
> On Sat, May 16, 2015 at 11:12 AM, Dave Hardcastle
> <hardcastle.dave@gmail.com> wrote:
>
>     Thanks James. I asked about the filtering example just to check my
>     understanding was right, but I agree it's probably a corner case.
>
>     Re the documentation - I don't think the problem is not conforming
>     to the sorted key part. If you had row keys which were integers in
>     increasing order, and in the iterator added a million to each row
>     key and emitted that then you'd still get problems if there was a
>     reseek (assuming that adding a million took you out of the range).
>     Admittedly I can't see why you'd do that, but I'd read the javadoc,
>     the manual and the Accumulo book carefully and I hadn't picked up
>     that the actual key that is emitted is relevant to the reseek issue.
>
>     BTW, none of this is meant to reflect badly on the iterator stack -
>     they're really powerful and are one of Accumulo's main selling points.
>
>     Dave.
>
>
>     On 16 May 2015 at 14:55, James Hughes <jnh5y@virginia.edu> wrote:
>
>         Hi Dave,
>
>         I can speak to the first question a little bit.  The one time I
>         saw this, I traced the code and saw that after emitting a
>         certain number of bytes, the iterator stack was recreated.  In
>         that case, no further keys would have been filtered since the
>         current key-value pair being emitted would trigger the reset and
>         that key would be used for the re-seek.  I'll apply all caveats
>         to that explanation: it was Accumulo 1.4, and I didn't learn
>         why the stack was stopped and recreated or at what other times
>         that may happen.
>
>         On the other hand, one could imagine a tablet server dying in
>         the middle of returning entries.  I have no idea of the details
>         of how Accumulo handles that.  Worst case, you may be right
>         about some reprocessing, but all this sounds like a corner case.
>
>         For the documentation, writing about implementation details
>         directly may not be the best way.  I'd hope that the
>         documentation would make it clear that all iterators (even
>         presumed 'top' or 'final' iterators) should conform to the
>         'sorted key' part of the contract.
>
>         Thanks,
>
>         Jim
>
>
>         On Sat, May 16, 2015 at 3:27 AM, Dave Hardcastle
>         <hardcastle.dave@gmail.com> wrote:
>
>             A couple of follow-up questions...
>
>             So, is it true to say that a filtering iterator that is
>             filtering out a high percentage of the key-values in a
>             range, might have to redo a lot of work if a reseek happens?
>             (It's reseeked to the last emitted key, but a lot of
>             key-values past that may already have been rejected by the
>             filter.)
>
>             Would it be worth making the fact that the reseek happens to
>             the last emitted key explicit in the documentation? It seems
>             natural to me to assume that the reseek happens to one key
>             past the last read key. I don't think the javadoc for the
>             seek() method in SortedKeyValueIterator makes it quite clear
>             enough.
>
>             Thanks,
>
>             Dave.
>
>             On 15 May 2015 at 19:32, Eric Newton <eric.newton@gmail.com> wrote:
>
>                     is it the same instance of the iterator object
>
>
>                 No, it is not.
>
>                 On Fri, May 15, 2015 at 2:16 PM, Dave Hardcastle
>                 <hardcastle.dave@gmail.com> wrote:
>
>                     Jim,
>
>                     That explains a lot - I knew that the iterator stack
>                     could be resumed in the middle of a range, but
>                     didn't realise that it used the last emitted key to
>                     decide where to resume.
>
>                     Just so I'm clear, when iterators get stopped and
>                     later resumed, is it the same instance of the
>                     iterator object that's restarted (so that I could
>                     store state in there and use that to help the
>                     reseek) or is it a new instance of the iterator that
>                     has to be able to resume purely on the basis of the
>                     last emitted key?
>
>                     As you say though, it's probably best to stick to
>                     modifying values only.
>
>                     Thanks very much,
>
>                     Dave.
>
>                     On 15 May 2015 at 18:55, James Hughes
>                     <jnh5y@virginia.edu> wrote:
>
>                         Hi Dave,
>
>                         The big thing to note is that your iterator
>                         stack may get stopped and torn down for various
>                         reasons.  As Accumulo recreates the stack, it
>                         will call 'seek' with the last emitted key in
>                         order to resume.
>
>                         If you are returning keys out of order in an
>                         iterator, the 'seek' method needs to be able to
>                         undo the transformation and call 'seek'
>                         appropriately.  That's not impossible, but it
>                         isn't trivial.
>
>                         In GeoMesa, we did something like that at one
>                         point (without having a smart 'seek').  I
>                         enjoyed two days of debugging trying to figure
>                         out why medium sized requests would hang.
>                         (There was an infinite loop....)  From that
>                         experience, I'd suggest only modifying values.
>
>                         Cheers,
>
>                         Jim
>
>
>                         On Fri, May 15, 2015 at 1:26 PM, Dave Hardcastle
>                         <hardcastle.dave@gmail.com> wrote:
>
>                             Hi,
>
>                             I've always assumed that the last iterator
>                             in the stack can make arbitrary changes to
>                             keys and values, including not returning the
>                             keys in sorted order. I know that
>                             SortedKeyValueIterator says that "anything
>                             implementing this interface should return
>                             keys in sorted order" - but I don't see a
>                             good reason that has to be true for the
>                             final iterator. This assumption seems to be
>                             backed up by the manual which says that "the
>                             only safe way to generate additional data in
>                             an iterator is to alter the current
>                             key-value pair" - it doesn't say that making
>                             arbitrary modifications to the rowkey or key
>                             is forbidden.
>
>                             I have a situation where I am making a
>                             transformation of the rowkey that may not
>                             preserve the ordering of the keys. When I
>                             scan for individual ranges I get the correct
>                             results. When I scan for two ranges using a
>                             BatchScanner, I get lots of data back which
>                             is not in the ranges I queried for. I am not
>                             explicitly checking that I have not gone
>                             beyond the range, but that should not be
>                             necessary as I am not doing any seeking,
>                             only consuming the key-values I receive.
>
>                             So, my main question is whether the last
>                             iterator is allowed to not return keys in
>                             sorted order?
>
>                             Thanks,
>
>                             Dave.
>
>
>
>
>
>
>
>
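The reseek problem discussed in this thread can be illustrated with a small
toy model. This is only a sketch under stated assumptions: it uses a plain
java.util.TreeMap in place of a tablet, it is not Accumulo code, and it does
not use the SortedKeyValueIterator API. The class name ReseekSketch, the
method scanWithTeardown and the batchSize parameter are all hypothetical. The
"iterator" adds a million to each integer row key (Dave's example), and the
"stack" is torn down after every batchSize entries and resumed from just past
the last emitted key (as Jim describes).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model only: a TreeMap stands in for a tablet, and this is NOT the Accumulo API.
class ReseekSketch {
    static final int OFFSET = 1_000_000;

    // Scans rows in [start, end] and emits each row key shifted by OFFSET.
    // After every batchSize entries the simulated stack is torn down, and the
    // new stack is seeked to just past the last EMITTED key.
    // If undoTransformOnSeek is true, the seek first maps the emitted key back
    // to the underlying row key; otherwise it uses the emitted key as-is.
    static List<Integer> scanWithTeardown(TreeMap<Integer, String> table,
                                          int start, int end, int batchSize,
                                          boolean undoTransformOnSeek) {
        List<Integer> emitted = new ArrayList<>();
        int seekFrom = start;
        boolean inclusive = true;
        // If the resume point lies beyond the end of the range, nothing more is read.
        while (seekFrom <= end) {
            int count = 0;
            int lastEmitted = 0;
            for (int row : table.subMap(seekFrom, inclusive, end, true).keySet()) {
                if (count == batchSize) {
                    break; // simulated teardown of the iterator stack
                }
                lastEmitted = row + OFFSET; // the key-altering transformation
                emitted.add(lastEmitted);
                count++;
            }
            if (count < batchSize) {
                break; // range exhausted before a teardown was triggered
            }
            // Resume from the last emitted key, exclusive.
            seekFrom = undoTransformOnSeek ? lastEmitted - OFFSET : lastEmitted;
            inclusive = false;
        }
        return emitted;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> table = new TreeMap<>();
        for (int i = 1; i <= 5; i++) {
            table.put(i, "value" + i);
        }
        // Naive seek: the resume key 1000002 is outside [1, 5], so rows 3..5 are lost.
        System.out.println(scanWithTeardown(table, 1, 5, 2, false));
        // Seek that undoes the transformation: all five rows are emitted.
        System.out.println(scanWithTeardown(table, 1, 5, 2, true));
    }
}
```

In this model the naive scan returns only the first batch, while the scan that
maps the emitted key back before seeking returns every row. A transformation
that does not preserve ordering would be harder to undo than this simple
shift, which is in line with the advice above to modify values only.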