accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values
Date Tue, 15 Jul 2014 16:29:05 GMT
There's been some mention about a desire to rethink the Iterator 
interface as it has some deficiencies (notably the lack of a "cleanup" 
before the iterators are torn down), but no one has stated that they're 
actively working on this.

As for better documentation of conventions: let us know where the
Accumulo documentation falls short (and give us patches to fix the
documentation :D). Additionally, write up your own findings from
problems that you've run into. We need the entire community (users
specifically) to help this grow.

Even things as simple as "how do I count entries in an iterator" are big 
as you are now an "expert" on the subject :)

On 7/15/14, 12:17 PM, Michael Moss wrote:
> That worked ;) - Thanks!
>
> What a journey...
>
> I like Accumulo's architecture and promise, but the difficulty in
> querying it (lack of documentation, conventions) is a major concern and
> I'd imagine has to have an impact on adoption. I'm curious if there have
> been any conversations around changing the interface around iterators
> which are still confusing to me. Let me know how I can help!
>
>
> On Tue, Jul 15, 2014 at 12:03 PM, William Slacum
> <wilhelm.von.cloud@accumulo.net <mailto:wilhelm.von.cloud@accumulo.net>>
> wrote:
>
>     Herp... serves me right for not setting up a proper test case.
>
>     I think you need to override seek as well:
>
>     @Override
>     public void seek(...) throws IOException {
>        super.seek(...);
>        next();
>     }
>
>     I think I just realized the wrapping iterator could use some clean
>     up, because this isn't obvious. Basically after the wrapping
>     iterator's seek is called, it never calls the implementor's next()
>     to actually set up the first top key and value.
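
To make the effect of that missing next() call concrete, here is a minimal sketch. It does not use Accumulo's classes: ListSource and SummingIterator are hypothetical stand-ins, seek() takes no arguments here for brevity, and the values 2 and 3 simply mirror the counts from the original question. The primeOnSeek flag exists only so both behaviours can be compared side by side.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified stand-in for a source iterator; NOT Accumulo's SortedKeyValueIterator. */
class ListSource {
    private final List<Integer> values;
    private int pos;

    ListSource(List<Integer> values) {
        this.values = values;
    }

    void seek() { pos = 0; }                      // rewind to the first entry
    boolean hasTop() { return pos < values.size(); }
    int getTop() { return values.get(pos); }
    void next() { pos++; }
}

/** Hypothetical aggregating iterator: folds every source value into one summed entry. */
class SummingIterator {
    private final ListSource source;
    private final boolean primeOnSeek;
    private Integer top;                          // null means "no top entry"

    SummingIterator(ListSource source, boolean primeOnSeek) {
        this.source = source;
        this.primeOnSeek = primeOnSeek;
    }

    void seek() {
        source.seek();
        top = null;
        if (primeOnSeek) {
            next();                               // the fix: compute the first top entry
        }
    }

    void next() {
        top = null;
        if (!source.hasTop()) {
            return;                               // source exhausted, nothing to report
        }
        int sum = 0;
        while (source.hasTop()) {                 // loop on the SOURCE's hasTop, not our own
            sum += source.getTop();
            source.next();
        }
        top = sum;
    }

    boolean hasTop() { return top != null; }
    int getTop() { return top; }
}

class SeekPrimingDemo {
    public static void main(String[] args) {
        SummingIterator broken = new SummingIterator(new ListSource(Arrays.asList(2, 3)), false);
        broken.seek();
        System.out.println("without priming, hasTop = " + broken.hasTop());

        SummingIterator fixed = new SummingIterator(new ListSource(Arrays.asList(2, 3)), true);
        fixed.seek();
        System.out.println("with priming, top = " + fixed.getTop());
    }
}
```

Without the priming call the caller sees hasTop() return false straight after seek() and concludes there is no data, which would be consistent with the "0 entries" scan quoted further down.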
>
>
>
>     On Tue, Jul 15, 2014 at 9:50 AM, Michael Moss
>     <michael.moss@gmail.com <mailto:michael.moss@gmail.com>> wrote:
>
>         I set up debugging and am rethrowing the exception. What's
>         strange is that, although the iterator instance is properly
>         set to iterator.Counter (my implementation), my breakpoints
>         there aren't being hit; only those in the parent classes
>         (WrappingIterator and SortedKeyValueIterator) are.
>
>         I have two rows in the table, when I scan with no iterator:
>         2014-07-15 06:46:26,577 [Audit   ] INFO : operation: permitted;
>         user: root; action: scan; targetTable: pojo; authorizations:
>         public,; range: (-inf,+inf); columns: []; iterators: [];
>         iteratorOptions: {};
>         2014-07-15 06:46:26,589 [tserver.TabletServer] DEBUG: ScanSess
>         tid 10.0.2.15:45073 8 *2 entries* in
>         0.01 secs, nbTimes = [7 7 7.00 1]
>
>         When I scan with the iterator (0 entries?):
>         2014-07-15 06:45:58,036 [Audit   ] INFO : operation: permitted;
>         user: root; action: scan; targetTable: pojo; authorizations:
>         public,; range: (-inf,+inf); columns: []; iterators: [];
>         iteratorOptions: {};
>         2014-07-15 06:45:58,047 [tserver.TabletServer] DEBUG: ScanSess
>         tid 10.0.2.15:44992 8 *0 entries* in
>         0.01 secs, nbTimes = [6 6 6.00 1]
>
>         No exceptions otherwise. Really appreciate all the ongoing help.
>
>         Best,
>
>         -Mike
>
>
>         On Mon, Jul 14, 2014 at 6:40 PM, William Slacum
>         <wilhelm.von.cloud@accumulo.net
>         <mailto:wilhelm.von.cloud@accumulo.net>> wrote:
>
>             Anything in your Tserver log? I think you should just
>             rethrow that IOException from your source's next() method,
>             since they're usually not recoverable (i.e., just make
>             Counter#next throw IOException).
>
>
>             On Mon, Jul 14, 2014 at 5:48 PM, Josh Elser
>             <josh.elser@gmail.com <mailto:josh.elser@gmail.com>> wrote:
>
>                 A quick sanity check is to make sure you have data in
>                 the table and that you can read the data without your
>                 iterator (I've thought I had a bug because I didn't have
>                 proper visibilities more times than I'd like to admit).
>
>                 Alternatively, you can also enable remote-debugging via
>                 Eclipse into the TabletServer which might help you
>                 understand more of what's going on.
>
>                 There are lots of articles on how to set this up [1].
>                 In short, add -Xdebug
>                 -Xrunjdwp:transport=dt_socket,server=y,address=8000 to
>                 ACCUMULO_TSERVER_OPTS in accumulo-env.sh, restart the
>                 tserver, connect Eclipse to port 8000 via the Debug
>                 configuration menu, set a breakpoint in your init, seek
>                 and next methods, and `scan` in the shell.
>
>
>                 [1]
>                 http://javarevisited.blogspot.com/2011/02/how-to-setup-remote-debugging-in.html
>
>
>                 On 7/14/14, 5:33 PM, Michael Moss wrote:
>
>                     Hmm...Still doesn't return anything from the shell.
>
>                     http://pastebin.com/ndRhspf8
>
>                     Any thoughts? What's the best way to debug these?
>
>
>                     On Mon, Jul 14, 2014 at 5:14 PM, William Slacum
>                     <wilhelm.von.cloud@accumulo.net
>                     <mailto:wilhelm.von.cloud@accumulo.net>> wrote:
>
>                          Ah, an artifact of me just willy-nilly writing an
>                          iterator :) Any reference to `this.source` should
>                          be replaced with `this.getSource()`. In `next()`,
>                          your workaround ends up calling `this.hasTop()` as
>                          the while loop condition. It will always return
>                          false because two lines up we set `top_key` to
>                          null. We need to make sure that the source
>                          iterator has a top, because we want to read data
>                          from it. We'll have to change the loop condition
>                          to `while(this.getSource().hasTop())`. On line 38
>                          of your code we'll need to call
>                          `this.getSource().next()` instead of
>                          `this.next()`.
>
>                          The iterator interface is documented, but there
>                          hasn't been a definitive go-to guide for making
>                          one. I've been drafting a blog post, but since it
>                          doesn't exist yet, hopefully the following will
>                          suffice.
>
>                          The lifetime of an iterator is (usually) as
>                          follows:
>
>                          (1) A new instance is created via
>                          Class.newInstance (so a no-args constructor is
>                          needed).
>                          (2) init() is called. This allows users to
>                          configure the iterator, set its source, and
>                          possibly check the environment. We can also call
>                          `deepCopy` on the source if we want to have
>                          multiple sources (we'd do this if we wanted to do
>                          a merge read out of multiple column families
>                          within a row).
>                          (3) seek() is called. This gets our readers to
>                          the correct positions in the data that are within
>                          the scan range the user requested, as well as
>                          turning column families on or off. The name
>                          should be reminiscent of seeking to some key on
>                          disk.
>                          (4) hasTop() is called. If true, that means we
>                          have data, and the iterator has a key/value pair
>                          that can be retrieved by calling getTopKey() and
>                          getTopValue(). If false, we're done because
>                          there's no data to return.
>                          (5) next() is called. This will attempt to find a
>                          new top key and value. We go back to (4) to see
>                          if next() was successful in finding a new top
>                          key/value and will repeat until the client is
>                          satisfied or hasTop() returns false.
>
>                          You can kind of make a state machine out of those
>                          steps where we loop between (4) and (5) until
>                          there's no data. There are more advanced
>                          workflows where next() can be reading from
>                          multiple sources, as well as seeking them to
>                          different positions in the tablet.
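
The five lifecycle steps can be sketched as a small driver loop. This is a simplified model, not Accumulo's API: MiniIterator, MapBackedIterator and LifecycleDemo are hypothetical names, keys and values are plain strings, and seek() takes a single start key instead of a range plus column families.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Simplified, hypothetical model of the iterator contract; not Accumulo's real interface. */
interface MiniIterator {
    void init(NavigableMap<String, String> source);
    void seek(String startKey);
    boolean hasTop();
    String getTopKey();
    String getTopValue();
    void next();
}

/** Iterator over a sorted map, standing in for sorted data in a tablet. */
class MapBackedIterator implements MiniIterator {
    private NavigableMap<String, String> source;
    private Map.Entry<String, String> top;        // null means "no top entry"

    public MapBackedIterator() { }                // step (1) needs a no-args constructor

    @Override public void init(NavigableMap<String, String> source) { this.source = source; }
    @Override public void seek(String startKey) { top = source.ceilingEntry(startKey); }
    @Override public boolean hasTop() { return top != null; }
    @Override public String getTopKey() { return top.getKey(); }
    @Override public String getTopValue() { return top.getValue(); }
    @Override public void next() { top = source.higherEntry(top.getKey()); }
}

class LifecycleDemo {
    /** Drives an iterator through steps (1) to (5) and collects what it returns. */
    static List<String> scan(Class<? extends MiniIterator> type,
                             NavigableMap<String, String> data, String startKey) {
        MiniIterator it;
        try {
            it = type.getDeclaredConstructor().newInstance();   // (1) reflective construction
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("iterator needs a no-args constructor", e);
        }
        it.init(data);                                          // (2) hand over the source
        it.seek(startKey);                                      // (3) position at the range start
        List<String> results = new ArrayList<>();
        while (it.hasTop()) {                                   // (4) is there a top entry?
            results.add(it.getTopKey() + "=" + it.getTopValue());
            it.next();                                          // (5) advance, then back to (4)
        }
        return results;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> data = new TreeMap<>();
        data.put("a", "1");
        data.put("b", "2");
        data.put("c", "3");
        System.out.println(scan(MapBackedIterator.class, data, "b"));
    }
}
```

Note that in this model seek() itself leaves the iterator with a valid top entry, which is what the loop in scan() relies on before it ever calls next().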
>
>
>                          On Mon, Jul 14, 2014 at 4:51 PM, Michael Moss
>                          <michael.moss@gmail.com
>                          <mailto:michael.moss@gmail.com>> wrote:
>
>                              Thanks, William. I was just hitting you up for
>                              an example :)
>
>                              I adapted your pseudocode
>                              (http://pastebin.com/ufPJq0g3), but noticed
>                              that "this.source" in your example didn't have
>                              visibility. Did I work around it correctly?
>
>                              When I add my iterator to my table and run
>                              scan from the shell, it returns nothing - what
>                              should I expect here? In general I've found the
>                              iterator interface pretty confusing and haven't
>                              spent the time wrapping my head around it yet.
>                              Any documentation or examples (beyond what I
>                              could find on the site or in the code)
>                              appreciated!
>
>                              root@dev> table pojo
>                              root@dev pojo> listiter -scan -t pojo
>                              -
>                              -    Iterator counter, scan scope options:
>                              -        iteratorPriority = 10
>                              -        iteratorClassName = iterators.Counter
>                              -
>                              root@dev pojo> scan
>                              root@dev pojo>
>
>
>                              Best,
>
>                              -Mike
>
>
>
>
>                              On Mon, Jul 14, 2014 at 4:07 PM, William Slacum
>                              <wilhelm.von.cloud@accumulo.net
>                              <mailto:wilhelm.von.cloud@accumulo.net>> wrote:
>
>                                  For a bit of pseudocode, I'd probably make
>                                  a class that did something akin to:
>                                  http://pastebin.com/pKqAeeCR
>
>                                  I wrote that up real quick in a text
>                                  editor; it won't compile or anything, but
>                                  should point you in the right direction.
>
>
>                                  On Mon, Jul 14, 2014 at 3:44 PM, William Slacum
>                                  <wilhelm.von.cloud@accumulo.net
>                                  <mailto:wilhelm.von.cloud@accumulo.net>> wrote:
>
>                                      Hi Mike!
>
>                                      The Combiner interface is only for
>                                      aggregating keys within a single row.
>                                      You can probably get away with
>                                      implementing your combining logic in a
>                                      WrappingIterator that reads across all
>                                      the rows in a given tablet.
>
>                                      To do some combine/fold/reduce
>                                      operation, Accumulo needs the input
>                                      type to be the same as the output
>                                      type. The combiner doesn't have a
>                                      notion of a "present" type (as you'd
>                                      see in something like Algebird's
>                                      Groups), but you can use another
>                                      iterator to perform your
>                                      transformation.
>
>                                      If you wanted to extract the "count"
>                                      field from your Avro object, you could
>                                      write a new Iterator that took your
>                                      Avro object, extracted the desired
>                                      field, and returned it as its top
>                                      value. You can then set this iterator
>                                      as the source of the aggregator,
>                                      either programmatically or by wrapping
>                                      the source object passed to the
>                                      aggregator in its
>                                      SortedKeyValueIterator#init call.
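
A rough sketch of that layering, using plain java.util.Iterator rather than Accumulo's SortedKeyValueIterator so it stays self-contained. MyHugeRecord, ExtractingIterator and ExtractDemo are hypothetical names, and MyHugeRecord is only a stand-in for the Avro object from the original question. The aggregator at the top only ever sees Integer, so its input and output types match.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

/** Hypothetical stand-in for the large Avro object in the thread. */
class MyHugeRecord {
    final String payload;
    final int count;

    MyHugeRecord(String payload, int count) {
        this.payload = payload;
        this.count = count;
    }
}

/** Wraps a source and exposes only an extracted field as each value. */
class ExtractingIterator<S, T> implements Iterator<T> {
    private final Iterator<S> source;
    private final Function<S, T> extractor;

    ExtractingIterator(Iterator<S> source, Function<S, T> extractor) {
        this.source = source;
        this.extractor = extractor;
    }

    @Override public boolean hasNext() { return source.hasNext(); }
    @Override public T next() { return extractor.apply(source.next()); }
}

class ExtractDemo {
    /** The aggregation step: Integer in, Integer out. */
    static Integer sum(Iterator<Integer> counts) {
        int total = 0;
        while (counts.hasNext()) {
            total += counts.next();
        }
        return total;
    }

    /** Stacks the extractor underneath the aggregator, as the source of its values. */
    static Integer totalCount(List<MyHugeRecord> records) {
        Iterator<Integer> counts =
            new ExtractingIterator<MyHugeRecord, Integer>(records.iterator(), r -> r.count);
        return sum(counts);
    }

    public static void main(String[] args) {
        List<MyHugeRecord> records = Arrays.asList(
            new MyHugeRecord("first", 2),
            new MyHugeRecord("second", 3));
        System.out.println(totalCount(records));
    }
}
```

In a real iterator stack the extractor would additionally have to serialize its result into a Value, which is where the overhead discussed in the next paragraph comes from.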
>
>                                      This is a bit inefficient as you'd
>                                      have to serialize to a Value and then
>                                      immediately deserialize it in the
>                                      iterator above it. You could mitigate
>                                      this by exposing a method that would
>                                      get the extracted value before
>                                      serializing it.
>
>                                      This kind of counting also requires
>                                      client-side logic to do a final
>                                      combine operation, since the
>                                      aggregations from all the tservers are
>                                      partial results.
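
The client-side step is small. A minimal sketch, assuming each entry that comes back from a scan has already been decoded into a long partial count (PartialCountMerger is a hypothetical name, and the decoding itself is left out because it depends on how the iterator encodes its value):

```java
import java.util.Arrays;
import java.util.List;

class PartialCountMerger {
    /** Adds up the partial counts returned by the server-side iterators. */
    static long combine(List<Long> partialCounts) {
        long total = 0L;
        for (long partial : partialCounts) {
            total += partial;
        }
        return total;
    }

    public static void main(String[] args) {
        // Three partial results, as if three tablets had each reported a count.
        System.out.println(combine(Arrays.asList(2L, 3L, 7L)));
    }
}
```

Because addition is associative and commutative, it does not matter how many partial results arrive or in what order.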
>
>                                      I believe that CountingIterator is not
>                                      meant for user consumption, but I do
>                                      not know if it's related to your issue
>                                      in trying to use it from the shell.
>                                      Iterators set through the shell, in
>                                      previous versions of Accumulo, have a
>                                      requirement to implement
>                                      OptionDescriber. Many default
>                                      iterators do not implement this, and
>                                      thus can't be set in the shell.
>
>
>
>                                      On Mon, Jul 14, 2014 at 2:44 PM, Michael Moss
>                                      <michael.moss@gmail.com
>                                      <mailto:michael.moss@gmail.com>> wrote:
>
>                                          Hi, All.
>
>                                          I'm curious what the best practices
>                                          are around persisting complex
>                                          types/data in Accumulo (and
>                                          aggregating on fields within them).
>
>                                          Let's say I have (row, column family,
>                                          column qualifier, value):
>                                          "A" "foo" "" MyHugeAvroObject(count=2)
>                                          "A" "foo" "" MyHugeAvroObject(count=3)
>
>                                          Let's say MyHugeAvroObject has a field
>                                          "Integer count" with the values above.
>
>                                          What is the best way to aggregate on
>                                          row, column family, column qualifier
>                                          by count? In my above example:
>                                          "A" "foo" "" 5
>
>                                          The TypedValueCombiner.typedReduce
>                                          method can deserialize any "V", in my
>                                          case MyHugeAvroObject, but it needs to
>                                          return a value of type "V". What are
>                                          the best practices for deeply
>                                          nested/complex objects? It's not
>                                          always straightforward to map a
>                                          complex Avro type into Row -> Column
>                                          Family -> Column Qualifier.
>
>                                          Rather than using a
>                                          TypedValueCombiner, I looked into
>                                          using an Aggregator (which appears
>                                          deprecated as of 1.4), which appears
>                                          to let me return arbitrary values, but
>                                          despite running setiter, my aggregator
>                                          doesn't seem to do anything.
>
>                                          I also tried looking at implementing a
>                                          WrappingIterator, which also appears
>                                          to allow me to return arbitrary values
>                                          (such as Accumulo's CountingIterator),
>                                          but I get cryptic errors when trying
>                                          to setiter (I'm on Accumulo 1.6):
>
>                                          root@dev kyt> setiter -t kyt -scan -p 10 -n countingIter -class
>                                          org.apache.accumulo.core.iterators.system.CountingIterator
>                                          2014-07-14 11:12:55,623 [shell.Shell] ERROR:
>                                          java.lang.IllegalArgumentException:
>                                          org.apache.accumulo.core.iterators.system.CountingIterator
>
>                                          This is odd because other included
>                                          implementations of WrappingIterator
>                                          seem to work (perhaps the
>                                          implementation of CountingIterator is
>                                          dated):
>                                          root@dev kyt> setiter -t kyt -scan -p 10 -n deletingIterator -class
>                                          org.apache.accumulo.core.iterators.system.DeletingIterator
>                                          The iterator class does not implement
>                                          OptionDescriber. Consider this for
>                                          better iterator configuration using
>                                          this setiter command.
>                                          Name for iterator (enter to skip):
>
>                                          All in all, how can I aggregate simple
>                                          values, like counters, from rows with
>                                          complex Avro objects as Values without
>                                          having to add aggregation fields to
>                                          these Value objects?
>
>                                          Thanks!
>
>                                          -Mike
>
>
>
>
>
>
>
>
>
>
