accumulo-user mailing list archives

From Michael Moss <michael.m...@gmail.com>
Subject Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values
Date Tue, 15 Jul 2014 18:08:17 GMT
Cool. I'll write something up and share.

I'm curious how to get my Counter (WrappingIterator) implementation to
aggregate by row (which, for some reason, I assumed was default?)

Let's say I have rows (with CF="", CQ="", and the VersioningIterator off):
1 (Value1, Value 2...Value N)
2
3

How can my iterator return?
1 (Count of values 1..N)
2 (Count of values 1..N)
3 ...

I tried scan -b "1" -e "1" and it counts an individual row. But if I don't
specify anything, it returns,
3 (Count of all values across all rows)

Code:
http://pastebin.com/8xFNLHFS

Example:
root@dev pe> listiter -scan -t pojo
-
-    Iterator counter, scan scope options:
-        iteratorPriority = 10
-        iteratorClassName = iterators.Counter
-
root@dev pe> scan -b "1_1_20140101" -e "1_1_20140101"
1_1_20140101 : [public]    65

root@dev pe> scan -b "1_1_20140101" -e "3_9_20140727"
3_9_20140727 : [public]    100000

root@dev pe> scan
3_9_20140727 : [public]    100000
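
One way to get a count per row, sketched here outside Accumulo as a simplified model: because entries arrive sorted by row, the counting loop can stop as soon as the row changes and emit one (row, count) pair per row, instead of draining the whole source in a single pass. `RowCountSketch` and `countPerRow` are hypothetical names, and plain strings stand in for Accumulo's Key and Value, so treat this as an illustration of the grouping logic rather than a working iterator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model, not the Accumulo API: each String is the row ID of one
// entry, and the input is assumed to be sorted by row, as a scan returns it.
final class RowCountSketch {

    // Returns one "row count" line per distinct row, in input order.
    static List<String> countPerRow(List<String> sortedRows) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < sortedRows.size()) {
            String row = sortedRows.get(i);
            long count = 0;
            // Consume entries only while they belong to the current row.
            while (i < sortedRows.size() && sortedRows.get(i).equals(row)) {
                count++;
                i++;
            }
            out.add(row + " " + count);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("1", "1", "1", "2", "3", "3");
        for (String line : countPerRow(rows)) {
            System.out.println(line);
        }
    }
}
```

In a real iterator the inner loop would presumably live in next() and compare the row of getSource().getTopKey() against the row currently being counted; that part is my assumption and is not tested here.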


Thanks.

-Mike



On Tue, Jul 15, 2014 at 12:29 PM, Josh Elser <josh.elser@gmail.com> wrote:

> There's been some mention about a desire to rethink the Iterator interface
> as it has some deficiencies (notably the lack of a "cleanup" before the
> iterators are torn down), but no one has stated that they're actively
> working on this.
>
> Getting better documentation with respect to conventions: let us know where the
> Accumulo documentation falls short (and give us patches to fix the
> documentation :D). Additionally, write up your own findings from problems
> that you've run into. It's the entire community (users specifically) that
> we need to help encourage to grow.
>
> Even things as simple as "how do I count entries in an iterator" are big
> as you are now an "expert" on the subject :)
>
>
> On 7/15/14, 12:17 PM, Michael Moss wrote:
>
>> That worked ;) - Thanks!
>>
>> What a journey...
>>
>> I like Accumulo's architecture and promise, but the difficulty in
>> querying it (lack of documentation, conventions) is a major concern and
>> I'd imagine has to have an impact on adoption. I'm curious if there have
>> been any conversations around changing the interface around iterators
>> which are still confusing to me. Let me know how I can help!
>>
>>
>> On Tue, Jul 15, 2014 at 12:03 PM, William Slacum
>> <wilhelm.von.cloud@accumulo.net> wrote:
>>
>>     Herp... serves me right for not setting up a proper test case.
>>
>>     I think you need to override seek as well:
>>
>>     @Override
>>     public void seek(...) throws IOException {
>>        super.seek(...);
>>        next();
>>     }
>>
>>     I think I just realized the wrapping iterator could use some clean
>>     up, because this isn't obvious. Basically after the wrapping
>>     iterator's seek is called, it never calls the implementor's next()
>>     to actually set up the first top key and value.
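
To illustrate why the extra next() call matters, here is a minimal sketch using a stand-in class rather than the real WrappingIterator. `SeekPrimingSketch` and its `primeOnSeek` flag are hypothetical; the only point is that when the aggregate is computed inside next(), a seek() that never calls next() leaves hasTop() false, which matches a scan that returns zero entries.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Simplified stand-in for an aggregating iterator, not the Accumulo API.
// The aggregate (a count) is only computed inside next(), so seek() has to
// call next() once or hasTop() stays false and the scan returns nothing.
final class SeekPrimingSketch {
    private final List<String> data;
    private final boolean primeOnSeek;
    private Iterator<String> source;
    private Long top; // null means "no top value"

    SeekPrimingSketch(List<String> data, boolean primeOnSeek) {
        this.data = data;
        this.primeOnSeek = primeOnSeek;
    }

    // Models seek(): position the source, then optionally compute the first top.
    void seek() {
        source = data.iterator();
        top = null;
        if (primeOnSeek) {
            next();
        }
    }

    // Models next(): drain the source and expose the count as the single top value.
    void next() {
        top = null;
        long count = 0;
        while (source.hasNext()) {
            source.next();
            count++;
        }
        if (count > 0) {
            top = count;
        }
    }

    boolean hasTop() {
        return top != null;
    }

    long getTop() {
        return top;
    }

    public static void main(String[] args) {
        List<String> entries = Arrays.asList("a", "b");

        SeekPrimingSketch unprimed = new SeekPrimingSketch(entries, false);
        unprimed.seek();
        System.out.println("seek without next(): hasTop=" + unprimed.hasTop());

        SeekPrimingSketch primed = new SeekPrimingSketch(entries, true);
        primed.seek();
        System.out.println("seek with next(): hasTop=" + primed.hasTop());
    }
}
```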
>>
>>
>>
>>     On Tue, Jul 15, 2014 at 9:50 AM, Michael Moss
>>     <michael.moss@gmail.com> wrote:
>>
>>         I set up debugging and am rethrowing the exception. What's
>>         strange is that, although the iterator instance is properly
>>         set to iterators.Counter (my implementation), my breakpoints
>>         aren't being hit; only those in the parent classes
>>         (WrappingIterator and SortedKeyValueIterator) are.
>>
>>         I have two rows in the table, when I scan with no iterator:
>>         2014-07-15 06:46:26,577 [Audit   ] INFO : operation: permitted;
>>         user: root; action: scan; targetTable: pojo; authorizations:
>>         public,; range: (-inf,+inf); columns: []; iterators: [];
>>         iteratorOptions: {};
>>         2014-07-15 06:46:26,589 [tserver.TabletServer] DEBUG: ScanSess
>>         tid 10.0.2.15:45073 8 *2 entries* in 0.01 secs, nbTimes = [7 7 7.00 1]
>>
>>         When I scan with the iterator (0 entries?):
>>         2014-07-15 06:45:58,036 [Audit   ] INFO : operation: permitted;
>>         user: root; action: scan; targetTable: pojo; authorizations:
>>         public,; range: (-inf,+inf); columns: []; iterators: [];
>>         iteratorOptions: {};
>>         2014-07-15 06:45:58,047 [tserver.TabletServer] DEBUG: ScanSess
>>         tid 10.0.2.15:44992 8 *0 entries* in 0.01 secs, nbTimes = [6 6 6.00 1]
>>
>>         No exceptions otherwise. Really appreciate all the ongoing help.
>>
>>         Best,
>>
>>         -Mike
>>
>>
>>         On Mon, Jul 14, 2014 at 6:40 PM, William Slacum
>>         <wilhelm.von.cloud@accumulo.net> wrote:
>>
>>             Anything in your Tserver log? I think you should just
>>             rethrow that IOException on your source's next() method,
>>             since they're usually not recoverable (i.e., just make
>>             Counter#next throw IOException).
>>
>>
>>             On Mon, Jul 14, 2014 at 5:48 PM, Josh Elser
>>             <josh.elser@gmail.com> wrote:
>>
>>                 A quick sanity check is to make sure you have data in
>>                 the table and that you can read the data without your
>>                 iterator (I've thought I had a bug because I didn't have
>>                 proper visibilities more times than I'd like to admit).
>>
>>                 Alternatively, you can also enable remote-debugging via
>>                 Eclipse into the TabletServer which might help you
>>                 understand more of what's going on.
>>
>>                 Lots of articles on how to set this up [1]. In short,
>>                 add -Xdebug
>>                 -Xrunjdwp:transport=dt_socket,server=y,address=8000 to
>>                 ACCUMULO_TSERVER_OPTS in accumulo-env.sh, restart the
>>                 tserver, connect Eclipse to port 8000 via the Debug
>>                 configuration menu, set a breakpoint in your init, seek
>>                 and next methods, and `scan` in the shell.
>>
>>
>>                 [1]
>>                 http://javarevisited.blogspot.com/2011/02/how-to-setup-remote-debugging-in.html
>>
>>
>>                 On 7/14/14, 5:33 PM, Michael Moss wrote:
>>
>>                     Hmm...Still doesn't return anything from the shell.
>>
>>                     http://pastebin.com/ndRhspf8
>>
>>                     Any thoughts? What's the best way to debug these?
>>
>>
>>                     On Mon, Jul 14, 2014 at 5:14 PM, William Slacum
>>                     <wilhelm.von.cloud@accumulo.net> wrote:
>>
>>                          Ah, an artifact of me just willy-nilly writing an
>>                          iterator :) Any reference to `this.source` should be
>>                          replaced with `this.getSource()`. In `next()`, your
>>                          workaround ends up calling `this.hasTop()` as the
>>                          while loop condition. It will always return false
>>                          because two lines up we set `top_key` to null. We
>>                          need to make sure that the source iterator has a
>>                          top, because we want to read data from it. We'll
>>                          have to change the loop condition to
>>                          `while(this.getSource().hasTop())`. On line 38 of
>>                          your code we'll need to call
>>                          `this.getSource().next()` instead of `this.next()`.
>>
>>                          The iterator interface is documented, but there
>>                          hasn't been a definitive go-to for making one. I've
>>                          been drafting a blog post, but since it doesn't exist
>>                          yet, hopefully the following will suffice.
>>
>>                          The lifetime of an iterator is (usually) as follows:
>>
>>                          (1) A new instance is created via Class.newInstance
>>                          (so a no-args constructor is needed).
>>                          (2) init() is called. This allows users to configure
>>                          the iterator, set its source, and possibly check the
>>                          environment. We can also call `deepCopy` on the
>>                          source if we want to have multiple sources (we'd do
>>                          this if we wanted to do a merge read out of multiple
>>                          column families within a row).
>>                          (3) seek() is called. This gets our readers to the
>>                          correct positions in the data that are within the
>>                          scan range the user requested, as well as turning
>>                          column families on or off. The name should be
>>                          reminiscent of seeking to some key on disk.
>>                          (4) hasTop() is called. If true, that means we have
>>                          data, and the iterator has a key/value pair that can
>>                          be retrieved by calling getTopKey() and
>>                          getTopValue(). If false, we're done because there's
>>                          no data to return.
>>                          (5) next() is called. This will attempt to find a
>>                          new top key and value. We go back to (4) to see if
>>                          next was successful in finding a new top key/value
>>                          and will repeat until the client is satisfied or
>>                          hasTop() returns false.
>>
>>                          You can kind of make a state machine out of those
>>                          steps where we loop between (4) and (5) until
>>                          there's no data. There are more advanced workflows
>>                          where next() can be reading from multiple sources, as
>>                          well as seeking them to different positions in the
>>                          tablet.
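
The five lifecycle steps above can be sketched as a toy driver loop. Everything here is a hypothetical, much-simplified model: `ToyIterator`, `PassThrough`, and `scan` are made-up names, strings stand in for key/value pairs, and the real SortedKeyValueIterator methods take more arguments (options, environment, ranges, column families) than are shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the iterator lifecycle; names and signatures are illustrative
// and deliberately simpler than Accumulo's SortedKeyValueIterator.
final class LifecycleSketch {

    interface ToyIterator {
        void init(List<String> source);   // step 2
        void seek(String startInclusive); // step 3
        boolean hasTop();                 // step 4
        String getTop();
        void next();                      // step 5
    }

    // Passes through the entries of a sorted list, starting at the seek position.
    static final class PassThrough implements ToyIterator {
        private List<String> source;
        private int pos;

        public PassThrough() {} // step 1: instances are created reflectively

        @Override public void init(List<String> source) { this.source = source; }

        @Override public void seek(String startInclusive) {
            pos = 0;
            while (pos < source.size() && source.get(pos).compareTo(startInclusive) < 0) {
                pos++;
            }
        }

        @Override public boolean hasTop() { return pos < source.size(); }
        @Override public String getTop() { return source.get(pos); }
        @Override public void next() { pos++; }
    }

    // Drives an iterator through the steps: construct, init, seek, then loop
    // between hasTop() and next() until there is no more data.
    static List<String> scan(Class<? extends ToyIterator> type, List<String> data, String start) {
        ToyIterator it;
        try {
            it = type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("iterator needs an accessible no-args constructor", e);
        }
        it.init(data);
        it.seek(start);
        List<String> results = new ArrayList<>();
        while (it.hasTop()) {
            results.add(it.getTop());
            it.next();
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(scan(PassThrough.class, Arrays.asList("a", "b", "c", "d"), "b"));
    }
}
```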
>>
>>
>>                          On Mon, Jul 14, 2014 at 4:51 PM, Michael Moss
>>                          <michael.moss@gmail.com> wrote:
>>
>>                              Thanks, William. I was just hitting you up for an
>>                              example :)
>>
>>                              I adapted your pseudocode
>>                              (http://pastebin.com/ufPJq0g3), but noticed that
>>                              "this.source" in your example didn't have
>>                              visibility. Did I work around it correctly?
>>
>>                              When I add my iterator to my table and run scan
>>                              from the shell, it returns nothing - what should I
>>                              expect here? In general I've found the iterator
>>                              interface pretty confusing and haven't spent the
>>                              time wrapping my head around it yet. Any
>>                              documentation or examples (beyond what I could
>>                              find on the site or in the code) appreciated!
>>
>>                              root@dev> table pojo
>>                              root@dev pojo> listiter -scan -t pojo
>>                              -
>>                              -    Iterator counter, scan scope options:
>>                              -        iteratorPriority = 10
>>                              -        iteratorClassName = iterators.Counter
>>                              -
>>                              root@dev pojo> scan
>>                              root@dev pojo>
>>
>>
>>                              Best,
>>
>>                              -Mike
>>
>>
>>
>>
>>                              On Mon, Jul 14, 2014 at 4:07 PM, William Slacum
>>                              <wilhelm.von.cloud@accumulo.net> wrote:
>>
>>                                  For a bit of pseudocode, I'd probably make a
>>                                  class that did something akin to:
>>                                  http://pastebin.com/pKqAeeCR
>>
>>                                  I wrote that up real quick in a text editor--
>>                                  it won't compile or anything, but should point
>>                                  you in the right direction.
>>
>>
>>                                  On Mon, Jul 14, 2014 at 3:44 PM, William Slacum
>>                                  <wilhelm.von.cloud@accumulo.net> wrote:
>>
>>                                      Hi Mike!
>>
>>                                      The Combiner interface is only for aggregating
>>                                      keys within a single row. You can probably get
>>                                      away with implementing your combining logic in a
>>                                      WrappingIterator that reads across all the rows
>>                                      in a given tablet.
>>
>>                                      To do some combine/fold/reduce operation,
>>                                      Accumulo needs the input type to be the same as
>>                                      the output type. The combiner doesn't have a
>>                                      notion of a "present" type (as you'd see in
>>                                      something like Algebird's Groups), but you can
>>                                      use another iterator to perform your
>>                                      transformation.
>>
>>                                      If you wanted to extract the "count" field from
>>                                      your Avro object, you could write a new Iterator
>>                                      that took your Avro object, extracted the desired
>>                                      field, and returned it as its top value. You can
>>                                      then set this iterator as the source of the
>>                                      aggregator, either programmatically or by
>>                                      wrapping the source object passed to the
>>                                      aggregator in its SortedKeyValueIterator#init
>>                                      call.
>>
>>                                      This is a bit inefficient as you'd have to
>>                                      serialize to a Value and then immediately
>>                                      deserialize it in the iterator above it. You
>>                                      could mitigate this by exposing a method that
>>                                      would get the extracted value before serializing
>>                                      it.
>>
>>                                      This kind of counting also requires client side
>>                                      logic to do a final combine operation, since the
>>                                      aggregations from all the tservers are partial
>>                                      results.
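
A minimal sketch of that final client-side combine, assuming each partial result has already been decoded into a long on the client (the decoding depends on how the Value was encoded and is not shown). `PartialCountSketch` and `combine` are hypothetical names, and the sample numbers are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Client-side fold over partial counts; not tied to any Accumulo class.
final class PartialCountSketch {

    // Sums the partial counts returned by the server-side iterators into one total.
    static long combine(List<Long> partialCounts) {
        long total = 0;
        for (long partial : partialCounts) {
            total += partial;
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up partial counts, one per scanned tablet.
        List<Long> partials = Arrays.asList(40L, 25L, 35L);
        System.out.println("total = " + combine(partials));
    }
}
```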
>>
>>                                      I believe that CountingIterator is not meant for
>>                                      user consumption, but I do not know if it's
>>                                      related to your issue in trying to use it from
>>                                      the shell. Iterators set through the shell, in
>>                                      previous versions of Accumulo, have a requirement
>>                                      to implement OptionDescriber. Many default
>>                                      iterators do not implement this, and thus can't
>>                                      be set in the shell.
>>
>>
>>
>>                                      On Mon, Jul 14, 2014 at 2:44 PM, Michael Moss
>>                                      <michael.moss@gmail.com> wrote:
>>
>>                                          Hi, All.
>>
>>                                          I'm curious what the best practices are
>>                                          around persisting complex types/data in
>>                                          Accumulo (and aggregating on fields within
>>                                          them).
>>
>>                                          Let's say I have (row, column family,
>>                                          column qualifier, value):
>>                                          "A" "foo" "" MyHugeAvroObject(count=2)
>>                                          "A" "foo" "" MyHugeAvroObject(count=3)
>>
>>                                          Let's say MyHugeAvroObject has a field
>>                                          "Integer count" with the values above.
>>
>>                                          What is the best way to aggregate on row,
>>                                          column family, column qualifier by count?
>>                                          In my above example:
>>                                          "A" "foo" "" 5
>>
>>                                          The TypedValueCombiner.typedReduce method
>>                                          can deserialize any "V", in my case
>>                                          MyHugeAvroObject, but it needs to return a
>>                                          value of type "V". What are the best
>>                                          practices for deeply nested/complex
>>                                          objects? It's not always straightforward to
>>                                          map a complex Avro type into Row -> Column
>>                                          Family -> Column Qualifier.
>>
>>                                          Rather than using a TypedCombiner, I looked
>>                                          into using an Aggregator (which appears
>>                                          deprecated as of 1.4), which appears to let
>>                                          me return arbitrary values, but despite
>>                                          running setiter, my aggregator doesn't seem
>>                                          to do anything.
>>
>>                                          I also tried looking at implementing a
>>                                          WrappingIterator, which also appears to
>>                                          allow me to return arbitrary values (such as
>>                                          Accumulo's CountingIterator), but I get
>>                                          cryptic errors when trying to setiter. I'm
>>                                          on Accumulo 1.6:
>>
>>                                          root@dev kyt> setiter -t kyt -scan -p 10 -n
>>                                          countingIter -class
>>                                          org.apache.accumulo.core.iterators.system.CountingIterator
>>                                          2014-07-14 11:12:55,623 [shell.Shell] ERROR:
>>                                          java.lang.IllegalArgumentException:
>>                                          org.apache.accumulo.core.iterators.system.CountingIterator
>>
>>
>>                                          This is odd because other included
>>                                          implementations of WrappingIterator seem to
>>                                          work (perhaps the implementation of
>>                                          CountingIterator is dated):
>>                                          root@dev kyt> setiter -t kyt -scan -p 10 -n
>>                                          deletingIterator -class
>>                                          org.apache.accumulo.core.iterators.system.DeletingIterator
>>                                          The iterator class does not implement
>>                                          OptionDescriber. Consider this for better
>>                                          iterator configuration using this setiter
>>                                          command.
>>                                          Name for iterator (enter to skip):
>>
>>                                          All in all, how can I aggregate simple
>>                                          values, like counters, from rows with
>>                                          complex Avro objects as Values without
>>                                          having to add aggregation fields to these
>>                                          Value objects?
>>
>>                                          Thanks!
>>
>>                                          -Mike
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
