accumulo-user mailing list archives

From Josh Elser
Subject Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values
Date Mon, 14 Jul 2014 21:48:36 GMT
A quick sanity check is to make sure you have data in the table and that 
you can read the data without your iterator (I've thought I had a bug 
because I didn't have proper visibilities more times than I'd like to 
admit).

Alternatively, you can also enable remote-debugging via Eclipse into the 
TabletServer which might help you understand more of what's going on.

Lots of articles on how to set this up [1]. In short, add -Xdebug 
-Xrunjdwp:transport=dt_socket,server=y,address=8000 to 
ACCUMULO_TSERVER_OPTS in, restart the tserver, connect 
Eclipse to port 8000 via the Debug configuration menu, set breakpoints in 
your init, seek and next methods, and run `scan` in the shell.


On 7/14/14, 5:33 PM, Michael Moss wrote:
> Hmm...Still doesn't return anything from the shell.
> Any thoughts? What's the best way to debug these?
> On Mon, Jul 14, 2014 at 5:14 PM, William Slacum wrote:
>     Ah, an artifact of me just willy nilly writing an iterator :) Any
>     reference to `this.source` should be replaced with
>     `this.getSource()`. In `next()`, your workaround ends up calling
>     `this.hasTop()` as the while loop condition. It will always return
>     false because two lines up we set `top_key` to null. We need to make
>     sure that the source iterator has a top, because we want to read
>     data from it. We'll have to change the loop condition to
>     `while(this.getSource().hasTop())`. On line 38 of your code we'll
>     need to call `this.getSource().next()` instead of ``.
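The loop fix described above can be sketched in a self-contained way. This is only an illustration under stated assumptions: `Source`, `ListSource`, `Counter` and `countAll` are hypothetical stand-ins written for this example, not Accumulo's real `WrappingIterator` or `SortedKeyValueIterator`, and keys are plain strings rather than Accumulo `Key` objects.

```java
import java.util.List;

class CounterSketch {
    // Hypothetical stand-in for the source iterator; NOT the real Accumulo interface.
    interface Source {
        boolean hasTop();
        String getTopKey();
        void next();
    }

    // In-memory source over a fixed list of keys, for illustration only.
    static final class ListSource implements Source {
        private final List<String> keys;
        private int pos = 0;
        ListSource(List<String> keys) { this.keys = keys; }
        public boolean hasTop() { return pos < keys.size(); }
        public String getTopKey() { return keys.get(pos); }
        public void next() { pos++; }
    }

    // Counts every entry in the source and exposes a single (lastKey, count) pair.
    static final class Counter {
        private final Source source;
        private String topKey;   // null means "no top"
        private long topValue;

        Counter(Source source) { this.source = source; }

        Source getSource() { return source; }
        boolean hasTop() { return topKey != null; }
        String getTopKey() { return topKey; }
        long getTopValue() { return topValue; }

        void next() {
            topKey = null;   // from here on, our own hasTop() returns false...
            long count = 0;
            String last = null;
            // ...so the loop has to test the source, not this.hasTop().
            while (getSource().hasTop()) {
                last = getSource().getTopKey();
                count++;
                getSource().next();   // advance the source, not ourselves
            }
            if (last != null) {
                topKey = last;
                topValue = count;
            }
        }
    }

    // Convenience driver: wrap the keys, run one aggregation pass, return the count.
    static long countAll(List<String> keys) {
        Counter c = new Counter(new ListSource(keys));
        c.next();
        return c.hasTop() ? c.getTopValue() : 0;
    }

    public static void main(String[] args) {
        System.out.println(countAll(List.of("a", "b", "c")));
    }
}
```

In this sketch the driver calls `next()` directly; in a real iterator the same aggregation pass would presumably be triggered from `seek()` after the source has been positioned.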
>     The iterator interface is documented, but there hasn't been a
>     definitive go-to for making one. I've been drafting a blog post, but
>     since it doesn't exist yet, hopefully the following will suffice.
>     The lifetime of an iterator is (usually) as follows:
>     (1) A new instance is created via Class.newInstance (so a no-args
>     constructor is needed)
>     (2) Init is called. This allows users to configure the iterator, set
>     its source, and possibly check the environment. We can also call
>     `deepCopy` on the source if we want to have multiple sources (we'd
>     do this if we wanted to do a merge read out of multiple column
>     families within a row).
>     (3) seek() is called. This gets our readers to the correct positions
>     in the data that are within the scan range the user requested, as
>     well as turning column families on or off. The name should be
>     reminiscent of seeking to some key on disk.
>     (4) hasTop() is called. If true, that means we have data, and the
>     iterator has a key/value pair that can be retrieved by calling
>     getTopKey() and getTopValue(). If false, we're done because there's
>     no data to return.
>     (5) next() is called. This will attempt to find a new top key and
>     value. We go back to (4) to see if next was successful in finding a
>     new top key/value and will repeat until the client is satisfied or
>     hasTop() returns false.
>     You can kind of make a state machine out of those steps where we
>     loop between (4) and (5) until there's no data. There are more
>     advanced workflows where next() can be reading from multiple
>     sources, as well as seeking them to different positions in the tablet.
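The five lifecycle steps above can be walked through with a small self-contained sketch. Everything here is an assumption made for illustration: `KvIterator` and `MapIterator` are simplified stand-ins backed by a `TreeMap`, with string keys and an inclusive range, and they are not Accumulo's `SortedKeyValueIterator` (whose `init` and `seek` take different arguments).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

class LifecycleSketch {
    // Simplified stand-in whose method names mirror the steps above.
    interface KvIterator {
        void init(NavigableMap<String, String> data);   // step 2
        void seek(String startKey, String endKey);       // step 3 (inclusive range)
        boolean hasTop();                                // step 4
        String getTopKey();
        String getTopValue();
        void next();                                     // step 5
    }

    static final class MapIterator implements KvIterator {
        private NavigableMap<String, String> data;
        private Iterator<Map.Entry<String, String>> cursor;
        private Map.Entry<String, String> top;

        public MapIterator() { }   // step 1 requires a no-args constructor

        public void init(NavigableMap<String, String> data) { this.data = data; }

        public void seek(String startKey, String endKey) {
            cursor = data.subMap(startKey, true, endKey, true).entrySet().iterator();
            next();   // position on the first entry in range, if there is one
        }

        public boolean hasTop() { return top != null; }
        public String getTopKey() { return top.getKey(); }
        public String getTopValue() { return top.getValue(); }
        public void next() { top = cursor.hasNext() ? cursor.next() : null; }
    }

    // Drives one iterator through steps (1) to (5) and collects what it returns.
    static List<String> scan(NavigableMap<String, String> data, String start, String end) {
        KvIterator it;
        try {
            it = MapIterator.class.getDeclaredConstructor().newInstance();   // (1)
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
        it.init(data);                                   // (2)
        it.seek(start, end);                             // (3)
        List<String> out = new ArrayList<>();
        while (it.hasTop()) {                            // (4)
            out.add(it.getTopKey() + "=" + it.getTopValue());
            it.next();                                   // (5), then back to (4)
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> data = new TreeMap<>();
        data.put("a", "1");
        data.put("b", "2");
        data.put("c", "3");
        System.out.println(scan(data, "b", "c"));   // prints [b=2, c=3]
    }
}
```

The `while` loop in `scan` is the state machine between steps (4) and (5): it stops as soon as `hasTop()` reports that no data is left in the requested range.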
>     On Mon, Jul 14, 2014 at 4:51 PM, Michael Moss wrote:
>         Thanks, William. I was just hitting you up for an example :)
>         I adapted your pseudocode (, but
>         noticed that "this.source" in your example didn't have
>         visibility. Did I work around it correctly?
>         When I add my iterator to my table and run scan from the shell,
>         it returns nothing - what should I expect here? In general I've
>         found the iterator interface pretty confusing and haven't spent
>         the time wrapping my head around it yet. Any documentation or
>         examples (beyond what I could find on the site or in the code)
>         appreciated!
>         root@dev> table pojo
>         root@dev pojo> listiter -scan -t pojo
>         -
>         -    Iterator counter, scan scope options:
>         -        iteratorPriority = 10
>         -        iteratorClassName = iterators.Counter
>         -
>         root@dev pojo> scan
>         root@dev pojo>
>         Best,
>         -Mike
>         On Mon, Jul 14, 2014 at 4:07 PM, William Slacum wrote:
>             For a bit of pseudocode, I'd probably make a class that did
>             something akin to:
>             I wrote that up real quick in a text editor-- it won't
>             compile or anything, but should point you in the right
>             direction.
>             On Mon, Jul 14, 2014 at 3:44 PM, William Slacum wrote:
>                 Hi Mike!
>                 The Combiner interface is only for aggregating keys
>                 within a single row. You can probably get away with
>                 implementing your combining logic in a WrappingIterator
>                 that reads across all the rows in a given tablet.
>                 To do some combine/fold/reduce operation, Accumulo needs
>                 the input type to be the same as the output type. The
>                 combiner doesn't have a notion of a "present" type (as
>                 you'd see in something like Algebird's Groups), but you
>                 can use another iterator to perform your transformation.
>                 If you wanted to extract the "count" field from your
>                 Avro object, you could write a new Iterator that took
>                 your Avro object, extracted the desired field, and
>                 returned it as its top value. You can then set this
>                 iterator as the source of the aggregator, either
>                 programmatically or by wrapping the source object
>                 passed to the aggregator in its
>                 SortedKeyValueIterator#init call.
>                 This is a bit inefficient as you'd have to serialize to
>                 a Value and then immediately deserialize it in the
>                 iterator above it. You could mitigate this by exposing a
>                 method that would get the extracted value before
>                 serializing it.
>                 This kind of counting also requires client side logic to
>                 do a final combine operation, since the aggregations
>                 from all the tservers are partial results.
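The client-side final combine mentioned above might look roughly like the following. This is a minimal sketch under assumptions: each partial result is modeled as a plain (key, count) pair, the `"A/foo"` key string and the `combine` helper are made up for illustration, and no Accumulo client classes are used. A real client would read the partial results from a scanner and deserialize each value first.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FinalCombineSketch {
    // Sums partial counts that share the same key. Each tablet server is assumed
    // to have returned its own partial total for the keys it hosts.
    static Map<String, Long> combine(List<Map.Entry<String, Long>> partials) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, Long> partial : partials) {
            totals.merge(partial.getKey(), partial.getValue(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Two partial counts for the same hypothetical row/column, e.g. from two tservers.
        List<Map.Entry<String, Long>> partials = List.of(
                Map.entry("A/foo", 2L),
                Map.entry("A/foo", 3L));
        System.out.println(combine(partials));   // prints {A/foo=5}
    }
}
```

Summation works here because it is associative and commutative, so the order in which partial results arrive does not change the total.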
>                 I believe that CountingIterator is not meant for user
>                 consumption, but I do not know if it's related to your
>                 issue in trying to use it from the shell. Iterators set
>                 through the shell, in previous versions of Accumulo,
>                 have a requirement to implement OptionDescriber. Many
>                 default iterators do not implement this, and thus can't
>                 be set in the shell.
>                 On Mon, Jul 14, 2014 at 2:44 PM, Michael Moss wrote:
>                     Hi, All.
>                     I'm curious what the best practices are around
>                     persisting complex types/data in Accumulo (and
>                     aggregating on fields within them).
>                     Let's say I have (row, column family, column
>                     qualifier, value):
>                     "A" "foo" "" MyHugeAvroObject(count=2)
>                     "A" "foo" "" MyHugeAvroObject(count=3)
>                     Let's say MyHugeAvroObject has a field "Integer
>                     count" with the values above.
>                     What is the best way to aggregate on row, column
>                     family, column qualifier by count? In my above example:
>                     "A" "foo" "" 5
>                     The TypedValueCombiner.typedReduce method can
>                     deserialize any "V", in my case MyHugeAvroObject,
>                     but it needs to return a value of type "V". What are
>                     the best practices for deeply nested/complex
>                     objects? It's not always straightforward to map a
>                     complex Avro type into Row -> Column Family ->
>                     Column Qualifier.
>                     Rather than using a TypedCombiner, I looked into
>                     using an Aggregator (which appears deprecated as of
>                     1.4), which appears to let me return arbitrary
>                     values, but despite running setiter, my aggregator
>                     doesn't seem to do anything.
>                     I also tried looking at implementing a
>                     WrappingIterator, which also appears to allow me to
>                     return arbitrary values (such as Accumulo's
>                     CountingIterator), but I get cryptic errors when
>                     trying to setiter, I'm on Accumulo 1.6:
>                     root@dev kyt> setiter -t kyt -scan -p 10 -n
>                     countingIter -class
>                     org.apache.accumulo.core.iterators.system.CountingIterator
>                     2014-07-14 11:12:55,623 [shell.Shell] ERROR:
>                     java.lang.IllegalArgumentException:
>                     org.apache.accumulo.core.iterators.system.CountingIterator
>                     This is odd because other included implementations
>                     of WrappingIterator seem to work (perhaps the
>                     implementation of CountingIterator is dated):
>                     root@dev kyt> setiter -t kyt -scan -p 10 -n
>                     deletingIterator -class
>                     org.apache.accumulo.core.iterators.system.DeletingIterator
>                     The iterator class does not implement
>                     OptionDescriber. Consider this for better iterator
>                     configuration using this setiter command.
>                     Name for iterator (enter to skip):
>                     All in all, how can I aggregate simple values, like
>                     counters from rows with complex Avro objects as
>                     Values without having to add aggregation fields to
>                     these Value objects?
>                     Thanks!
>                     -Mike
