accumulo-user mailing list archives

From William Slacum <wilhelm.von.cl...@accumulo.net>
Subject Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values
Date Mon, 14 Jul 2014 22:40:46 GMT
Anything in your tserver log? I think you should just rethrow that
IOException from your source's next() method, since they're usually not
recoverable (i.e., just make Counter#next throw IOException).
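
For what it's worth, here is a minimal sketch of that idea. It deliberately
avoids the Accumulo classes so that it compiles on its own: SimpleSource,
FailingSource and RethrowingCounter are made-up stand-ins, not part of any
API. The only point is that next() declares IOException and lets it propagate
instead of catching it. The real SortedKeyValueIterator#next is already
declared to throw IOException, so the same shape should carry over.

```java
import java.io.IOException;

// Hypothetical stand-in for the wrapped source iterator; not an Accumulo type.
interface SimpleSource {
    boolean hasTop();
    void next() throws IOException;
}

// Counts entries by draining the source. Read failures are not swallowed.
class RethrowingCounter {
    private final SimpleSource source;
    private long count = 0;

    RethrowingCounter(SimpleSource source) {
        this.source = source;
    }

    // Declared to throw IOException rather than wrapping the loop in try/catch.
    void next() throws IOException {
        while (source.hasTop()) {
            count++;
            source.next(); // a failure here surfaces to the caller
        }
    }

    long getCount() {
        return count;
    }
}

// Demonstration source: always has a top, and fails on the n-th call to next().
class FailingSource implements SimpleSource {
    private final int failOnCall;
    private int calls = 0;

    FailingSource(int failOnCall) {
        this.failOnCall = failOnCall;
    }

    public boolean hasTop() {
        return true;
    }

    public void next() throws IOException {
        if (++calls >= failOnCall) {
            throw new IOException("simulated read failure");
        }
    }
}
```

With the exception propagating, a failed read shows up as an error rather
than as a scan that silently returns nothing.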


On Mon, Jul 14, 2014 at 5:48 PM, Josh Elser <josh.elser@gmail.com> wrote:

> A quick sanity check is to make sure you have data in the table and that
> you can read the data without your iterator (I've thought I had a bug
> because I didn't have proper visibilities more times than I'd like to
> admit).
>
> Alternatively, you can also enable remote-debugging via Eclipse into the
> TabletServer which might help you understand more of what's going on.
>
> Lots of articles on how to set this up [1]. In short, add -Xdebug
> -Xrunjdwp:transport=dt_socket,server=y,address=8000 to
> ACCUMULO_TSERVER_OPTS in accumulo-env.sh, restart the tserver, connect
> Eclipse to port 8000 via the Debug configuration menu, set breakpoints in
> your init, seek and next methods, and `scan` in the shell.
>
>
> [1] http://javarevisited.blogspot.com/2011/02/how-to-setup-remote-debugging-in.html
>
>
> On 7/14/14, 5:33 PM, Michael Moss wrote:
>
>> Hmm...Still doesn't return anything from the shell.
>>
>> http://pastebin.com/ndRhspf8
>>
>> Any thoughts? What's the best way to debug these?
>>
>>
>> On Mon, Jul 14, 2014 at 5:14 PM, William Slacum
>> <wilhelm.von.cloud@accumulo.net> wrote:
>>
>>     Ah, an artifact of me just willy nilly writing an iterator :) Any
>>     reference to `this.source` should be replaced with
>>     `this.getSource()`. In `next()`, your workaround ends up calling
>>     `this.hasTop()` as the while loop condition. It will always return
>>     false because two lines up we set `top_key` to null. We need to make
>>     sure that the source iterator has a top, because we want to read
>>     data from it. We'll have to change the loop condition to
>>     `while(this.getSource().hasTop())`. On line 38 of your code we'll
>>     need to call `this.getSource().next()` instead of `this.next()`.
>>
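As a rough sketch of the two fixes described above: the classes below are
invented stand-ins (ListSource and CounterSketch are hypothetical names, and
values are plain Strings), not Accumulo's WrappingIterator, so treat this as
an illustration of the loop shape only.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the wrapped source; not an Accumulo class.
class ListSource {
    private final Iterator<String> entries;
    private String top;

    ListSource(List<String> data) {
        this.entries = data.iterator();
        this.top = entries.hasNext() ? entries.next() : null;
    }

    boolean hasTop() {
        return top != null;
    }

    String getTopValue() {
        return top;
    }

    void next() {
        top = entries.hasNext() ? entries.next() : null;
    }
}

// Sketch of a counting wrapper with the corrected loop.
class CounterSketch {
    private final ListSource source;
    private Long topValue; // null means "no top", like top_key being null

    CounterSketch(ListSource source) {
        this.source = source;
    }

    ListSource getSource() {
        return source;
    }

    boolean hasTop() {
        return topValue != null;
    }

    Long getTopValue() {
        return topValue;
    }

    void next() {
        topValue = null; // from here on our own hasTop() is false
        long count = 0;
        // Fix 1: test the source's top, not this.hasTop().
        while (getSource().hasTop()) {
            count++;
            // Fix 2: advance the source, not this iterator.
            getSource().next();
        }
        if (count > 0) {
            topValue = count;
        }
    }
}
```

With a three-entry source, one call to next() leaves hasTop() true and
getTopValue() equal to 3. A second call finds the source exhausted and leaves
the wrapper without a top.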
>>     The iterator interface is documented, but there hasn't been a
>>     definitive go-to for making one. I've been drafting a blog post, but
>>     since it doesn't exist yet, hopefully the following will suffice.
>>
>>     The lifetime of an iterator is (usually) as follows:
>>
>>     (1) A new instance is created via Class.newInstance (so a no-args
>>     constructor is needed).
>>     (2) init() is called. This allows users to configure the iterator, set
>>     its source, and possibly check the environment. We can also call
>>     `deepCopy` on the source if we want to have multiple sources (we'd
>>     do this if we wanted to do a merge read out of multiple column
>>     families within a row).
>>     (3) seek() is called. This gets our readers to the correct positions
>>     in the data that are within the scan range the user requested, as
>>     well as turning column families on or off. The name should be
>>     reminiscent of seeking to some key on disk.
>>     (4) hasTop() is called. If true, that means we have data, and the
>>     iterator has a key/value pair that can be retrieved by calling
>>     getTopKey() and getTopValue(). If false, we're done because there's
>>     no data to return.
>>     (5) next() is called. This will attempt to find a new top key and
>>     value. We go back to (4) to see if next() was successful in finding a
>>     new top key/value and will repeat until the client is satisfied or
>>     hasTop() returns false.
>>
>>     You can kind of make a state machine out of those steps where we
>>     loop between (4) and (5) until there's no data. There are more
>>     advanced workflows where next() can be reading from multiple
>>     sources, as well as seeking them to different positions in the tablet.
>>
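The five steps above can be sketched as a small driver loop. Everything here
is a simplified assumption for illustration: MiniIterator is a hypothetical
cut-down analogue of SortedKeyValueIterator (String keys and values, a
TreeMap as the data source, and a seek() that takes a start key instead of a
Range plus column families), and LifecycleDriver merely plays the role the
tablet server would play.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified analogue of SortedKeyValueIterator.
interface MiniIterator {
    void init(TreeMap<String, String> data); // (2) configure, attach the source
    void seek(String startKey);               // (3) position at first key >= startKey
    boolean hasTop();                         // (4) is there a current entry?
    String getTopKey();
    String getTopValue();
    void next();                              // (5) advance to the next entry
}

class InMemoryIterator implements MiniIterator {
    private TreeMap<String, String> data;
    private Map.Entry<String, String> top;

    public InMemoryIterator() {} // (1) no-args constructor for reflective creation

    public void init(TreeMap<String, String> data) { this.data = data; }

    public void seek(String startKey) { top = data.ceilingEntry(startKey); }

    public boolean hasTop() { return top != null; }

    public String getTopKey() { return top.getKey(); }

    public String getTopValue() { return top.getValue(); }

    public void next() {
        top = (top == null) ? null : data.higherEntry(top.getKey());
    }
}

class LifecycleDriver {
    // Walks an iterator through steps (1) to (5) and collects what it returns.
    static List<String> scan(Class<? extends MiniIterator> type,
                             TreeMap<String, String> data, String startKey) {
        MiniIterator it;
        try {
            it = type.getDeclaredConstructor().newInstance(); // (1)
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("need an accessible no-args constructor", e);
        }
        it.init(data);     // (2)
        it.seek(startKey); // (3)
        List<String> out = new ArrayList<>();
        while (it.hasTop()) { // (4)
            out.add(it.getTopKey() + "=" + it.getTopValue());
            it.next();        // (5), then back to (4)
        }
        return out;
    }
}
```

If a scan comes back empty, this is the loop to keep in mind: either seek()
never produced a top, or next() cleared it without finding a new one.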
>>
>>     On Mon, Jul 14, 2014 at 4:51 PM, Michael Moss
>>     <michael.moss@gmail.com> wrote:
>>
>>         Thanks, William. I was just hitting you up for an example :)
>>
>>         I adapted your pseudocode (http://pastebin.com/ufPJq0g3), but
>>         noticed that "this.source" in your example didn't have
>>         visibility. Did I work around it correctly?
>>
>>         When I add my iterator to my table and run scan from the shell,
>>         it returns nothing - what should I expect here? In general I've
>>         found the iterator interface pretty confusing and haven't spent
>>         the time wrapping my head around it yet. Any documentation or
>>         examples (beyond what I could find on the site or in the code)
>>         appreciated!
>>
>>         root@dev> table pojo
>>         root@dev pojo> listiter -scan -t pojo
>>         -
>>         -    Iterator counter, scan scope options:
>>         -        iteratorPriority = 10
>>         -        iteratorClassName = iterators.Counter
>>         -
>>         root@dev pojo> scan
>>         root@dev pojo>
>>
>>
>>         Best,
>>
>>         -Mike
>>
>>
>>
>>
>>         On Mon, Jul 14, 2014 at 4:07 PM, William Slacum
>>         <wilhelm.von.cloud@accumulo.net> wrote:
>>
>>             For a bit of pseudocode, I'd probably make a class that did
>>             something akin to: http://pastebin.com/pKqAeeCR
>>
>>             I wrote that up real quick in a text editor-- it won't
>>             compile or anything, but should point you in the right
>>             direction.
>>
>>
>>             On Mon, Jul 14, 2014 at 3:44 PM, William Slacum
>>             <wilhelm.von.cloud@accumulo.net> wrote:
>>
>>                 Hi Mike!
>>
>>                 The Combiner interface is only for aggregating keys
>>                 within a single row. You can probably get away with
>>                 implementing your combining logic in a WrappingIterator
>>                 that reads across all the rows in a given tablet.
>>
>>                 To do some combine/fold/reduce operation, Accumulo needs
>>                 the input type to be the same as the output type. The
>>                 combiner doesn't have a notion of a "present" type (as
>>                 you'd see in something like Algebird's Groups), but you
>>                 can use another iterator to perform your transformation.
>>
>>                 If you wanted to extract the "count" field from your
>>                 Avro object, you could write a new Iterator that took
>>                 your Avro object, extracted the desired field, and
>>                 returned it as its top value. You can then set this
>>                 iterator as the source of the aggregator, either
>>                 programmatically or by wrapping the source object
>>                 passed to the aggregator in its
>>                 SortedKeyValueIterator#init call.
>>
>>                 This is a bit inefficient as you'd have to serialize to
>>                 a Value and then immediately deserialize it in the
>>                 iterator above it. You could mitigate this by exposing a
>>                 method that would get the extracted value before
>>                 serializing it.
>>
>>                 This kind of counting also requires client side logic to
>>                 do a final combine operation, since the aggregations
>>                 from all the tservers are partial results.
>>
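A minimal sketch of that final client-side combine, under the assumption that
each tablet's iterator emits one partial count per key and the client has
collected them into a list. PartialCountMerger and the String keys are
hypothetical; a real client would read the entries from a scanner and decode
the values first.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PartialCountMerger {
    // Sums partial counts that share a key. The entries stand in for the
    // key/value pairs a client-side scan would return from several tablets.
    static Map<String, Long> merge(List<Map.Entry<String, Long>> partials) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, Long> partial : partials) {
            totals.merge(partial.getKey(), partial.getValue(), Long::sum);
        }
        return totals;
    }
}
```

For example, partial counts of 2 and 3 reported under the same key merge into
a single total of 5.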
>>                 I believe that CountingIterator is not meant for user
>>                 consumption, but I do not know if it's related to your
>>                 issue in trying to use it from the shell. Iterators set
>>                 through the shell, in previous versions of Accumulo,
>>                 have a requirement to implement OptionDescriber. Many
>>                 default iterators do not implement this, and thus can't
>>                 be set in the shell.
>>
>>
>>
>>                 On Mon, Jul 14, 2014 at 2:44 PM, Michael Moss
>>                 <michael.moss@gmail.com> wrote:
>>
>>                     Hi, All.
>>
>>                     I'm curious what the best practices are around
>>                     persisting complex types/data in Accumulo (and
>>                     aggregating on fields within them).
>>
>>                     Let's say I have (row, column family, column
>>                     qualifier, value):
>>                     "A" "foo" "" MyHugeAvroObject(count=2)
>>                     "A" "foo" "" MyHugeAvroObject(count=3)
>>
>>                     Let's say MyHugeAvroObject has a field "Integer
>>                     count" with the values above.
>>
>>                     What is the best way to aggregate on row, column
>>                     family, column qualifier by count? In my above
>>                     example:
>>                     "A" "foo" "" 5
>>
>>                     The TypedValueCombiner.typedReduce method can
>>                     deserialize any "V", in my case MyHugeAvroObject,
>>                     but it needs to return a value of type "V". What are
>>                     the best practices for deeply nested/complex
>>                     objects? It's not always straightforward to map a
>>                     complex Avro type into Row -> Column Family ->
>>                     Column Qualifier.
>>
>>                     Rather than using a TypedValueCombiner, I looked into
>>                     using an Aggregator (which appears deprecated as of
>>                     1.4), which appears to let me return arbitrary
>>                     values, but despite running setiter, my aggregator
>>                     doesn't seem to do anything.
>>
>>                     I also tried looking at implementing a
>>                     WrappingIterator, which also appears to allow me to
>>                     return arbitrary values (such as Accumulo's
>>                     CountingIterator), but I get cryptic errors when
>>                     trying to setiter, I'm on Accumulo 1.6:
>>
>>                     root@dev kyt> setiter -t kyt -scan -p 10 -n countingIter -class org.apache.accumulo.core.iterators.system.CountingIterator
>>                     2014-07-14 11:12:55,623 [shell.Shell] ERROR: java.lang.IllegalArgumentException: org.apache.accumulo.core.iterators.system.CountingIterator
>>
>>                     This is odd because other included implementations
>>                     of WrappingIterator seem to work (perhaps the
>>                     implementation of CountingIterator is dated):
>>                     root@dev kyt> setiter -t kyt -scan -p 10 -n deletingIterator -class org.apache.accumulo.core.iterators.system.DeletingIterator
>>                     The iterator class does not implement
>>                     OptionDescriber. Consider this for better iterator
>>                     configuration using this setiter command.
>>                     Name for iterator (enter to skip):
>>
>>                     All in all, how can I aggregate simple values, like
>>                     counters from rows with complex Avro objects as
>>                     Values without having to add aggregation fields to
>>                     these Value objects?
>>
>>                     Thanks!
>>
>>                     -Mike
>>
>>
>>
>>
>>
>>
>>
