accumulo-user mailing list archives

From Michael Moss <>
Subject Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values
Date Mon, 14 Jul 2014 21:33:21 GMT
Hmm...Still doesn't return anything from the shell.

Any thoughts? What's the best way to debug these?

On Mon, Jul 14, 2014 at 5:14 PM, William Slacum <> wrote:

> Ah, an artifact of me just willy nilly writing an iterator :) Any
> reference to `this.source` should be replaced with `this.getSource()`. In
> `next()`, your workaround ends up calling `this.hasTop()` as the while loop
> condition. It will always return false because two lines up we set
> `top_key` to null. We need to make sure that the source iterator has a top,
> because we want to read data from it. We'll have to change the loop
> condition to `while(this.getSource().hasTop())`. On line 38 of your code
> we'll need to call `this.getSource().next()` instead of ``.
> The iterator interface is documented, but there hasn't been a definitive
> go-to for making one. I've been drafting a blog post, but since it doesn't
> exist yet, hopefully the following will suffice.
> The lifetime of an iterator is (usually) as follows:
> (1) A new instance is created via Class.newInstance (so a no-args
> constructor is needed).
> (2) init() is called. This allows users to configure the iterator, set its
> source, and possibly check the environment. We can also call `deepCopy` on
> the source if we want to have multiple sources (we'd do this if we wanted
> to do a merge read out of multiple column families within a row).
> (3) seek() is called. This gets our readers to the correct positions in
> the data that are within the scan range the user requested, as well as
> turning column families on or off. The name should be reminiscent of seeking
> to some key on disk.
> (4) hasTop() is called. If true, that means we have data, and the iterator
> has a key/value pair that can be retrieved by calling getTopKey() and
> getTopValue(). If false, we're done because there's no data to return.
> (5) next() is called. This will attempt to find a new top key and value. We
> go back to (4) to see if next() was successful in finding a new top key/value
> and will repeat until the client is satisfied or hasTop() returns false.
> You can kind of make a state machine out of those steps where we loop
> between (4) and (5) until there's no data. There are more advanced
> workflows where next() can be reading from multiple sources, as well as
> seeking them to different positions in the tablet.
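
The five steps above can be sketched in plain Java. This is only a rough model, not Accumulo code: `KvIterator`, `ListSource`, and `SummingIterator` are hypothetical stand-ins with string keys and long values, and the real iterator methods take more arguments (options, environment, range, column families). It also shows the loop-condition fix discussed earlier, where `next()` has to test the source's `hasTop()` rather than its own.

```java
import java.util.ArrayList;
import java.util.List;

class IteratorLifecycleSketch {

    // Hypothetical, cut-down iterator interface (an assumption for this sketch).
    interface KvIterator {
        void init(KvIterator source);   // step (2)
        void seek(String startRow);     // step (3)
        boolean hasTop();               // step (4)
        String getTopKey();
        long getTopValue();
        void next();                    // step (5)
    }

    // Leaf iterator over in-memory entries sorted by key; stands in for data on disk.
    static class ListSource implements KvIterator {
        private final String[] keys;
        private final long[] values;
        private int pos;

        ListSource(String[] keys, long[] values) {
            this.keys = keys;
            this.values = values;
            this.pos = keys.length; // no top until seek() is called
        }

        public void init(KvIterator source) { /* leaf: nothing to wrap */ }

        public void seek(String startRow) {
            pos = 0;
            while (pos < keys.length && keys[pos].compareTo(startRow) < 0) pos++;
        }

        public boolean hasTop() { return pos < keys.length; }
        public String getTopKey() { return keys[pos]; }
        public long getTopValue() { return values[pos]; }
        public void next() { pos++; }
    }

    // Sums the values of consecutive source entries that share a key.
    static class SummingIterator implements KvIterator {
        private KvIterator source;
        private String topKey;
        private long topValue;

        SummingIterator() { }           // step (1): no-args constructor

        public void init(KvIterator source) { this.source = source; }

        public void seek(String startRow) {
            source.seek(startRow);
            next();                     // load the first top, if any
        }

        public boolean hasTop() { return topKey != null; }
        public String getTopKey() { return topKey; }
        public long getTopValue() { return topValue; }

        public void next() {
            topKey = null;
            if (!source.hasTop()) return;
            String key = source.getTopKey();
            long sum = 0;
            // Test the SOURCE here: our own hasTop() is false, topKey was just cleared.
            while (source.hasTop() && source.getTopKey().equals(key)) {
                sum += source.getTopValue();
                source.next();
            }
            topKey = key;
            topValue = sum;
        }
    }

    // Drives the iterator the way a scan would: init, seek, then loop (4) and (5).
    static List<String> scan(KvIterator source, String startRow) {
        KvIterator it = new SummingIterator(); // done reflectively in step (1)
        it.init(source);
        it.seek(startRow);
        List<String> out = new ArrayList<>();
        while (it.hasTop()) {
            out.add(it.getTopKey() + "=" + it.getTopValue());
            it.next();
        }
        return out;
    }

    public static void main(String[] args) {
        KvIterator data = new ListSource(new String[] {"A", "A", "B"}, new long[] {2, 3, 7});
        System.out.println(scan(data, "A")); // prints [A=5, B=7]
    }
}
```

If `next()` looped on `this.hasTop()` instead, the loop body would never run and the scan would come back empty.
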
> On Mon, Jul 14, 2014 at 4:51 PM, Michael Moss <>
> wrote:
>> Thanks, William. I was just hitting you up for an example :)
>> I adapted your pseudocode (, but noticed
>> that "this.source" in your example didn't have visibility. Did I work
>> around it correctly?
>> When I add my iterator to my table and run scan from the shell, it
>> returns nothing - what should I expect here? In general I've found the
>> iterator interface pretty confusing and haven't spent the time wrapping my
>> head around it yet. Any documentation or examples (beyond what I could find
>> on the site or in the code) appreciated!
>> *root@dev> table pojo*
>> *root@dev pojo> listiter -scan -t pojo*
>> *-*
>> *-    Iterator counter, scan scope options:*
>> *-        iteratorPriority = 10*
>> *-        iteratorClassName = iterators.Counter*
>> *-*
>> *root@dev pojo> scan*
>> *root@dev pojo>*
>> Best,
>> -Mike
>> On Mon, Jul 14, 2014 at 4:07 PM, William Slacum <> wrote:
>>> For a bit of pseudocode, I'd probably make a class that did something
>>> akin to:
>>> I wrote that up real quick in a text editor-- it won't compile or
>>> anything, but should point you in the right direction.
>>> On Mon, Jul 14, 2014 at 3:44 PM, William Slacum <> wrote:
>>>> Hi Mike!
>>>> The Combiner interface is only for aggregating keys within a single
>>>> row. You can probably get away with implementing your combining logic in
>>>> a WrappingIterator that reads across all the rows in a given tablet.
>>>> To do some combine/fold/reduce operation, Accumulo needs the input type
>>>> to be the same as the output type. The combiner doesn't have a notion of
>>>> "present" type (as you'd see in something like Algebird's Groups), but you
>>>> can use another iterator to perform your transformation.
>>>> If you wanted to extract the "count" field from your Avro object, you
>>>> could write a new Iterator that took your Avro object, extracted the
>>>> desired field, and returned it as its top value. You can then set this
>>>> iterator as the source of the aggregator, either programmatically or via
>>>> wrapping the source object passed to the aggregator in its
>>>> SortedKeyValueIterator#init call.
>>>> This is a bit inefficient as you'd have to serialize to a Value and
>>>> then immediately deserialize it in the iterator above it. You could
>>>> mitigate this by exposing a method that would get the extracted value
>>>> before serializing it.
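
A rough sketch of that shortcut, in plain Java rather than against the Accumulo or Avro APIs. Everything here is hypothetical: `MyHugeObject` stands in for the Avro record, `CountExtractor` for the extracting iterator, and `getTopCount()` is the assumed extra method that hands back the field without a serialize/deserialize round trip.

```java
import java.nio.ByteBuffer;

class CountExtractorSketch {

    // Hypothetical stand-in for the large Avro record; only the field of interest is modelled.
    static class MyHugeObject {
        final int count;
        MyHugeObject(int count) { this.count = count; }
    }

    // Hypothetical extractor holding the current (already decoded) top record.
    static class CountExtractor {
        private final MyHugeObject top;
        CountExtractor(MyHugeObject top) { this.top = top; }

        // Generic path: the count is serialized, and the consumer must decode it again.
        byte[] getTopValue() {
            return ByteBuffer.allocate(Integer.BYTES).putInt(top.count).array();
        }

        // Shortcut for a consumer that knows its source is a CountExtractor.
        int getTopCount() {
            return top.count;
        }
    }

    // What the consuming iterator has to do on the generic path.
    static int decode(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt();
    }

    public static void main(String[] args) {
        CountExtractor extractor = new CountExtractor(new MyHugeObject(3));
        int viaBytes = decode(extractor.getTopValue()); // encode, then decode
        int direct = extractor.getTopCount();           // no round trip
        System.out.println(viaBytes + " " + direct);    // prints 3 3
    }
}
```

Both paths give the same number; the second just skips the byte array. The trade-off is that the consumer becomes coupled to the concrete extractor type.
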
>>>> This kind of counting also requires client side logic to do a final
>>>> combine operation, since the aggregations from all the tservers are partial
>>>> results.
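
The client-side step could look roughly like the following. This is a sketch under assumptions: each map stands for the partial counts that came back from one tablet server, and the string keys such as "A:foo" are made up for illustration. No Accumulo client calls are shown.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PartialCountMerge {

    // Folds per-server partial counts into one total per key.
    static Map<String, Long> merge(List<Map<String, Long>> partials) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map<String, Long> partial : partials) {
            for (Map.Entry<String, Long> e : partial.entrySet()) {
                // Add to the running total, or start one if the key is new.
                totals.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map<String, Long>> partials = List.of(
            Map.of("A:foo", 2L),               // partial result from one server
            Map.of("A:foo", 3L, "B:foo", 1L)); // partial result from another
        System.out.println(merge(partials));   // prints {A:foo=5, B:foo=1}
    }
}
```

This works because addition is associative and commutative, so the order in which partial results arrive does not matter.
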
>>>> I believe that CountingIterator is not meant for user consumption, but
>>>> I do not know if it's related to your issue in trying to use it from the
>>>> shell. Iterators set through the shell, in previous versions of Accumulo,
>>>> have a requirement to implement OptionDescriber. Many default iterators do
>>>> not implement this, and thus can't be set in the shell.
>>>> On Mon, Jul 14, 2014 at 2:44 PM, Michael Moss <>
>>>> wrote:
>>>>> Hi, All.
>>>>> I'm curious what the best practices are around persisting complex
>>>>> types/data in Accumulo (and aggregating on fields within them).
>>>>> Let's say I have (row, column family, column qualifier, value):
>>>>> "A" "foo" "" MyHugeAvroObject(count=2)
>>>>> "A" "foo" "" MyHugeAvroObject(count=3)
>>>>> Let's say MyHugeAvroObject has a field "Integer count" with the values
>>>>> above.
>>>>> What is the best way to aggregate on row, column family, column
>>>>> qualifier by count? In my above example:
>>>>> "A" "foo" "" 5
>>>>> The TypedValueCombiner.typedReduce method can deserialize any "V", in
>>>>> my case MyHugeAvroObject, but it needs to return a value of type "V". What
>>>>> are the best practices for deeply nested/complex objects? It's not always
>>>>> straightforward to map a complex Avro type into Row -> Column Family ->
>>>>> Column Qualifier.
>>>>> Rather than using a TypedCombiner, I looked into using an Aggregator
>>>>> (which appears deprecated as of 1.4), which appears to let me return
>>>>> arbitrary values, but despite running setiter, my aggregator doesn't seem
>>>>> to do anything.
>>>>> I also tried looking at implementing a WrappingIterator, which also
>>>>> appears to allow me to return arbitrary values (such as Accumulo's
>>>>> CountingIterator), but I get cryptic errors when trying to setiter. I'm on
>>>>> Accumulo 1.6:
>>>>> root@dev kyt> setiter -t kyt -scan -p 10 -n countingIter -class
>>>>> org.apache.accumulo.core.iterators.system.CountingIterator
>>>>> 2014-07-14 11:12:55,623 [shell.Shell] ERROR:
>>>>> java.lang.IllegalArgumentException:
>>>>> org.apache.accumulo.core.iterators.system.CountingIterator
>>>>> This is odd because other included implementations of WrappingIterator
>>>>> seem to work (perhaps the implementation of CountingIterator is dated):
>>>>> root@dev kyt> setiter -t kyt -scan -p 10 -n deletingIterator -class
>>>>> org.apache.accumulo.core.iterators.system.DeletingIterator
>>>>> The iterator class does not implement OptionDescriber. Consider this
>>>>> for better iterator configuration using this setiter command.
>>>>> Name for iterator (enter to skip):
>>>>> All in all, how can I aggregate simple values, like counters, from rows
>>>>> with complex Avro objects as Values without having to add aggregation
>>>>> fields to these Value objects?
>>>>> Thanks!
>>>>> -Mike
