accumulo-user mailing list archives

From William Slacum <wilhelm.von.cl...@accumulo.net>
Subject Re: Iterating/Aggregating/Combining Complex (Java POJO/Avro) Values
Date Mon, 14 Jul 2014 21:14:51 GMT
Ah, an artifact of me just willy nilly writing an iterator :) Any reference
to `this.source` should be replaced with `this.getSource()`. In `next()`,
your workaround ends up calling `this.hasTop()` as the while loop
condition. It will always return false because two lines up we set
`top_key` to null. We need to make sure that the source iterator has a top,
because we want to read data from it. We'll have to change the loop
condition to `while(this.getSource().hasTop())`. On line 38 of your code
we'll need to call `this.getSource().next()` instead of `this.next()`.
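
To make the fix concrete, here is a minimal, self-contained sketch of the corrected loop. It does not use Accumulo's real classes: `Source`, `ListSource`, and `SummingWrapper` are hypothetical stand-ins with String keys and long values, and `topKey`/`topValue` are assumed names mirroring `top_key` from the pseudocode. The only point is the shape of `next()`: both the loop condition and the advance go through the source, never through `this`.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the wrapped source iterator (NOT Accumulo's API).
interface Source {
    boolean hasTop();
    String getTopKey();
    long getTopValue();
    void next();
}

// Simple in-memory source over an already sorted list of entries.
class ListSource implements Source {
    private final Iterator<Map.Entry<String, Long>> it;
    private Map.Entry<String, Long> top;

    ListSource(List<Map.Entry<String, Long>> entries) {
        this.it = entries.iterator();
        next(); // position on the first entry, if any
    }

    public boolean hasTop() { return top != null; }
    public String getTopKey() { return top.getKey(); }
    public long getTopValue() { return top.getValue(); }
    public void next() { top = it.hasNext() ? it.next() : null; }
}

// Sums every value in the source and exposes a single key/value pair.
class SummingWrapper {
    private final Source source;
    private String topKey;
    private long topValue;

    SummingWrapper(Source source) { this.source = source; }

    Source getSource() { return source; }
    boolean hasTop() { return topKey != null; }
    String getTopKey() { return topKey; }
    long getTopValue() { return topValue; }

    void next() {
        topKey = null;
        long sum = 0;
        // Buggy version: while (this.hasTop()) is always false here,
        // because topKey was cleared on the line above.
        while (getSource().hasTop()) {
            topKey = getSource().getTopKey(); // the last key read becomes the top key
            sum += getSource().getTopValue();
            getSource().next();               // advance the source, not this iterator
        }
        topValue = sum;
    }
}

class NextLoopSketch {
    public static void main(String[] args) {
        Source src = new ListSource(List.of(Map.entry("A", 2L), Map.entry("A", 3L)));
        SummingWrapper w = new SummingWrapper(src);
        w.next(); // stands in for the work seek() would trigger
        System.out.println(w.hasTop() + " " + w.getTopKey() + " " + w.getTopValue());
        w.next(); // source is drained, so no new top is found
        System.out.println(w.hasTop());
    }
}
```

The second call to `next()` leaves `topKey` null because the source is already exhausted, which is how the wrapper signals that it has nothing more to return.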

The iterator interface is documented, but there hasn't been a definitive
go-to for making one. I've been drafting a blog post, but since it doesn't
exist yet, hopefully the following will suffice.

The lifetime of an iterator is (usually) as follows:

(1) A new instance is created via Class.newInstance (so a no-args
constructor is needed)
(2) init() is called. This allows users to configure the iterator, set its
source, and possibly check the environment. We can also call `deepCopy` on
the source if we want to have multiple sources (we'd do this if we wanted
to do a merge read out of multiple column families within a row).
(3) seek() is called. This gets our readers to the correct positions in the
data that are within the scan range the user requested, as well as turning
column families on or off. The name should be reminiscent of seeking to some
key on disk.
(4) hasTop() is called. If true, that means we have data, and the iterator
has a key/value pair that can be retrieved by calling getTopKey() and
getTopValue(). If false, we're done because there's no data to return.
(5) next() is called. This will attempt to find a new top key and value. We go
back to (4) to see if next was successful in finding a new top key/value
and will repeat until the client is satisfied or hasTop() returns false.

You can kind of make a state machine out of those steps where we loop
between (4) and (5) until there's no data. There are more advanced
workflows where next() can be reading from multiple sources, as well as
seeking them to different positions in the tablet.
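
As a rough illustration of that state machine, the sketch below drives a toy iterator through steps (1) to (5). It is a simplified model, not Accumulo's real interface: `MiniIterator`, `PassThrough`, and `LifecycleSketch` are hypothetical names, keys and values are plain Strings, and the "source" is just a sorted map. It also uses `getDeclaredConstructor().newInstance()` rather than `Class.newInstance`, since the latter is deprecated in newer Java versions.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical, simplified stand-in for the iterator contract (NOT Accumulo's real interface).
interface MiniIterator {
    void init(NavigableMap<String, String> data, Map<String, String> options);
    void seek(String startKey, String endKey); // inclusive start, exclusive end
    boolean hasTop();
    String getTopKey();
    String getTopValue();
    void next();
}

class PassThrough implements MiniIterator {
    private NavigableMap<String, String> data;
    private Iterator<Map.Entry<String, String>> cursor;
    private Map.Entry<String, String> top;

    public PassThrough() {} // (1) reflective creation needs a no-args constructor

    public void init(NavigableMap<String, String> data, Map<String, String> options) {
        this.data = data; // (2) remember the source; options are unused in this toy
    }

    public void seek(String startKey, String endKey) {
        // (3) position on the first entry inside the requested range
        cursor = data.subMap(startKey, true, endKey, false).entrySet().iterator();
        next();
    }

    public boolean hasTop() { return top != null; }      // (4)
    public String getTopKey() { return top.getKey(); }
    public String getTopValue() { return top.getValue(); }
    public void next() { top = cursor.hasNext() ? cursor.next() : null; } // (5)
}

class LifecycleSketch {
    static List<String> scan(Class<? extends MiniIterator> cls,
                             NavigableMap<String, String> data, String start, String end) {
        MiniIterator it;
        try {
            it = cls.getDeclaredConstructor().newInstance(); // (1)
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("iterator needs an accessible no-args constructor", e);
        }
        it.init(data, Map.of());  // (2)
        it.seek(start, end);      // (3)
        List<String> out = new ArrayList<>();
        while (it.hasTop()) {     // (4)
            out.add(it.getTopKey() + "=" + it.getTopValue());
            it.next();            // (5), then back to (4)
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> data =
            new TreeMap<>(Map.of("a", "1", "b", "2", "c", "3", "d", "4"));
        System.out.println(scan(PassThrough.class, data, "b", "d"));
    }
}
```

The `while` loop in `scan` is the (4)/(5) cycle: the caller only ever asks `hasTop()`, reads the top, and calls `next()`, so anything fancier (multiple sources, re-seeking) has to happen inside the iterator's own `seek()` and `next()`.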


On Mon, Jul 14, 2014 at 4:51 PM, Michael Moss <michael.moss@gmail.com>
wrote:

> Thanks, William. I was just hitting you up for an example :)
>
> I adapted your pseudocode (http://pastebin.com/ufPJq0g3), but noticed
> that "this.source" in your example didn't have visibility. Did I work
> around it correctly?
>
> When I add my iterator to my table and run scan from the shell, it returns
> nothing - what should I expect here? In general I've found the iterator
> interface pretty confusing and haven't spent the time wrapping my head
> around it yet. Any documentation or examples (beyond what I could find on
> the site or in the code) appreciated!
>
> root@dev> table pojo
> root@dev pojo> listiter -scan -t pojo
> -
> -    Iterator counter, scan scope options:
> -        iteratorPriority = 10
> -        iteratorClassName = iterators.Counter
> -
> root@dev pojo> scan
> root@dev pojo>
>
> Best,
>
> -Mike
>
>
>
>
> On Mon, Jul 14, 2014 at 4:07 PM, William Slacum <
> wilhelm.von.cloud@accumulo.net> wrote:
>
>> For a bit of pseudocode, I'd probably make a class that did something
>> akin to: http://pastebin.com/pKqAeeCR
>>
>> I wrote that up real quick in a text editor-- it won't compile or
>> anything, but should point you in the right direction.
>>
>>
>> On Mon, Jul 14, 2014 at 3:44 PM, William Slacum <
>> wilhelm.von.cloud@accumulo.net> wrote:
>>
>>> Hi Mike!
>>>
>>> The Combiner interface is only for aggregating keys within a single row.
>>> You can probably get away with implementing your combining logic in a
>>> WrappingIterator that reads across all the rows in a given tablet.
>>>
>>> To do some combine/fold/reduce operation, Accumulo needs the input type
>>> to be the same as the output type. The combiner doesn't have a notion of a
>>> "present" type (as you'd see in something like Algebird's Groups), but you
>>> can use another iterator to perform your transformation.
>>>
>>> If you wanted to extract the "count" field from your Avro object, you
>>> could write a new Iterator that took your Avro object, extracted the
>>> desired field, and returned it as its top value. You can then set this
>>> iterator as the source of the aggregator, either programmatically or by
>>> wrapping the source object passed to the aggregator in its
>>> SortedKeyValueIterator#init call.
>>>
>>> This is a bit inefficient as you'd have to serialize to a Value and then
>>> immediately deserialize it in the iterator above it. You could mitigate
>>> this by exposing a method that would get the extracted value before
>>> serializing it.
>>>
>>> This kind of counting also requires client side logic to do a final
>>> combine operation, since the aggregations from all the tservers are partial
>>> results.
>>>
>>> I believe that CountingIterator is not meant for user consumption, but I
>>> do not know if it's related to your issue in trying to use it from the
>>> shell. Iterators set through the shell, in previous versions of Accumulo,
>>> have a requirement to implement OptionDescriber. Many default iterators do
>>> not implement this, and thus can't be set in the shell.
>>>
>>>
>>>
>>> On Mon, Jul 14, 2014 at 2:44 PM, Michael Moss <michael.moss@gmail.com>
>>> wrote:
>>>
>>>> Hi, All.
>>>>
>>>> I'm curious what the best practices are around persisting complex
>>>> types/data in Accumulo (and aggregating on fields within them).
>>>>
>>>> Let's say I have (row, column family, column qualifier, value):
>>>> "A" "foo" "" MyHugeAvroObject(count=2)
>>>> "A" "foo" "" MyHugeAvroObject(count=3)
>>>>
>>>> Let's say MyHugeAvroObject has a field "Integer count" with the values
>>>> above.
>>>>
>>>> What is the best way to aggregate on row, column family, column
>>>> qualifier by count? In my above example:
>>>> "A" "foo" "" 5
>>>>
>>>> The TypedValueCombiner.typedReduce method can deserialize any "V", in
>>>> my case MyHugeAvroObject, but it needs to return a value of type "V". What
>>>> are the best practices for deeply nested/complex objects? It's not always
>>>> straightforward to map a complex Avro type into Row -> Column Family ->
>>>> Column Qualifier.
>>>>
>>>> Rather than using a TypedCombiner, I looked into using an Aggregator
>>>> (which appears deprecated as of 1.4), which appears to let me return
>>>> arbitrary values, but despite running setiter, my aggregator doesn't seem
>>>> to do anything.
>>>>
>>>> I also tried looking at implementing a WrappingIterator, which also
>>>> appears to allow me to return arbitrary values (such as Accumulo's
>>>> CountingIterator), but I get cryptic errors when trying to setiter, I'm on
>>>> Accumulo 1.6:
>>>>
>>>> root@dev kyt> setiter -t kyt -scan -p 10 -n countingIter -class
>>>> org.apache.accumulo.core.iterators.system.CountingIterator
>>>> 2014-07-14 11:12:55,623 [shell.Shell] ERROR:
>>>> java.lang.IllegalArgumentException:
>>>> org.apache.accumulo.core.iterators.system.CountingIterator
>>>>
>>>> This is odd because other included implementations of WrappingIterator
>>>> seem to work (perhaps the implementation of CountingIterator is dated):
>>>> root@dev kyt> setiter -t kyt -scan -p 10 -n deletingIterator -class
>>>> org.apache.accumulo.core.iterators.system.DeletingIterator
>>>> The iterator class does not implement OptionDescriber. Consider this
>>>> for better iterator configuration using this setiter command.
>>>> Name for iterator (enter to skip):
>>>>
>>>> All in all, how can I aggregate simple values, like counters from rows
>>>> with complex Avro objects as Values without having to add aggregations
>>>> fields to these Value objects?
>>>>
>>>> Thanks!
>>>>
>>>> -Mike
>>>>
>>>
>>>
>>
>
