accumulo-user mailing list archives

From Adam Fuchs <scubafu...@gmail.com>
Subject Re: Iterators adding data: IteratorEnvironment.registerSideChannel?
Date Mon, 16 Feb 2015 16:34:55 GMT
Dylan,

If I recall correctly (which I give about 30% odds), the original purpose
of the side channel was to split up things like delete tombstone entries
from "regular" entries so that other iterators sitting on top of a
bifurcating iterator wouldn't have to handle the special tombstone
preservation logic. This worked in theory, but it never really caught on.
I'm not sure any operational code is calling the registerSideChannel method
right now, so you're sort of in pioneering territory. That said, this looks
like it should work as you described it.

Can you describe why you want to use a side channel instead of implementing
the merge in your own iterator (e.g. subclassing MultiIterator and
overriding the init method)? This has implications for composability with
other iterators, since downstream iterators would not see anything sent to
the side channel but they would see things merged and returned by a
MultiIterator.
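
To make the composability point concrete, here is a minimal sketch in plain Java. It does not use any Accumulo classes: keys are plain strings, the "downstream iterator" is just a function applied to each key, and the names SideChannelSketch, mergeInStack and sideChannel are made up for illustration. It assumes, per the description above, that side-channel entries are merged in after the configured iterators have run, and that the downstream function preserves key order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Function;

public class SideChannelSketch {

    /** One sorted source together with its current head key (hypothetical helper). */
    private static final class Head implements Comparable<Head> {
        final String key;
        final Iterator<String> rest;

        Head(String key, Iterator<String> rest) {
            this.key = key;
            this.rest = rest;
        }

        @Override
        public int compareTo(Head other) {
            return key.compareTo(other.key);
        }
    }

    /** K-way merge of already-sorted sources; the output is sorted only if every input is. */
    static List<String> merge(List<List<String>> sources) {
        PriorityQueue<Head> heap = new PriorityQueue<>();
        for (List<String> source : sources) {
            Iterator<String> it = source.iterator();
            if (it.hasNext()) {
                heap.add(new Head(it.next(), it));
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head smallest = heap.poll();
            out.add(smallest.key);
            if (smallest.rest.hasNext()) {
                heap.add(new Head(smallest.rest.next(), smallest.rest));
            }
        }
        return out;
    }

    /** Stand-in for a downstream iterator: applies f to every key it is handed. */
    static List<String> apply(Function<String, String> f, List<String> in) {
        List<String> out = new ArrayList<>();
        for (String key : in) {
            out.add(f.apply(key));
        }
        return out;
    }

    /** Merge inside the stack: downstream sees stored and injected entries alike. */
    static List<String> mergeInStack(List<String> stored, List<String> injected,
                                     Function<String, String> downstream) {
        return apply(downstream, merge(Arrays.asList(stored, injected)));
    }

    /** Side channel: downstream sees only stored entries; injected ones are merged in afterwards. */
    static List<String> sideChannel(List<String> stored, List<String> injected,
                                    Function<String, String> downstream) {
        return merge(Arrays.asList(apply(downstream, stored), injected));
    }

    public static void main(String[] args) {
        List<String> stored = Arrays.asList("row1", "row3");
        List<String> injected = Arrays.asList("row2");
        Function<String, String> mark = key -> key + "*"; // marks every key the downstream step saw

        System.out.println(mergeInStack(stored, injected, mark)); // [row1*, row2*, row3*]
        System.out.println(sideChannel(stored, injected, mark));  // [row1*, row2, row3*]
    }
}
```

In the second line of output, row2 carries no mark: the downstream step never saw it, which is the trade-off in question.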

Adam
 On Feb 16, 2015 3:18 AM, "Dylan Hutchison" <dhutchis@stevens.edu> wrote:

>> If you can do a merge sort insertion, then you can guarantee order and
>> it's fine.
>>
> Yep, I guarantee the iterator we add as a side channel will emit tuples in
> sorted order.
>
> On a suggestion from David Medinets, I modified my testing code to use a
> MiniAccumuloCluster set to 2 tablet servers.  I then set a table split on
> "row3" before launching the compaction.  The result looks good.  Here is
> output from a run on a local Accumulo instance.  Note that we write more
> values than we read.
>
> 2015-02-16 02:44:51,125 [tserver.Tablet] DEBUG: Starting MajC k;row3<
> (USER) [hdfs://localhost:9000/accumulo/tables/k/t-00000g4/F00000g5.rf] -->
> hdfs://localhost:9000/accumulo/tables/k/t-00000g4/A00000g7.rf_tmp
>  [name:InjectIterator, priority:15,
> class:edu.mit.ll.graphulo.InjectIterator, properties:{}]
> 2015-02-16 02:44:51,127 [tserver.Tablet] DEBUG: Starting MajC k<;row3
> (USER) [hdfs://localhost:9000/accumulo/tables/k/default_tablet/F00000g6.rf]
> --> hdfs://localhost:9000/accumulo/tables/k/default_tablet/A00000g8.rf_tmp
>  [name:InjectIterator, priority:15,
> class:edu.mit.ll.graphulo.InjectIterator, properties:{}]
> 2015-02-16 02:44:51,190 [tserver.Compactor] DEBUG: *Compaction k<;row3 2
> read | 4 written* |    111 entries/sec |  0.018 secs
> 2015-02-16 02:44:51,194 [tserver.Compactor] DEBUG: *Compaction k;row3< 1
> read | 4 written* |     43 entries/sec |  0.023 secs
>
>
> In addition, output from the DebugIterator looks as expected.  There is a
> re-seek after reading the first tablet to the key after the last entry
> returned in the first tablet.
>
> DEBUG:
> init(org.apache.accumulo.core.iterators.system.SynchronizedIterator@15085e63,
> {}, org.apache.accumulo.tserver.TabletIteratorEnvironment@586cc05e)
> DEBUG: 0x1C2BFB13 seek((-inf,+inf), [], false)
>
> ... <snipped logs>
>
> DEBUG:
> init(org.apache.accumulo.core.iterators.system.SynchronizedIterator@2b048c59,
> {}, org.apache.accumulo.tserver.TabletIteratorEnvironment@379a3d1f)
> DEBUG: 0x5946E74B seek([row2 colF3:colQ3 [] 9223372036854775807
> false,+inf), [], false)
>
>
> It seems the side channel strategy will hold up.  We have opened a new
> world of Accumulo-foo.  Of course, the real test is a multi-node instance
> with more than 10 entries of data.
>
> Regards, Dylan
>
>
> On Sun, Feb 15, 2015 at 11:17 PM, Andrew Wells <awells@clearedgeit.com>
> wrote:
>
>> The main issue with adding data in an iterator is order. If you can
>> do a merge sort insertion, then you can guarantee order and it's fine. But
>> if you are inserting based on input, you cannot guarantee order, and it can
>> only be done in a scan iterator.
>>  On Feb 15, 2015 8:03 PM, "Dylan Hutchison" <dhutchis@stevens.edu> wrote:
>>
>>> Hello all,
>>>
>>> I've been toying with the registerSideChannel(iter)
>>> <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/iterators/IteratorEnvironment.html#registerSideChannel(org.apache.accumulo.core.iterators.SortedKeyValueIterator)>
>>> method
>>> on the IteratorEnvironment passed to iterators through the init() method.
>>> From what I can tell, the method allows you to add another iterator as a
>>> top level source, to be merged in along with other usual top-level sources
>>> such as the in-memory cache and RFiles.
>>>
>>> Are there any downsides to using registerSideChannel( ) to "add new
>>> data" to an iterator chain?  It looks like this is fairly stable, so long
>>> as the iterator we add as a side channel implements seek() properly so as
>>> to only return entries whose rows are within a tablet.  I imagine it works
>>> like so:
>>>
>>> Suppose we set a custom iterator InjectIterator that registers a side
>>> channel inside init() at priority 5 as a one-time major compaction
>>> iterator.  InjectIterator forwards other operations to its parent, as in
>>> WrappingIterator
>>> <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/iterators/WrappingIterator.html>.
>>> We start the compaction:
>>>
>>> Tablet 1 (a,g]
>>>
>>>    1. init() called on InjectIterator.  Creates the side channel
>>>    iterator, calls init() on it, and registers it.
>>>    2. init() called on VersioningIterator.
>>>    3. init() called on top level iterators, including RFiles, in-memory
>>>    cache and the new side channel.
>>>    4. seek( (a,g] ) called on InjectIterator.
>>>    5. seek( (a,g] ) called on VersioningIterator.
>>>    6. seek( (a,g] ) called on top level iterators
>>>    7. next() called on InjectIterator. Forwards to parent.
>>>    8. next() called on VersioningIterator. Forwards to parent.
>>>    9. next() called on top level iterator (a MultiIterator
>>>    <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/iterators/system/MultiIterator.html>).
>>>    The next value is read from all the top-level iterator sources and the one
>>>    with the least key is cached ready to go.
>>>    10. ...
>>>
>>> Tablet 2 (g,p)  --- same as tablet 1 except steps 4-6 call seek( (g,p)
>>> ).  Done in parallel with tablet 1 if on a different tablet server.
>>>
>>> Is this an accurate depiction?  Anything I should treat with caution?
>>> It seems to work on my single-node instance, so tips about difficulties
>>> going to multi-node are good.
>>>
>>> Code available here.
>>> <https://github.com/Accla/d4m_api_java/blob/0d8c62164d5c0b59f949ce23c1b85536809764d2/src/main/java/edu/mit/ll/graphulo/InjectIterator.java#L166>
>>>
>>> Regards,
>>> Dylan Hutchison
>>>
>>> --
>>> www.cs.stevens.edu/~dhutchis
>>>
>>
>
>
> --
> www.cs.stevens.edu/~dhutchis
>
