accumulo-user mailing list archives

From Dylan Hutchison <dhutc...@stevens.edu>
Subject Re: Iterators adding data: IteratorEnvironment.registerSideChannel?
Date Mon, 16 Feb 2015 08:14:28 GMT
>
> If you can do a merge sort insertion, then you can guarantee order and
> it's fine.
>
Yep, I guarantee the iterator we add as a side channel will emit tuples in
sorted order.
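
To make the ordering argument concrete, here is a minimal sketch of a merge over several sorted sources. It uses plain Java collections only. SortedMergeSketch, Cursor and merge() are hypothetical names for illustration and not Accumulo classes, and bare row strings stand in for full keys. The point is just that if every source (side channel included) is sorted, the merged stream is sorted too.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-in for a merge over sorted top-level sources; not Accumulo code.
class SortedMergeSketch {
    // One cursor per source: its current head entry plus the remaining entries.
    private static final class Cursor implements Comparable<Cursor> {
        String head;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.head = it.next();
        }

        @Override
        public int compareTo(Cursor other) {
            return head.compareTo(other.head);
        }
    }

    // Repeatedly emits the smallest head among all sources.
    // The output is sorted only if every input source is sorted.
    static List<String> merge(List<List<String>> sources) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (List<String> source : sources) {
            if (!source.isEmpty()) {
                heap.add(new Cursor(source.iterator()));
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor smallest = heap.poll();
            out.add(smallest.head);
            if (smallest.rest.hasNext()) {
                smallest.head = smallest.rest.next();
                heap.add(smallest); // re-insert with its new head
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rfile = Arrays.asList("row1", "row4");               // existing data
        List<String> sideChannel = Arrays.asList("row2", "row3", "row5"); // injected data
        System.out.println(merge(Arrays.asList(rfile, sideChannel)));
    }
}
```

Running main prints [row1, row2, row3, row4, row5]. If the side channel emitted its entries out of order, the merge would pass that disorder straight through, which is the case Andrew warns about.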

On a suggestion from David Medinets, I modified my testing code to use a
MiniAccumuloCluster set to 2 tablet servers.  I then set a table split on
"row3" before launching the compaction.  The result looks good.  Here is
output from a run on a local Accumulo instance.  Note that we write more
values than we read.

2015-02-16 02:44:51,125 [tserver.Tablet] DEBUG: Starting MajC k;row3<
(USER) [hdfs://localhost:9000/accumulo/tables/k/t-00000g4/F00000g5.rf] -->
hdfs://localhost:9000/accumulo/tables/k/t-00000g4/A00000g7.rf_tmp
 [name:InjectIterator, priority:15,
class:edu.mit.ll.graphulo.InjectIterator, properties:{}]
2015-02-16 02:44:51,127 [tserver.Tablet] DEBUG: Starting MajC k<;row3
(USER) [hdfs://localhost:9000/accumulo/tables/k/default_tablet/F00000g6.rf]
--> hdfs://localhost:9000/accumulo/tables/k/default_tablet/A00000g8.rf_tmp
 [name:InjectIterator, priority:15,
class:edu.mit.ll.graphulo.InjectIterator, properties:{}]
2015-02-16 02:44:51,190 [tserver.Compactor] DEBUG: *Compaction k<;row3 2
read | 4 written* |    111 entries/sec |  0.018 secs
2015-02-16 02:44:51,194 [tserver.Compactor] DEBUG: *Compaction k;row3< 1
read | 4 written* |     43 entries/sec |  0.023 secs


In addition, output from the DebugIterator looks as expected.  There is a
re-seek after reading the first tablet to the key after the last entry
returned in the first tablet.

DEBUG:
init(org.apache.accumulo.core.iterators.system.SynchronizedIterator@15085e63,
{}, org.apache.accumulo.tserver.TabletIteratorEnvironment@586cc05e)
DEBUG: 0x1C2BFB13 seek((-inf,+inf), [], false)

... <snipped logs>

DEBUG:
init(org.apache.accumulo.core.iterators.system.SynchronizedIterator@2b048c59,
{}, org.apache.accumulo.tserver.TabletIteratorEnvironment@379a3d1f)
DEBUG: 0x5946E74B seek([row2 colF3:colQ3 [] 9223372036854775807
false,+inf), [], false)
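
The seek behaviour the side channel needs can be sketched roughly as follows, again in plain Java with no Accumulo dependency. SeekableSourceSketch and its methods are hypothetical names, rows stand in for full keys, and the range is modelled as (startExclusive, endInclusive] with null meaning unbounded. The idea is that each tablet's compaction, and any re-seek, only ever sees the entries inside the range it asked for.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory sorted source; illustrates range-clipped seeks only.
class SeekableSourceSketch {
    private final NavigableMap<String, String> data = new TreeMap<>();

    SeekableSourceSketch put(String row, String value) {
        data.put(row, value);
        return this;
    }

    // Returns the rows in (startExclusive, endInclusive], in sorted order.
    // A null bound means unbounded on that side.
    // Assumes startExclusive < endInclusive when both are given.
    List<String> seek(String startExclusive, String endInclusive) {
        NavigableMap<String, String> view = data;
        if (startExclusive != null) {
            view = view.tailMap(startExclusive, false);
        }
        if (endInclusive != null) {
            view = view.headMap(endInclusive, true);
        }
        return new ArrayList<>(view.keySet());
    }

    public static void main(String[] args) {
        SeekableSourceSketch source = new SeekableSourceSketch()
                .put("row1", "a").put("row2", "b").put("row3", "c")
                .put("row4", "d").put("row5", "e");
        System.out.println(source.seek(null, "row3"));   // tablet ending at the split
        System.out.println(source.seek("row3", null));   // tablet after the split
        System.out.println(source.seek("row2", "row3")); // re-seek past the last key returned
    }
}
```

With a split on row3, the two tablet ranges partition the source, so no injected entry is written twice or dropped, and a re-seek after row2 resumes at row3 without repeating anything.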


It seems the side channel strategy will hold up.  We have opened a new
world of Accumulo-foo.  Of course, the real test is a multi-node instance
with more than 10 entries of data.

Regards, Dylan


On Sun, Feb 15, 2015 at 11:17 PM, Andrew Wells <awells@clearedgeit.com>
wrote:

> The main issue with adding data in an iterator is order. If you can do a
> merge sort insertion, then you can guarantee order and it's fine. But if
> you are inserting based on input, you cannot guarantee order, and it can
> only be a scan iterator.
>  On Feb 15, 2015 8:03 PM, "Dylan Hutchison" <dhutchis@stevens.edu> wrote:
>
>> Hello all,
>>
>> I've been toying with the registerSideChannel(iter)
>> <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/iterators/IteratorEnvironment.html#registerSideChannel(org.apache.accumulo.core.iterators.SortedKeyValueIterator)>
>> method
>> on the IteratorEnvironment passed to iterators through the init() method.
>> From what I can tell, the method allows you to add another iterator as a
>> top level source, to be merged in along with other usual top-level sources
>> such as the in-memory cache and RFiles.
>>
>> Are there any downsides to using registerSideChannel( ) to "add new data"
>> to an iterator chain?  It looks like this is fairly stable, so long as the
>> iterator we add as a side channel implements seek() properly so as to only
>> return entries whose rows are within a tablet.  I imagine it works like so:
>>
>> Suppose we set a custom iterator InjectIterator that registers a side
>> channel inside init() at priority 5 as a one-time major compaction
>> iterator.  InjectIterator forwards other operations to its parent, as in
>> WrappingIterator
>> <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/iterators/WrappingIterator.html>.
>> We start the compaction:
>>
>> Tablet 1 (a,g]
>>
>>    1. init() called on InjectIterator.  Creates the side channel
>>    iterator, calls init() on it, and registers it.
>>    2. init() called on VersioningIterator.
>>    3. init() called on top level iterators, including RFiles, in-memory
>>    cache and the new side channel.
>>    4. seek( (a,g] ) called on InjectIterator.
>>    5. seek( (a,g] ) called on VersioningIterator.
>>    6. seek( (a,g] ) called on top level iterators
>>    7. next() called on InjectIterator. Forwards to parent.
>>    8. next() called on VersioningIterator. Forwards to parent.
>>    9. next() called on top level iterator (a MultiIterator
>>    <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/iterators/system/MultiIterator.html>).
>>    The next value is read from all the top-level iterator sources and the one
>>    with the least key is cached ready to go.
>>    10. ...
>>
>> Tablet 2 (g,p)  --- same as tablet 1 except steps 4-6 call seek( (g,p)
>> ).  Done in parallel with tablet 1 if on a different tablet server.
>>
>> Is this an accurate depiction?  Anything I should treat with caution?  It
>> seems to work on my single-node instance, so tips about difficulties going
>> to multi-node are good.
>>
>> Code available here.
>> <https://github.com/Accla/d4m_api_java/blob/0d8c62164d5c0b59f949ce23c1b85536809764d2/src/main/java/edu/mit/ll/graphulo/InjectIterator.java#L166>
>>
>> Regards,
>> Dylan Hutchison
>>
>> --
>> www.cs.stevens.edu/~dhutchis
>>
>


-- 
www.cs.stevens.edu/~dhutchis
