accumulo-user mailing list archives

From Dylan Hutchison <dhutc...@stevens.edu>
Subject Re: Iterators adding data: IteratorEnvironment.registerSideChannel?
Date Mon, 16 Feb 2015 18:23:06 GMT
> why you want to use a side channel instead of implementing the merge in
> your own iterator
>
Here is a picture showing the difference:

Fig. A: Using a side channel to add a top-level iterator.

RfileIter1  RfileIter2  InjectIterator ...
|          /           /
|_________/           /
o__(3-way merge)_____/
|
VersioningIterator
|
OtherIterators
|
v
...


Fig. B: Merging in the data at a later stage

RfileIter1  RfileIter2  ...
|          /
o_________/
|
VersioningIterator
|
|         InjectIterator
o________/
|
OtherIterators
|
v
...

(note: we're free to add iterators before the VersioningIterator too)

Unless the order of iterators matters (e.g., the VersioningIterator
position matters if InjectIterator generates an entry with the same row,
colFamily and colQualifier as an entry in the table), the two styles will
give the same results.
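
To make the ordering caveat concrete, here is a toy sketch in plain Java. It does not use the Accumulo API: Entry, merge, and newestOnly are made-up stand-ins for a key/value pair, the multi-way merge, and a VersioningIterator that keeps one version, and the sample rows are invented. It only illustrates why a colliding key can make the two figures disagree.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy model only (assumption): real Accumulo keys also carry column and visibility fields.
class MergeOrderSketch {
    static final class Entry {
        final String key;
        final long ts;
        final String value;

        Entry(String key, long ts, String value) {
            this.key = key;
            this.ts = ts;
            this.value = value;
        }

        @Override
        public String toString() {
            return key + "@" + ts + "=" + value;
        }
    }

    // Sort by key ascending, then by timestamp descending (newest first).
    static final Comparator<Entry> ORDER = (a, b) -> {
        int c = a.key.compareTo(b.key);
        return c != 0 ? c : Long.compare(b.ts, a.ts);
    };

    // Stand-in for the merge of several sorted sources: concatenate, then sort.
    static List<Entry> merge(List<List<Entry>> sources) {
        List<Entry> out = new ArrayList<>();
        for (List<Entry> source : sources) {
            out.addAll(source);
        }
        out.sort(ORDER);
        return out;
    }

    // Stand-in for a versioning step that keeps only the newest entry per key.
    static List<Entry> newestOnly(List<Entry> sorted) {
        List<Entry> out = new ArrayList<>();
        String lastKey = null;
        for (Entry e : sorted) {
            if (!e.key.equals(lastKey)) {
                out.add(e);
            }
            lastKey = e.key;
        }
        return out;
    }

    // Fig. A: the injected source joins the top-level merge, above the versioning step.
    static List<Entry> figA(List<Entry> r1, List<Entry> r2, List<Entry> inject) {
        return newestOnly(merge(Arrays.asList(r1, r2, inject)));
    }

    // Fig. B: the injected source is merged in below the versioning step.
    static List<Entry> figB(List<Entry> r1, List<Entry> r2, List<Entry> inject) {
        List<Entry> versioned = newestOnly(merge(Arrays.asList(r1, r2)));
        return merge(Arrays.asList(versioned, inject));
    }

    public static void main(String[] args) {
        List<Entry> r1 = Arrays.asList(new Entry("row1", 5L, "old"));
        List<Entry> r2 = Arrays.asList(new Entry("row2", 7L, "b"));
        List<Entry> colliding = Arrays.asList(new Entry("row1", 9L, "injected"));

        System.out.println("Fig. A: " + figA(r1, r2, colliding));
        System.out.println("Fig. B: " + figB(r1, r2, colliding));
    }
}
```

With the sample data in main, the injected row1 entry collides with an existing row1 entry, so the Fig. A arrangement keeps a single row1 entry while Fig. B lets both through. When the injected key does not collide with anything in the table, both arrangements return the same list.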

> This has implications on composability with other iterators, since
> downstream iterators would not see anything sent to the side channel but
> they would see things merged and returned by a MultiIterator.
>
If the iterator is at the top level, then every iterator below it will see
output from the top-level iterator.  Did you mean composability with other
iterators added at the top level?  If a hypothetical iterator
"InjectIterator2" needs to see the results of "InjectIterator", then we
need to place InjectIterator2 below InjectIterator in the hierarchy,
whether in Fig. A or Fig. B.
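
The placement rule can be sketched as a toy model in plain Java (not real iterator code). Each stage is a function over a sorted list of keys; inject and observe are hypothetical stand-ins for InjectIterator and InjectIterator2, and stages listed later sit lower in the hierarchy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.UnaryOperator;

// Toy model only (assumption): a "stage" transforms the sorted key list handed down from above.
class StackOrderSketch {
    // Hypothetical stand-in for InjectIterator: merges one extra key into the stream.
    static UnaryOperator<List<String>> inject(String extraKey) {
        return in -> {
            List<String> out = new ArrayList<>(in);
            out.add(extraKey);
            Collections.sort(out); // keep the stream in sorted order
            return out;
        };
    }

    // Hypothetical stand-in for InjectIterator2: records every key it sees, passes all through.
    static UnaryOperator<List<String>> observe(List<String> seen) {
        return in -> {
            seen.addAll(in);
            return in;
        };
    }

    // Runs the stages top to bottom; later stages sit lower in the hierarchy.
    static List<String> run(List<String> source, List<UnaryOperator<List<String>>> stages) {
        List<String> data = source;
        for (UnaryOperator<List<String>> stage : stages) {
            data = stage.apply(data);
        }
        return data;
    }

    public static void main(String[] args) {
        List<String> table = Arrays.asList("row1", "row3");

        List<String> seenBelow = new ArrayList<>();
        run(table, Arrays.asList(inject("row2"), observe(seenBelow)));

        List<String> seenAbove = new ArrayList<>();
        run(table, Arrays.asList(observe(seenAbove), inject("row2")));

        System.out.println("observer below injector saw: " + seenBelow);
        System.out.println("observer above injector saw: " + seenAbove);
    }
}
```

In this model the observer only sees the injected key when it is listed after the injector, which matches the rule for both figures.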

For my particular situation, reading from another Accumulo table inside an
iterator, I'm not sure which is better.  I like the idea of adding another
data stream as a top-level source, but Fig. B is possible too.

Regards,
Dylan Hutchison


On Mon, Feb 16, 2015 at 11:34 AM, Adam Fuchs <scubafuchs@gmail.com> wrote:

> Dylan,
>
> If I recall correctly (which I give about 30% odds), the original purpose
> of the side channel was to split up things like delete tombstone entries
> from "regular" entries so that other iterators sitting on top of a
> bifurcating iterator wouldn't have to handle the special tombstone
> preservation logic. This worked in theory, but it never really caught on.
> I'm not sure any operational code is calling the registerSideChannel method
> right now, so you're sort of in pioneering territory. That said, this looks
> like it should work as you described it.
>
> Can you describe why you want to use a side channel instead of
> implementing the merge in your own iterator (e.g. subclassing MultiIterator
> and overriding the init method)? This has implications on composability
> with other iterators, since downstream iterators would not see anything
> sent to the side channel but they would see things merged and returned by a
> MultiIterator.
>
> Adam
>  On Feb 16, 2015 3:18 AM, "Dylan Hutchison" <dhutchis@stevens.edu> wrote:
>
>>> If you can do a merge sort insertion, then you can guarantee order and
>>> it's fine.
>>>
>> Yep, I guarantee the iterator we add as a side channel will emit tuples
>> in sorted order.
>>
>> On a suggestion from David Medinets, I modified my testing code to use a
>> MiniAccumuloCluster set to 2 tablet servers.  I then set a table split on
>> "row3" before launching the compaction.  The result looks good.  Here is
>> output from a run on a local Accumulo instance.  Note that we write more
>> values than we read.
>>
>> 2015-02-16 02:44:51,125 [tserver.Tablet] DEBUG: Starting MajC k;row3<
>> (USER) [hdfs://localhost:9000/accumulo/tables/k/t-00000g4/F00000g5.rf] -->
>> hdfs://localhost:9000/accumulo/tables/k/t-00000g4/A00000g7.rf_tmp
>>  [name:InjectIterator, priority:15,
>> class:edu.mit.ll.graphulo.InjectIterator, properties:{}]
>> 2015-02-16 02:44:51,127 [tserver.Tablet] DEBUG: Starting MajC k<;row3
>> (USER) [hdfs://localhost:9000/accumulo/tables/k/default_tablet/F00000g6.rf]
>> --> hdfs://localhost:9000/accumulo/tables/k/default_tablet/A00000g8.rf_tmp
>>  [name:InjectIterator, priority:15,
>> class:edu.mit.ll.graphulo.InjectIterator, properties:{}]
>> 2015-02-16 02:44:51,190 [tserver.Compactor] DEBUG: *Compaction k<;row3 2
>> read | 4 written* |    111 entries/sec |  0.018 secs
>> 2015-02-16 02:44:51,194 [tserver.Compactor] DEBUG: *Compaction k;row3< 1
>> read | 4 written* |     43 entries/sec |  0.023 secs
>>
>>
>> In addition, output from the DebugIterator looks as expected.  There is a
>> re-seek after reading the first tablet to the key after the last entry
>> returned in the first tablet.
>>
>> DEBUG:
>> init(org.apache.accumulo.core.iterators.system.SynchronizedIterator@15085e63,
>> {}, org.apache.accumulo.tserver.TabletIteratorEnvironment@586cc05e)
>> DEBUG: 0x1C2BFB13 seek((-inf,+inf), [], false)
>>
>> ... <snipped logs>
>>
>> DEBUG:
>> init(org.apache.accumulo.core.iterators.system.SynchronizedIterator@2b048c59,
>> {}, org.apache.accumulo.tserver.TabletIteratorEnvironment@379a3d1f)
>> DEBUG: 0x5946E74B seek([row2 colF3:colQ3 [] 9223372036854775807
>> false,+inf), [], false)
>>
>>
>> It seems the side channel strategy will hold up.  We have opened a new
>> world of Accumulo-foo.  Of course, the real test is a multi-node instance
>> with more than 10 entries of data.
>>
>> Regards, Dylan
>>
>>
>> On Sun, Feb 15, 2015 at 11:17 PM, Andrew Wells <awells@clearedgeit.com>
>> wrote:
>>
>>> The main issue with adding data in an iterator is order. If you can
>>> do a merge sort insertion, then you can guarantee order and it's fine. But
>>> if you are inserting based on input, you cannot guarantee order, and it can
>>> only be a scan iterator.
>>>  On Feb 15, 2015 8:03 PM, "Dylan Hutchison" <dhutchis@stevens.edu>
>>> wrote:
>>>
>>>> Hello all,
>>>>
>>>> I've been toying with the registerSideChannel(iter)
>>>> <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/iterators/IteratorEnvironment.html#registerSideChannel(org.apache.accumulo.core.iterators.SortedKeyValueIterator)>
>>>> method on the IteratorEnvironment passed to iterators through the init() method.
>>>> From what I can tell, the method allows you to add another iterator as a
>>>> top level source, to be merged in along with other usual top-level sources
>>>> such as the in-memory cache and RFiles.
>>>>
>>>> Are there any downsides to using registerSideChannel( ) to "add new
>>>> data" to an iterator chain?  It looks like this is fairly stable, so long
>>>> as the iterator we add as a side channel implements seek() properly so as
>>>> to only return entries whose rows are within a tablet.  I imagine it works
>>>> like so:
>>>>
>>>> Suppose we set a custom iterator InjectIterator that registers a side
>>>> channel inside init() at priority 5 as a one-time major compaction
>>>> iterator.  InjectIterator forwards other operations to its parent, as in
>>>> WrappingIterator
>>>> <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/iterators/WrappingIterator.html>.
>>>> We start the compaction:
>>>>
>>>> Tablet 1 (a,g]
>>>>
>>>>    1. init() called on InjectIterator.  Creates the side channel
>>>>    iterator, calls init() on it, and registers it.
>>>>    2. init() called on VersioningIterator.
>>>>    3. init() called on top level iterators, including Rfiles,
>>>>    in-memory cache and the new side channel.
>>>>    4. seek( (a,g] ) called on InjectIterator.
>>>>    5. seek( (a,g] ) called on VersioningIterator.
>>>>    6. seek( (a,g] ) called on top level iterators
>>>>    7. next() called on InjectIterator. Forwards to parent.
>>>>    8. next() called on VersioningIterator. Forwards to parent.
>>>>    9. next() called on top level iterator (a MultiIterator
>>>>    <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/iterators/system/MultiIterator.html>).
>>>>    The next value is read from all the top-level iterator sources and the
>>>>    one with the least key is cached ready to go.
>>>>    10. ...
>>>>
>>>> Tablet 2 (g,p)  --- same as tablet 1 except steps 4-6 call seek( (g,p)
>>>> ).  Done in parallel with tablet 1 if on a different tablet server.
>>>>
>>>> Is this an accurate depiction?  Anything I should treat with caution?
>>>> It seems to work on my single-node instance, so tips about difficulties
>>>> going to multi-node are good.
>>>>
>>>> Code available here.
>>>> <https://github.com/Accla/d4m_api_java/blob/0d8c62164d5c0b59f949ce23c1b85536809764d2/src/main/java/edu/mit/ll/graphulo/InjectIterator.java#L166>
>>>>
>>>> Regards,
>>>> Dylan Hutchison
>>>>
>>>> --
>>>> www.cs.stevens.edu/~dhutchis
>>>>
>>>
>>
>>
>> --
>> www.cs.stevens.edu/~dhutchis
>>
>


-- 
www.cs.stevens.edu/~dhutchis
