accumulo-user mailing list archives

From Adam Fuchs <scubafu...@gmail.com>
Subject Re: Iterators adding data: IteratorEnvironment.registerSideChannel?
Date Mon, 16 Feb 2015 18:56:27 GMT
"top-level" with respect to the side channel description is inverted with
respect to your diagram. Fig. A should be more like this:

RfileIter1  RfileIter2
|          /
|_________/
Merge
|
VersioningIterator
|
OtherIterators   InjectIterator
|               /
|______________/
Merge
|
v

Thus, VersioningIterator and OtherIterators don't see any of the entries
coming from InjectIterator.
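
A minimal sketch of the two layerings, in plain Java rather than the Accumulo API. Everything here is a hypothetical stand-in: `Entry` collapses row and column into one key string, `merge` plays the role of the merge step, and `newestVersionOnly` mimics a VersioningIterator configured to keep one version. It only illustrates why the position of the injected stream matters when it emits a key that already exists in the table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SideChannelSketch {
    /** Hypothetical simplified entry: row and column collapsed into one key string. */
    static final class Entry {
        final String key; final long ts; final String value;
        Entry(String key, long ts, String value) { this.key = key; this.ts = ts; this.value = value; }
        @Override public String toString() { return key + "@" + ts + "=" + value; }
    }

    /** Sort order assumed here: keys ascending, newest timestamp first within a key. */
    static int compare(Entry a, Entry b) {
        int c = a.key.compareTo(b.key);
        return c != 0 ? c : Long.compare(b.ts, a.ts);
    }

    /** Stand-in for the merge step: combines two already-sorted streams. */
    static List<Entry> merge(List<Entry> a, List<Entry> b) {
        List<Entry> out = new ArrayList<>(a.size() + b.size());
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            out.add(compare(a.get(i), b.get(j)) <= 0 ? a.get(i++) : b.get(j++));
        }
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }

    /** Stand-in for a VersioningIterator keeping one version: first entry per key wins. */
    static List<Entry> newestVersionOnly(List<Entry> sorted) {
        List<Entry> out = new ArrayList<>();
        String lastKey = null;
        for (Entry e : sorted) {
            if (!e.key.equals(lastKey)) out.add(e);
            lastKey = e.key;
        }
        return out;
    }

    /** Injected entries are merged in after the iterator stack, so versioning never sees them. */
    static List<Entry> sideChannel(List<Entry> r1, List<Entry> r2, List<Entry> injected) {
        return merge(newestVersionOnly(merge(r1, r2)), injected);
    }

    /** Injected entries are merged with the RFile sources first, so versioning applies to them. */
    static List<Entry> threeWayMerge(List<Entry> r1, List<Entry> r2, List<Entry> injected) {
        return newestVersionOnly(merge(merge(r1, r2), injected));
    }

    public static void main(String[] args) {
        List<Entry> rfile1 = Arrays.asList(new Entry("row1", 5, "a"), new Entry("row3", 5, "c"));
        List<Entry> rfile2 = Arrays.asList(new Entry("row1", 9, "a2"), new Entry("row2", 5, "b"));
        List<Entry> injected = Arrays.asList(new Entry("row1", 1, "inj"), new Entry("row4", 1, "inj4"));
        System.out.println(sideChannel(rfile1, rfile2, injected));
        System.out.println(threeWayMerge(rfile1, rfile2, injected));
    }
}
```

With this sample data the side-channel layering returns five entries, because the injected `row1` entry survives alongside the table's newest `row1`; the three-way-merge layering returns four, because versioning drops the older injected `row1`.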

Adam


On Mon, Feb 16, 2015 at 1:23 PM, Dylan Hutchison <dhutchis@stevens.edu>
wrote:

>> why you want to use a side channel instead of implementing the merge in
>> your own iterator
>>
> Here is a picture showing the difference--
>
> Fig. A: Using a side channel to add a top-level iterator.
>
> RfileIter1  RfileIter2  InjectIterator ...
> |          /           /
> |_________/           /
> o__(3-way merge)_____/
> |
> VersioningIterator
> |
> OtherIterators
> |
> v
> ...
>
>
> Fig. B: Merging in the data at a later stage
>
> RfileIter1  RfileIter2  ...
> |          /
> o_________/
> |
> VersioningIterator
> |
> |         InjectIterator
> o________/
> |
> OtherIterators
> |
> v
> ...
>
> (note: we're free to add iterators before the VersioningIterator too)
>
> Unless the order of iterators matters (e.g., the VersioningIterator
> position matters if InjectIterator generates an entry with the same row,
> colFamily and colQualifier as an entry in the table), the two styles will
> give the same results.
>
>> This has implications on composability with other iterators, since
>> downstream iterators would not see anything sent to the side channel but
>> they would see things merged and returned by a MultiIterator.
>>
> If the iterator is at the top level, then every iterator below it will see
> output from the top level iterator.  Did you mean composability with other
> iterators added at the top level?  If hypothetical iterator
> "InjectIterator2" needs to see the results of "InjectIterator", then we
> need to place InjectIterator2 below InjectIterator on the hierarchy,
> whether in Fig. A or Fig. B.
>
> For my particular situation, reading from another Accumulo table inside an
> iterator, I'm not sure which is better.  I like the idea of adding another
> data stream as a top-level source, but Fig. B is possible too.
>
> Regards,
> Dylan Hutchison
>
>
> On Mon, Feb 16, 2015 at 11:34 AM, Adam Fuchs <scubafuchs@gmail.com> wrote:
>
>> Dylan,
>>
>> If I recall correctly (which I give about 30% odds), the original purpose
>> of the side channel was to split up things like delete tombstone entries
>> from "regular" entries so that other iterators sitting on top of a
>> bifurcating iterator wouldn't have to handle the special tombstone
>> preservation logic. This worked in theory, but it never really caught on.
>> I'm not sure any operational code is calling the registerSideChannel method
>> right now, so you're sort of in pioneering territory. That said, this looks
>> like it should work as you described it.
>>
>> Can you describe why you want to use a side channel instead of
>> implementing the merge in your own iterator (e.g. subclassing MultiIterator
>> and overriding the init method)? This has implications on composability
>> with other iterators, since downstream iterators would not see anything
>> sent to the side channel but they would see things merged and returned by a
>> MultiIterator.
>>
>> Adam
>>  On Feb 16, 2015 3:18 AM, "Dylan Hutchison" <dhutchis@stevens.edu> wrote:
>>
>>>> If you can do a merge sort insertion, then you can guarantee order and
>>>> it's fine.
>>>>
>>> Yep, I guarantee the iterator we add as a side channel will emit tuples
>>> in sorted order.
>>>
>>> On a suggestion from David Medinets, I modified my testing code to use a
>>> MiniAccumuloCluster set to 2 tablet servers.  I then set a table split on
>>> "row3" before launching the compaction.  The result looks good.  Here is
>>> output from a run on a local Accumulo instance.  Note that we write more
>>> values than we read.
>>>
>>> 2015-02-16 02:44:51,125 [tserver.Tablet] DEBUG: Starting MajC k;row3<
>>> (USER) [hdfs://localhost:9000/accumulo/tables/k/t-00000g4/F00000g5.rf] -->
>>> hdfs://localhost:9000/accumulo/tables/k/t-00000g4/A00000g7.rf_tmp
>>>  [name:InjectIterator, priority:15,
>>> class:edu.mit.ll.graphulo.InjectIterator, properties:{}]
>>> 2015-02-16 02:44:51,127 [tserver.Tablet] DEBUG: Starting MajC k<;row3
>>> (USER) [hdfs://localhost:9000/accumulo/tables/k/default_tablet/F00000g6.rf]
>>> --> hdfs://localhost:9000/accumulo/tables/k/default_tablet/A00000g8.rf_tmp
>>>  [name:InjectIterator, priority:15,
>>> class:edu.mit.ll.graphulo.InjectIterator, properties:{}]
>>> 2015-02-16 02:44:51,190 [tserver.Compactor] DEBUG: *Compaction k<;row3
>>> 2 read | 4 written* |    111 entries/sec |  0.018 secs
>>> 2015-02-16 02:44:51,194 [tserver.Compactor] DEBUG: *Compaction k;row3<
>>> 1 read | 4 written* |     43 entries/sec |  0.023 secs
>>>
>>>
>>> In addition, output from the DebugIterator looks as expected.  There is
>>> a re-seek after reading the first tablet to the key after the last entry
>>> returned in the first tablet.
>>>
>>> DEBUG:
>>> init(org.apache.accumulo.core.iterators.system.SynchronizedIterator@15085e63,
>>> {}, org.apache.accumulo.tserver.TabletIteratorEnvironment@586cc05e)
>>> DEBUG: 0x1C2BFB13 seek((-inf,+inf), [], false)
>>>
>>> ... <snipped logs>
>>>
>>> DEBUG:
>>> init(org.apache.accumulo.core.iterators.system.SynchronizedIterator@2b048c59,
>>> {}, org.apache.accumulo.tserver.TabletIteratorEnvironment@379a3d1f)
>>> DEBUG: 0x5946E74B seek([row2 colF3:colQ3 [] 9223372036854775807
>>> false,+inf), [], false)
>>>
>>>
>>> It seems the side channel strategy will hold up.  We have opened a new
>>> world of Accumulo-foo.  Of course, the real test is a multi-node instance
>>> with more than 10 entries of data.
>>>
>>> Regards, Dylan
>>>
>>>
>>> On Sun, Feb 15, 2015 at 11:17 PM, Andrew Wells <awells@clearedgeit.com>
>>> wrote:
>>>
>>>> The main issue with adding data in an iterator is order. If you can
>>>> do a merge sort insertion, then you can guarantee order and it's fine.
>>>> But if you are inserting based on input, you cannot guarantee order, and it
>>>> can only be a scan iterator.
>>>>  On Feb 15, 2015 8:03 PM, "Dylan Hutchison" <dhutchis@stevens.edu>
>>>> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> I've been toying with the registerSideChannel(iter)
>>>>> <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/iterators/IteratorEnvironment.html#registerSideChannel(org.apache.accumulo.core.iterators.SortedKeyValueIterator)>
>>>>> method on the IteratorEnvironment passed to iterators through the init()
>>>>> method.  From what I can tell, the method allows you to add another
>>>>> iterator as a top level source, to be merged in along with other usual
>>>>> top-level sources such as the in-memory cache and RFiles.
>>>>>
>>>>> Are there any downsides to using registerSideChannel( ) to "add new
>>>>> data" to an iterator chain?  It looks like this is fairly stable, so long
>>>>> as the iterator we add as a side channel implements seek() properly so as
>>>>> to only return entries whose rows are within a tablet.  I imagine it works
>>>>> like so:
>>>>>
>>>>> Suppose we set a custom iterator InjectIterator that registers a side
>>>>> channel inside init() at priority 5 as a one-time major compaction
>>>>> iterator.  InjectIterator forwards other operations to its parent, as in
>>>>> WrappingIterator
>>>>> <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/iterators/WrappingIterator.html>.
>>>>> We start the compaction:
>>>>>
>>>>> Tablet 1 (a,g]
>>>>>
>>>>>    1. init() called on InjectIterator.  Creates the side channel
>>>>>    iterator, calls init() on it, and registers it.
>>>>>    2. init() called on VersioningIterator.
>>>>>    3. init() called on top level iterators, including Rfiles,
>>>>>    in-memory cache and the new side channel.
>>>>>    4. seek( (a,g] ) called on InjectIterator.
>>>>>    5. seek( (a,g] ) called on VersioningIterator.
>>>>>    6. seek( (a,g] ) called on top level iterators
>>>>>    7. next() called on InjectIterator. Forwards to parent.
>>>>>    8. next() called on VersioningIterator. Forwards to parent.
>>>>>    9. next() called on top level iterator (a MultiIterator
>>>>>    <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/iterators/system/MultiIterator.html>).
>>>>>    The next value is read from all the top-level iterator sources and
>>>>>    the one with the least key is cached ready to go.
>>>>>    10. ...
>>>>>
>>>>> Tablet 2 (g,p)  --- same as tablet 1 except steps 4-6 call seek( (g,p)
>>>>> ).  Done in parallel with tablet 1 if on a different tablet server.
>>>>>
>>>>> Is this an accurate depiction?  Anything I should treat with caution?
>>>>> It seems to work on my single-node instance, so tips about difficulties
>>>>> going to multi-node are good.
>>>>>
>>>>> Code available here.
>>>>> <https://github.com/Accla/d4m_api_java/blob/0d8c62164d5c0b59f949ce23c1b85536809764d2/src/main/java/edu/mit/ll/graphulo/InjectIterator.java#L166>
>>>>>
>>>>> Regards,
>>>>> Dylan Hutchison
>>>>>
>>>>> --
>>>>> www.cs.stevens.edu/~dhutchis
>>>>>
>>>>
>>>
>>>
>>> --
>>> www.cs.stevens.edu/~dhutchis
>>>
>>
>
>
> --
> www.cs.stevens.edu/~dhutchis
>
