accumulo-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sukant Hajra" <qn2b6c2...@snkmail.com>
Subject Re: strategies beyond intersecting iterators?
Date Mon, 02 Jul 2012 04:43:26 GMT
Excerpts from William Slacum's message of Sun Jul 01 23:23:50 -0500 2012:
>
> To give a bit of a visualization, you can think of this structure in tree
> form:
> 
> 
>               Intersection
>          /           |            \
>    "Sukant"   "Bill"        "Josh"
>       [0,          [1,            [3,
>        3,           2,             4,
>        6]           3]             5]
> 
> 
> On our first pass, the IntersectingIterator will note that its children
> point to the document IDs 0, 1 and 3. Since each list of doc IDs is sorted,
> we can deduce that the earliest doc ID that could be a potential match is
> 3. So, it will seek the term sources for "Sukant" and "Bill" to at least
> the key {row: "partitionN", colf: <term>, colq: "3"}. On the next pass,
> we'll note that each term source is pointing to doc ID 3. This means we've
> found an intersection, so the top level IntersectingIterator will return
> docID 3.
> 
> When the session requests the next matching docID, the iterator will
> advance each iterator by calling next(). The IntersectingIterator now sees
> its children are all positioned at docIDs [6, null, 5] (the `null` value
> arises because the "Bill" term source doesn't have a key beyond {row:
> "partitionN", colf: "Bill", colq: "3"}. This state means that the
> intersection is done, because one of the term sources has exhausted its
> possible values, so there's no doc ID that will occur in all three lists.

Hi Bill,

Thanks for the great illustration above of the algorithm.  I get the feeling
that this algorithm is pretty central to the implementation, and unlikely to
change.   I don't feel guilty relying on it for ordering.

Does that assumption sound dangerous?

Assuming it's not.  My next question comes from reading more closely the
documentation for IndexedDocIterator [1], which indicates that we're not
indexing raw doc IDs, but at tuple of (docType, docId).

Is this correct?

The documentation is a little ambiguous of what the docType is for, but I'm
wondering if I can use it to assert an ordering for the results of an
intersecting iterator.  For instance, we may use something timestamp-y for the
docType if that's the case.

I'll have prototypes done early next week to confirm my suspicions above, but
in the meantime, it's great to get expert feedback.

Thanks all,
Sukant

[1] http://accumulo.apache.org/1.4/apidocs/

Mime
View raw message