accumulo-user mailing list archives

From Josh Elser
Subject Re: EXTERNAL: Re: Custom Iterators
Date Wed, 22 Aug 2012 23:44:17 GMT
You could compare clone()'ing multiple sources inside of an iterator to 
maintaining multiple pointers at different offsets to a file on disk. 
The clone()'ed iterators are all operating over the same row; however, 
they are all pointing at different offsets (keys).

Concretely, the OrIterator is sent a list of terms to union, and 
clone()'s the source it was given for each term (note the addTerm() 
method on the class). The OrIterator attempts to find the index entries 
for each term, and return the minimum docid to satisfy the 
SortedKeyValueIterator contract.
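
To make that a bit more concrete, here is a rough sketch in plain Java. It 
is not the actual OrIterator code and does not use the Accumulo API: the 
UnionSketch and TermCursor names, and the {term, docid} layout of the row, 
are made up for illustration. Each cursor plays the role of one clone()'ed 
source: they all share the same data but keep their own offset, and a 
PriorityQueue always hands back the cursor sitting on the smallest docid.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-in for the index entries of one row: each entry is
// {term, docid}, and the list is sorted by term and then by docid.
class UnionSketch {

    // One "cloned source": an independent cursor over the same shared data.
    static final class TermCursor implements Comparable<TermCursor> {
        private final List<String[]> row; // shared between cursors, never copied
        private final String term;
        private int pos;                   // this cursor's own offset

        TermCursor(List<String[]> row, String term) {
            this.row = row;
            this.term = term;
            // "seek": move to the first entry at or after this term
            while (pos < row.size() && row.get(pos)[0].compareTo(term) < 0) {
                pos++;
            }
        }

        boolean hasTop() {
            return pos < row.size() && row.get(pos)[0].equals(term);
        }

        String topDocId() {
            return row.get(pos)[1];
        }

        void next() {
            pos++;
        }

        @Override
        public int compareTo(TermCursor other) {
            return topDocId().compareTo(other.topDocId());
        }
    }

    // Union the docids of all terms, smallest first, without duplicates.
    static List<String> union(List<String[]> row, List<String> terms) {
        PriorityQueue<TermCursor> queue = new PriorityQueue<>();
        for (String term : terms) {
            TermCursor cursor = new TermCursor(row, term);
            if (cursor.hasTop()) {
                queue.add(cursor);
            }
        }
        List<String> result = new ArrayList<>();
        while (!queue.isEmpty()) {
            TermCursor lowest = queue.poll();
            String docId = lowest.topDocId();
            // Two terms may point at the same document; emit it only once.
            if (result.isEmpty() || !result.get(result.size() - 1).equals(docId)) {
                result.add(docId);
            }
            lowest.next();
            if (lowest.hasTop()) {
                queue.add(lowest);
            }
        }
        return result;
    }
}
```

A cursor is only re-added to the queue after it has been advanced, so the 
queue never holds an element whose ordering changed underneath it.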

Given your comment on the TermSource.compareTo() method's comment 
(....), yes, it does appear that you have found a bug. That comment 
about "multiple rows in a tablet" should really be removed, IMO. It's 
rather confusing, and shouldn't matter when you're writing an iterator. 
In other words, you, as a developer, don't need to know what rows are 
contained in a tablet. The only issue you need to worry about is if 
you're trying to do some operation *across* rows. Given that all of the 
index entries for a single document are contained in one row (which 
happens to just be a bucket in the Wiki application), this point is 

You might also note that the next() method on the OrIterator doesn't 
check if the new topKey for the term it just advanced is contained in 
the current Range before adding it back to the PriorityQueue. This could 
cause a term that has passed outside of the initial Range provided to 
seek() to be added unnecessarily to said PriorityQueue.
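
A minimal sketch of the missing check, again in plain Java rather than the 
real OrIterator: the RangeCheckedMerge and Cursor names are hypothetical, 
and plain ints with an inclusive [start, end] stand in for Keys and a 
Range. The point is only that a cursor goes back into the PriorityQueue 
when it still has a top *and* that top is still inside the range.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class RangeCheckedMerge {

    // Hypothetical cursor over one sorted array of keys.
    static final class Cursor implements Comparable<Cursor> {
        private final int[] keys;
        private int pos;

        Cursor(int[] keys, int start) {
            this.keys = keys;
            // "seek": skip everything before the start of the range
            while (pos < keys.length && keys[pos] < start) {
                pos++;
            }
        }

        boolean hasTop() {
            return pos < keys.length;
        }

        int top() {
            return keys[pos];
        }

        void next() {
            pos++;
        }

        @Override
        public int compareTo(Cursor other) {
            return Integer.compare(top(), other.top());
        }
    }

    // Merge all sources in key order, returning only keys in [start, end].
    static List<Integer> merge(List<int[]> sources, int start, int end) {
        PriorityQueue<Cursor> queue = new PriorityQueue<>();
        for (int[] source : sources) {
            Cursor cursor = new Cursor(source, start);
            if (cursor.hasTop() && cursor.top() <= end) {
                queue.add(cursor);
            }
        }
        List<Integer> result = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor lowest = queue.poll();
            result.add(lowest.top());
            lowest.next();
            // The range check: a cursor that has run past the end of the
            // range is dropped instead of being put back in the queue.
            if (lowest.hasTop() && lowest.top() <= end) {
                queue.add(lowest);
            }
        }
        return result;
    }
}
```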

+2 bugs

On 08/22/2012 05:22 PM, Cardon, Tejay E wrote:
> William,
> Thanks for the quick response. Let me start by stating what I 
> understand about Iterators (to be sure I’m not completely off my rocker).
> 1. An iterator receives, as its source, another iterator (by way of 
> the init method), which becomes its source of data.
> 2. When seek is called on an iterator, the iterator should respond by 
> moving the pointer to the first key/value that applies to that 
> iterator and is within the range
> a. Depending on the iterator, that may not be the first key in the range
> b. Only keys (and their corresponding values) which include one of the 
> column families listed in the family list should be available as 
> topKey and topValue. (this restriction should continue until seek is 
> called again, meaning that subsequent calls to next will only proceed 
> to key/values that also match the list provided).
> c. Generally speaking, a seek will result in the iterator calling seek 
> on its source iterator (although the parameters passed in may be 
> different)
> 3. If an iterator needs configuration beyond just the source obtained 
> in the init call, it can get that through the options and/or env.
> 4. Iterators do not necessarily return the same types of key/values as 
> they consume, i.e., a Combiner may call next() and getTopValue multiple 
> times each time those methods are called on it. And the value it 
> returns as topKey may be a key that doesn’t actually exist in the 
> datastore itself.
> So my questions:
> Is it correct that once seek is called, only topKeys that conform to 
> the columnFamilies collection should be returned, and that this 
> behavior persists until seek is called again, even when next has been 
> called?
> How do iterators like the OrIterator obtain multiple sources? (I 
> assume you were trying to address that with #3 in your response, but I 
> don’t understand what you mean by clone()ing the source. That would 
> give me copies of the one source, but not multiple sources)
> Why do some iterators have so many constructors if the system will 
> simply construct them from the default constructor?
> Some iterators (such as OrIterator) throw an exception if init is 
> called. How do these iterators get constructed and initialized?
> If OrIterator can do what I’m asking for, how do I get it the “terms” 
> and what format do they come in? You mentioned JEXL expressions, but I 
> haven’t seen anything about them in the documentation.
> As for my statement about the OrIterator and multiple rows, the 
> comments on the compareTo for OrIterator.TermSource state “If your 
> implementation can have more than one row in a tablet, you must 
> compare row key here first, then column qualifier.” But the code does 
> not do so. It may be that I’m just not fully understanding the code, 
> however.
> Finally, I’m actually trying to do something a little more complex 
> than just what I described below. This reply is already too long and 
> had too many questions in it, but I’ll get more detail out after I 
> have a better handle on how the iterator framework works.
> Thanks,
> Tejay
> *From:* William Slacum
> *Sent:* Wednesday, August 22, 2012 3:00 PM
> *To:*
> *Subject:* EXTERNAL: Re: Custom Iterators
> An or clause should be able to handle an enumeration of values, as 
> that's supported in a JEXL expression. It would not, however, surprise 
> me if those iterators could not handle multiple rows in a tablet. If 
> you can reproduce that, please file a ticket. There will be a large 
> update occurring to the Wiki example in the near future.
> Do you have any specific questions about how you should structure your 
> iterator or the contract? Making a tutorial has been on my to do list, 
> but we all know how to-do lists end up...
> The big things to remember are:
> 1) The call order: Your iterator will be created via the default 
> constructor, init() will be called, then seek(). After seek() is 
> called, your iterator should have a top if there is data available. A 
> client then can call hasTop(), getTopKey() and getTopValue() to check 
> and retrieve data (similar to hasNext() and next()) and then next to 
> advance the pointer.
> 2) Your iterator can be destroyed during a scan and then 
> reconstructed, being passed in the last key returned to the client as 
> the start of the range.
> 3) You can have multiple sources feed into a single iterator in a tree 
> like fashion by clone()'ing the source passed in to init.
> On Wed, Aug 22, 2012 at 1:41 PM, Cardon, Tejay E wrote:
> All,
> I’m interested in writing a custom iterator, and I’ve been looking for 
> documentation on how to do so. Thus far, I’ve not been able to find 
> anything beyond the java docs in SortedKeyValueIterator and a few 
> other sub-classes. A few of the examples use Iterators, but provide no 
> real info on how to properly implement one. Is there anywhere to find 
> general guidance on the iterator stack?
> (If you’re interested)
> Specifically, for those that are curious, I’m trying to implement 
> something similar to the wikisearch example, but with some key 
> differences. In my case, I’ve got a file with various attributes that 
> are being indexed. So for each file there are 5 attributes, and each 
> attribute has a fixed number of possible values. For example (totally 
> made up):
> personID, gender, hair color, country, race, personRecord
> Row:binID; ColFam:Attribute_AttributeValue; ColQ:PersonID; Val:blank
> Row:binID; ColFam:”D”; ColQ:personID; value:personRecord
> A typical query would be:
> Give me the personRecord for all people with:
> Gender: male &
> Hair color: blond or brown &
> Country: USA or England or china or korea &
> Race: white or oriental
> The existing Iterators used in the wikisearch example are unable to 
> handle the “or” clauses in each attribute.
> The OrIterator doesn’t appear to handle the possibility of more than one 
> row per tablet.
> Thanks,
> Tejay Cardon
