accumulo-user mailing list archives

From "Cardon, Tejay E" <tejay.e.car...@lmco.com>
Subject RE: EXTERNAL: Re: Custom Iterators
Date Thu, 23 Aug 2012 14:17:05 GMT
And I'm actually looking at the OrIterator in 1.4.1.  I really need to pull trunk just for
the additional insights it may give me, but ultimately I'll be running on the 1.4.1 release.

Tejay

-----Original Message-----
From: Josh Elser [mailto:josh.elser@gmail.com] 
Sent: Wednesday, August 22, 2012 5:55 PM
To: user@accumulo.apache.org
Subject: Re: EXTERNAL: Re: Custom Iterators

... and I just realized I was looking at the OrIterator in trunk, not contrib/wikisearch x.x

Still, I think most of my comments still apply. Should verify with test cases...

On 08/22/2012 06:44 PM, Josh Elser wrote:
> You could compare clone()'ing multiple sources inside of an iterator 
> to maintaining multiple pointers at different offsets to a file on 
> disk. The clone()'ed iterators are all operating over the same row; 
> however, they are all pointing at different offsets (keys).
>
> Concretely, the OrIterator is sent a list of terms to union, and 
> clone()'s the source it was given for each term (note the addTerm() 
> method on the class). The OrIterator attempts to find the index 
> entries for each term, and return the minimum docid to satisfy the 
> SortedKeyValueIterator contract.
>
> Given your comment on the TermSource.compareTo() method's comment 
> (....), yes, it does appear that you have found a bug. That comment 
> about "multiple rows in a tablet" should really be removed, IMO. It's 
> rather confusing, and shouldn't matter when you're writing an 
> iterator. In other words, you, as a developer, don't need to know what 
> rows are contained in a tablet. The only issue you need to worry about 
> is if you're trying to do some operation *across* rows. Given that all 
> of the index entries for a single document are contained in one row 
> (which happens to just be a bucket in the Wiki application), this 
> point is meaningless.
>
> You might also note that the next() method on the OrIterator doesn't 
> check if the new topKey for the term it just advanced is contained in 
> the current Range before adding it back to the PriorityQueue. This 
> could cause a term that has passed outside of the initial Range 
> provided to seek() to be added unnecessarily to said PriorityQueue.
>
> +2 bugs
>
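To check my reading of the two points above (one clone()'d cursor per term, and a Range check before a term goes back on the PriorityQueue), here is a minimal sketch in plain Java. It does not use the Accumulo API: `Cursor`, `copy()`, `union()` and the `{term, docId}` pairs are hypothetical stand-ins for the source iterator, clone(), the OrIterator and the index entries within one row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-in for a source iterator (NOT the Accumulo API):
// a cursor over one shared, sorted list of {term, docId} index entries.
class Cursor {
    final List<String[]> data; // shared by every copy
    String term;               // the term this cursor is restricted to
    int pos;                   // this copy's private offset

    Cursor(List<String[]> data) { this.data = data; }

    // Plays the role of clone(): same data, independent position.
    Cursor copy() { return new Cursor(data); }

    // Move to the first entry for `term` whose docId is >= startDoc.
    void seek(String term, String startDoc) {
        this.term = term;
        pos = 0;
        while (pos < data.size() && compare(data.get(pos), term, startDoc) < 0) pos++;
    }

    static int compare(String[] entry, String term, String doc) {
        int c = entry[0].compareTo(term);
        return c != 0 ? c : entry[1].compareTo(doc);
    }

    // True while the cursor sits on an entry for its term inside the range.
    boolean hasTop(String endDoc) {
        return pos < data.size()
                && data.get(pos)[0].equals(term)
                && data.get(pos)[1].compareTo(endDoc) <= 0;
    }

    String topDoc() { return data.get(pos)[1]; }
    void next() { pos++; }
}

class OrUnionSketch {
    // Union of the docIds of all terms within [startDoc, endDoc], in sorted order.
    static List<String> union(List<String[]> index, List<String> terms,
                              String startDoc, String endDoc) {
        Cursor source = new Cursor(index);
        PriorityQueue<Cursor> queue =
                new PriorityQueue<>((a, b) -> a.topDoc().compareTo(b.topDoc()));
        for (String t : terms) {
            Cursor c = source.copy();           // one independent cursor per term
            c.seek(t, startDoc);
            if (c.hasTop(endDoc)) queue.add(c);
        }
        List<String> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor c = queue.poll();            // cursor with the minimum docId
            String doc = c.topDoc();
            if (out.isEmpty() || !out.get(out.size() - 1).equals(doc)) out.add(doc);
            c.next();
            if (c.hasTop(endDoc)) queue.add(c); // range check before re-queueing
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> index = Arrays.asList(
                new String[]{"hair_blond", "p1"},
                new String[]{"hair_blond", "p4"},
                new String[]{"hair_brown", "p2"},
                new String[]{"hair_brown", "p4"},
                new String[]{"hair_brown", "p9"});
        // prints [p1, p2, p4]
        System.out.println(union(index, Arrays.asList("hair_blond", "hair_brown"), "p1", "p5"));
    }
}
```

In this sketch, dropping the `hasTop(endDoc)` check before re-queueing would let `p9` back onto the queue even though it lies outside the range, which is how I understand the second bug described above.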
> On 08/22/2012 05:22 PM, Cardon, Tejay E wrote:
>>
>> William,
>>
>> Thanks for the quick response. Let me start by stating what I 
>> understand about Iterators (to be sure I'm not completely off my 
>> rocker).
>>
>> 1. An iterator receives, as its source, another iterator (by way of 
>> the init method), which becomes its source of data.
>>
>> 2. When seek is called on an iterator, the iterator should respond by 
>> moving the pointer to the first key/value that applies to that 
>> iterator and is within the range.
>>
>> a. Depending on the iterator, that may not be the first key in the 
>> range
>>
>> b. Only keys (and their corresponding values) which include one of 
>> the column families listed in the family list should be available as 
>> topKey and topValue. (This restriction should continue until seek is 
>> called again, meaning that subsequent calls to next will only proceed 
>> to key/values that also match the list provided.)
>>
>> c. Generally speaking, a seek will result in the iterator calling 
>> seek on its source iterator (although the parameters passed in may be
>> different)
>>
>> 3. If an iterator needs configuration beyond just the source obtained 
>> in the init call, it can get that through the options and/or env.
>>
>> 4. Iterators do not necessarily return the same types of key/values 
>> as they consume, i.e., a Combiner may call next() and getTopValue() on 
>> its source multiple times each time those methods are called on it. 
>> And the key it returns as topKey may be a key that doesn't actually 
>> exist in the datastore itself.
>>
>> So my questions:
>>
>> Is it correct that once seek is called, only topKeys that conform to 
>> the columnFamilies collection should be returned, and that this 
>> behavior persists until seek is called again, even when next has been 
>> called?
>>
>> How do iterators like the OrIterator obtain multiple sources? (I 
>> assume you were trying to address that with #3 in your response, but 
>> I don't understand what you mean by clone()ing the source. That would 
>> give me copies of the one source, but not multiple sources)
>>
>> Why do some iterators have so many constructors if the system will 
>> simply construct them from the default constructor?
>>
>> Some iterators (such as OrIterator) throw an exception if init is 
>> called. How do these iterators get constructed and initialized?
>>
>> If OrIterator can do what I'm asking for, how do I get it the "terms" 
>> and what format do they come in? You mentioned JEXL expressions, but 
>> I haven't seen anything about them in the documentation.
>>
>> As for my statement about the OrIterator and multiple rows, the 
>> comments on the compareTo for OrIterator.TermSource state "If your 
>> implementation can have more than one row in a tablet, you must 
>> compare row key here first, then column qualifier." But the code does 
>> not do so. It may be that I'm just not fully understanding the code, 
>> however.
>>
>> Finally, I'm actually trying to do something a little more complex 
>> than just what I described below. This reply is already too long and 
>> had too many questions in it, but I'll get more detail out after I 
>> have a better handle on how the iterator framework works.
>>
>>
>> Thanks,
>>
>> Tejay
>>
>> *From:*William Slacum [mailto:wilhelm.von.cloud@accumulo.net]
>> *Sent:* Wednesday, August 22, 2012 3:00 PM
>> *To:* user@accumulo.apache.org
>> *Subject:* EXTERNAL: Re: Custom Iterators
>>
>> An or clause should be able to handle an enumeration of values, as 
>> that's supported in a JEXL expression. It would not, however, 
>> surprise me if those iterators could not handle multiple rows in a 
>> tablet. If you can reproduce that, please file a ticket. There will 
>> be a large update occurring to the Wiki example in the near future.
>>
>> Do you have any specific questions about how you should structure 
>> your iterator or the contract? Making a tutorial has been on my to-do 
>> list, but we all know how to-do lists end up...
>>
>> The big things to remember are:
>>
>> 1) The call order: Your iterator will be created via the default 
>> constructor, init() will be called, then seek(). After seek() is 
>> called, your iterator should have a top if there is data available. A 
>> client then can call hasTop(), getTopKey() and getTopValue() to check 
>> and retrieve data (similar to hasNext() and next()) and then next to 
>> advance the pointer.
>>
>> 2) Your iterator can be destroyed during a scan and then 
>> reconstructed, being passed in the last key returned to the client as 
>> the start of the range.
>>
>> 3) You can have multiple sources feed into a single iterator in a 
>> tree like fashion by clone()'ing the source passed in to init.
>>
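To make sure I follow points 1 and 2, here is a small sketch of the call order and of the tear-down/rebuild behaviour. It is plain Java, not the real SortedKeyValueIterator interface: `SimpleIter`, `ListSource`, `PrefixFilter`, the `"prefix"` option and the batch size are all made up for illustration. I am also assuming that the last key returned is treated as an exclusive start when the iterator is rebuilt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for SortedKeyValueIterator (keys only).
interface SimpleIter {
    void init(SimpleIter source, Map<String, String> options);
    void seek(String start, boolean inclusive);
    boolean hasTop();
    String getTopKey();
    void next();
}

// Bottom of the stack: walks a sorted list of keys.
class ListSource implements SimpleIter {
    private final List<String> keys;
    private int pos;
    ListSource(List<String> keys) { this.keys = keys; }
    public void init(SimpleIter source, Map<String, String> options) { }
    public void seek(String start, boolean inclusive) {
        pos = 0;
        while (pos < keys.size()) {
            int c = keys.get(pos).compareTo(start);
            if (c > 0 || (c == 0 && inclusive)) break;
            pos++;
        }
    }
    public boolean hasTop() { return pos < keys.size(); }
    public String getTopKey() { return keys.get(pos); }
    public void next() { pos++; }
}

// A custom iterator: exposes only keys that start with a configured prefix.
class PrefixFilter implements SimpleIter {
    private SimpleIter source;
    private String prefix;
    public void init(SimpleIter source, Map<String, String> options) {
        this.source = source;
        this.prefix = options.get("prefix");
    }
    public void seek(String start, boolean inclusive) {
        source.seek(start, inclusive); // seek the source, then find our first top
        skip();
    }
    public boolean hasTop() { return source.hasTop(); }
    public String getTopKey() { return source.getTopKey(); }
    public void next() { source.next(); skip(); }
    private void skip() {
        while (source.hasTop() && !source.getTopKey().startsWith(prefix)) source.next();
    }
}

class LifecycleSketch {
    // Scan in batches, throwing the iterator away after each batch and
    // rebuilding it from the last key handed to the client.
    static List<String> scan(List<String> keys, String prefix, int batchSize) {
        List<String> out = new ArrayList<>();
        String start = "";
        boolean inclusive = true;
        while (true) {
            PrefixFilter it = new PrefixFilter();                 // default constructor
            it.init(new ListSource(keys), Collections.singletonMap("prefix", prefix));
            it.seek(start, inclusive);                            // then seek()
            int n = 0;
            while (it.hasTop() && n < batchSize) {                // hasTop / getTopKey / next
                out.add(it.getTopKey());
                it.next();
                n++;
            }
            if (n < batchSize) return out;                        // source exhausted
            start = out.get(out.size() - 1);                      // last key returned
            inclusive = false;                                    // assumed exclusive restart
        }
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a1", "a2", "a3", "b1", "b2");
        // prints [a1, a2, a3]
        System.out.println(scan(keys, "a", 2));
    }
}
```

The result is the same for any batch size of 1 or more, which I take to be the point of 2): the iterator must not rely on state that survives only between next() calls.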
>> On Wed, Aug 22, 2012 at 1:41 PM, Cardon, Tejay E 
>> <tejay.e.cardon@lmco.com <mailto:tejay.e.cardon@lmco.com>> wrote:
>>
>> All,
>>
>> I'm interested in writing a custom iterator, and I've been looking 
>> for documentation on how to do so. Thus far, I've not been able to 
>> find anything beyond the java docs in SortedKeyValueIterator and a 
>> few other sub-classes. A few of the examples use Iterators, but 
>> provide no real info on how to properly implement one. Is there 
>> anywhere to find general guidance on the iterator stack?
>>
>> (If you're interested)
>>
>> Specifically, for those that are curious, I'm trying to implement 
>> something similar to the wikisearch example, but with some key 
>> differences. In my case, I've got a file with various attributes that 
>> are being indexed. So for each file there are 5 attributes, and each 
>> attribute has a fixed number of possible values. For example (totally 
>> made up):
>>
>> personID, gender, hair color, country, race, personRecord
>>
>> Row:binID; ColFam:Attribute_AttributeValue; ColQ:PersonID; Val:blank
>>
>> AND
>> Row:binID; ColFam:"D"; ColQ:personID; value:personRecord
>>
>> A typical query would be:
>>
>> Give me the personRecord for all people with:
>>
>> Gender: male &
>>
>> Hair color: blond or brown &
>>
>> Country: USA or England or china or korea &
>>
>> Race: white or oriental
>>
>> The existing Iterators used in the wikisearch example are unable to 
>> handle the "or" clauses in each attribute.
>>
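Put differently, the query is an AND across attributes of an OR within each attribute. A rough sketch of the result I am after, ignoring iterators entirely and using in-memory sets: the column-family strings and person IDs below are invented, and `query()` is a hypothetical helper, not anything from the wikisearch code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

class AndOfOrsSketch {
    // index: column family ("Attribute_AttributeValue") -> sorted personIDs in one bin.
    // groups: one list of acceptable column families per attribute.
    static SortedSet<String> query(Map<String, SortedSet<String>> index,
                                   List<List<String>> groups) {
        SortedSet<String> result = null;
        for (List<String> group : groups) {
            // OR within an attribute: union of the personIDs of every listed value.
            SortedSet<String> union = new TreeSet<>();
            for (String colFam : group) {
                union.addAll(index.getOrDefault(colFam, Collections.<String>emptySortedSet()));
            }
            // AND across attributes: keep only IDs present in every union.
            if (result == null) {
                result = union;
            } else {
                result.retainAll(union);
            }
        }
        return result == null ? new TreeSet<String>() : result;
    }

    public static void main(String[] args) {
        Map<String, SortedSet<String>> index = new HashMap<>();
        index.put("gender_male", new TreeSet<>(Arrays.asList("p1", "p2", "p3")));
        index.put("hair_blond", new TreeSet<>(Arrays.asList("p1")));
        index.put("hair_brown", new TreeSet<>(Arrays.asList("p3", "p4")));
        index.put("country_USA", new TreeSet<>(Arrays.asList("p1", "p3")));
        index.put("country_china", new TreeSet<>(Arrays.asList("p2")));

        List<List<String>> groups = Arrays.asList(
                Arrays.asList("gender_male"),
                Arrays.asList("hair_blond", "hair_brown"),
                Arrays.asList("country_USA", "country_England"));
        // prints [p1, p3]
        System.out.println(query(index, groups));
    }
}
```

The personIDs that come out would then be used to fetch the personRecord from the "D" column family in the same bin.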
>> The OrIterator doesn't appear to handle the possibility of more than 
>> one row per tablet.
>>
>> Thanks,
>>
>> Tejay Cardon
>>
