accumulo-user mailing list archives

From "Cardon, Tejay E" <tejay.e.car...@lmco.com>
Subject RE: EXTERNAL: Re: Custom Iterators
Date Thu, 23 Aug 2012 14:43:05 GMT
Excellent.  I'll have to look more closely at the wikisearch code then.  That should get me
most of the way to my solution.  Let me lay out the next piece of this, and please tell me
if doing this in an Iterator would make sense.

My actual "query" is more than just an Or-ing of index terms/values.  It's actually looking
for a probability of match.  So to expand on the earlier example:

personID, gender, hair color, country, race, personRecord

Row:binID; ColFam:Attribute_AttributeValue; ColQ:PersonID; Val:blank
AND
Row:binID; ColFam:"D"; ColQ:personID; value:personRecord

My query would be a lookup table where the lookupKey would be an attribute/attribute_value
combination and the lookupValue would be the score (probability) for that pair.
So something like:

Attribute       | Score
Gender:male     | 10
Gender:female   | 30
Hair:brown      | 30
Hair:blond      | 80

I intend to write an iterator that will use that lookup table as input (along with a threshold).
 My iterator would then return only those records where the sum of the scores is greater than
the threshold.  Because the lookup matrix is sparsely populated (no scores under 5 are included),
I would start with an ORing iterator that only returns records that contain at least one attribute
that has a score.  Then, for only those records that have at least some score, I would filter
out any that didn't reach the threshold.
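
As a rough illustration of the scoring step described above, here is a minimal sketch in plain
Java. It is not Accumulo iterator code: the IndexEntry class, the method names, the person IDs,
and the threshold of 50 are all hypothetical, and the lookup keys simply reuse the
"Attribute:value" spelling from the table. In a real iterator the same logic would sit behind
SortedKeyValueIterator and read the index entries from the tablet.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ThresholdScoreSketch {

    /** Hypothetical stand-in for one index entry: ColFam = attribute, ColQ = personID. */
    static final class IndexEntry {
        final String attribute;
        final String personId;

        IndexEntry(String attribute, String personId) {
            this.attribute = attribute;
            this.personId = personId;
        }
    }

    /** OR step: sum scores per person, ignoring attributes missing from the sparse table. */
    static Map<String, Integer> sumScores(List<IndexEntry> entries, Map<String, Integer> scores) {
        Map<String, Integer> totals = new TreeMap<>();
        for (IndexEntry e : entries) {
            Integer score = scores.get(e.attribute);
            if (score != null) {
                totals.merge(e.personId, score, Integer::sum);
            }
        }
        return totals;
    }

    /** Threshold step: keep only people whose summed score is strictly greater. */
    static List<String> aboveThreshold(Map<String, Integer> totals, int threshold) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> t : totals.entrySet()) {
            if (t.getValue() > threshold) {
                kept.add(t.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Scores taken from the example lookup table.
        Map<String, Integer> scores = new HashMap<>();
        scores.put("Gender:male", 10);
        scores.put("Gender:female", 30);
        scores.put("Hair:brown", 30);
        scores.put("Hair:blond", 80);

        // Made-up index entries for four people.
        List<IndexEntry> entries = Arrays.asList(
                new IndexEntry("Gender:male", "p1"),
                new IndexEntry("Hair:blond", "p1"),
                new IndexEntry("Gender:female", "p2"),
                new IndexEntry("Hair:brown", "p2"),
                new IndexEntry("Gender:male", "p3"),
                new IndexEntry("Hair:black", "p4")); // no score, so p4 never appears

        Map<String, Integer> totals = sumScores(entries, scores);
        System.out.println(aboveThreshold(totals, 50)); // prints [p1, p2]
    }
}
```

Here p1 totals 90 and p2 totals 60, so both pass; p3 totals 10 and is dropped at the threshold
step, while p4 is dropped earlier because none of its attributes has a score.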

One final iterator would sit at the top of the stack.  It would take the records which passed
the threshold, extract the actual document, run it through a more detailed filter, and return
as a final result only the records which pass this final filter.  
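
The top-of-stack step could look roughly like the following plain-Java sketch. Again, this is
not Accumulo API code: the class name, the in-memory map standing in for the "D" column family,
the comma-separated record format, and the example predicate are all assumptions made purely
for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class FinalFilterSketch {

    /**
     * For each candidate that passed the threshold, fetch its record
     * (hypothetically the value stored under ColFam "D") and keep it
     * only if the more detailed filter accepts it.
     */
    static List<String> finalResults(List<String> candidateIds,
                                     Map<String, String> records,
                                     Predicate<String> detailedFilter) {
        List<String> results = new ArrayList<>();
        for (String id : candidateIds) {
            String record = records.get(id);
            if (record != null && detailedFilter.test(record)) {
                results.add(record);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Made-up personRecords keyed by personID.
        Map<String, String> records = new HashMap<>();
        records.put("p1", "p1,male,blond,USA");
        records.put("p2", "p2,female,brown,England");

        // Hypothetical detailed filter: keep records whose last field is USA.
        Predicate<String> inUsa = r -> r.endsWith(",USA");

        System.out.println(finalResults(Arrays.asList("p1", "p2"), records, inUsa));
        // prints [p1,male,blond,USA]
    }
}
```

Only the records that survive this last check would be returned to the client, which is what
keeps the intermediate candidates off the network.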

The goal here is to keep all of the processing on the server side, and if possible, do it
all in one stack of iterators so as to avoid passing intermediate results across the network.

Is this a reasonable use of iterators?  Or am I taking an entirely inappropriate approach
to the problem?

Thanks,
Tejay Cardon  



-----Original Message-----
From: Josh Elser [mailto:josh.elser@gmail.com] 
Sent: Wednesday, August 22, 2012 6:04 PM
To: user@accumulo.apache.org
Subject: EXTERNAL: Re: Custom Iterators

Err, double (triple) reply:

No, you are incorrect. The wikisearch example can handle any arbitrary boolean expression
containing NOT, AND, and OR. As always, I'll preface it the same as Bill did: it *should*
be able to handle them :).

I know that cleaning-up/reworking the Wikisearch code is in the works. 
I'm just not positive about the timeframe.

As far as examples, I'd push you to the write-up Eric did after benchmarking the wikisearch
example: 
http://accumulo.apache.org/example/wikisearch.html

He has some example queries that give the basic idea behind what's supported (minus the NOTs).

On 08/22/2012 05:27 PM, Cardon, Tejay E wrote:
>
> Josh,
>
> Thanks for getting back to me so quickly. I explained in my lengthy 
> reply to William that the comment on OrIterator.TermSource.compareTo 
> indicates that implementations with more than one row per tablet need 
> to compare row key first (and that is not being done in this code). It 
> may be that it's not an issue and I'm simply misunderstanding 
> something. As for the wikisearch example, as I understood it, it could 
> only handle searches for "anded" terms. If that's not the case, then 
> an example of an or search would be helpful. In any case, I'd love a 
> deeper dive on the wikisearch somewhere. I get the source code and a 
> high level explanation of what's happening, but I'd love a tutorial or 
> something that walks through the classes and explains how each one 
> contributes to the functionality. Don't consider that a request (that 
> would be a lot more to ask than I'm willing to ask), but I would 
> certainly find it useful if it does exist.
>
> Thanks,
>
> Tejay
>
> *From:*Josh Elser [mailto:josh.elser@gmail.com]
> *Sent:* Wednesday, August 22, 2012 2:53 PM
> *To:* user@accumulo.apache.org
> *Subject:* EXTERNAL: Re: Custom Iterators
>
> What makes you say that the OrIterator cannot handle more than one row 
> per tablet? Can you provide details?
>
> AFAIK, the OrIterator should work correctly in all cases (e.g. 
> regardless of row distribution in a tablet). Any issues in the code 
> that prevent it from doing so would be a bug that should be fixed.
>
> Also, the wikisearch example supports indexing over multiple 
> attributes (and I believe indexes document metadata in addition to the 
> tokenized document). Is there something unclear that could be better 
> documented?
>
> On 8/22/12 4:41 PM, Cardon, Tejay E wrote:
>
>     All,
>
>     I'm interested in writing a custom iterator, and I've been looking
>     for documentation on how to do so. Thus far, I've not been able to
>     find anything beyond the java docs in SortedKeyValueIterator and a
>     few other sub-classes. A few of the examples use Iterators, but
>     provide no real info on how to properly implement one. Is there
>     anywhere to find general guidance on the iterator stack?
>
>     (If you're interested)
>
>     Specifically, for those that are curious, I'm trying to implement
>     something similar to the wikisearch example, but with some key
>     differences. In my case, I've got a file with various attributes
>     that are being indexed. So for each file there are 5 attributes, and
>     each attribute has a fixed number of possible values. For example
>     (totally made up):
>
>     personID, gender, hair color, country, race, personRecord
>
>     Row:binID; ColFam:Attribute_AttributeValue; ColQ:PersonID; Val:blank
>
>     AND
>     Row:binID; ColFam:"D"; ColQ:personID; value:personRecord
>
>     A typical query would be:
>
>     Give me the personRecord for all people with:
>
>     Gender: male &
>
>     Hair color: blond or brown &
>
>     Country: USA or England or china or korea &
>
>     Race: white or oriental
>
>     The existing Iterators used in the wikisearch example are unable
>     to handle the "or" clauses in each attribute.
>
>     The OrIterator doesn't appear to handle the possibility of more than
>     one row per tablet.
>
>     Thanks,
>
>     Tejay Cardon
>