accumulo-user mailing list archives

From "Cardon, Tejay E" <tejay.e.car...@lmco.com>
Subject RE: EXTERNAL: Re: Custom Iterators
Date Thu, 23 Aug 2012 14:59:31 GMT
Marc,
Thanks for the writeup.  It is by far the most comprehensive info I've seen on iterators,
and was very helpful to me.  A couple notes/questions:

You mention that SortedKeyValueIterator implements FileSKVIterator.  I've only looked at the
1.4.1 source, but it appears that the opposite is true.

You also mention that iterators get their source from the init method, but some (like OrIterator)
seem to throw exceptions from that method.  Where do they get their source data, and what are
the API implications of having iterators that reject init (or deep copy, for that matter)?

Final thought.  If I want to stack several iterators, what's the best way to go about that?
In other words, I'd like an iterator that I write to be the source to another iterator that
I've written, which in turn may feed yet another that I've written.  Preferably, I'd like
each to be independently reusable, so I don't want to build that stacking into the source
of any of the iterators themselves.  Is that possible, or would I need some sort of iterator
factory that builds the stacks and then acts as an interface to the fully formed stack?
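
The stacking question above can be illustrated with a small stand-alone sketch. It is not Accumulo code: MiniIterator, ListSource, PrefixFilter, UpperCase, and StackSketch are hypothetical names invented for illustration, and entries are plain strings rather than Key/Value pairs. In Accumulo itself this wiring is normally driven by configuration (each iterator is attached with a priority, for example through an IteratorSetting passed to addScanIterator, and the next-lower iterator is handed to init as the source). The sketch only mimics that idea so that each iterator stays independently reusable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified stand-in for SortedKeyValueIterator (not the Accumulo API).
interface MiniIterator {
    void init(MiniIterator source); // receive the iterator directly below this one
    boolean hasTop();
    String getTop();
    void next();
}

// Bottom of the stack: serves entries from memory instead of tablet files.
class ListSource implements MiniIterator {
    private final Iterator<String> entries;
    private String top;
    ListSource(List<String> data) { entries = data.iterator(); next(); }
    public void init(MiniIterator source) { } // nothing sits below the data
    public boolean hasTop() { return top != null; }
    public String getTop() { return top; }
    public void next() { top = entries.hasNext() ? entries.next() : null; }
}

// Reusable filter: passes only entries that start with the given prefix.
class PrefixFilter implements MiniIterator {
    private final String prefix;
    private MiniIterator source;
    PrefixFilter(String prefix) { this.prefix = prefix; }
    public void init(MiniIterator source) { this.source = source; skip(); }
    private void skip() {
        while (source.hasTop() && !source.getTop().startsWith(prefix)) source.next();
    }
    public boolean hasTop() { return source.hasTop(); }
    public String getTop() { return source.getTop(); }
    public void next() { source.next(); skip(); }
}

// Reusable transform: upper-cases whatever its source emits.
class UpperCase implements MiniIterator {
    private MiniIterator source;
    public void init(MiniIterator source) { this.source = source; }
    public boolean hasTop() { return source.hasTop(); }
    public String getTop() { return source.getTop().toUpperCase(); }
    public void next() { source.next(); }
}

class StackSketch {
    // One configured layer: a priority plus the iterator to place at that level.
    static final class Layer {
        final int priority;
        final MiniIterator iterator;
        Layer(int priority, MiniIterator iterator) {
            this.priority = priority;
            this.iterator = iterator;
        }
    }

    // Wire layers in ascending priority; the lowest priority sits directly on the data.
    static MiniIterator build(MiniIterator bottom, List<Layer> layers) {
        List<Layer> sorted = new ArrayList<>(layers);
        sorted.sort(Comparator.comparingInt(l -> l.priority));
        MiniIterator top = bottom;
        for (Layer layer : sorted) {
            layer.iterator.init(top);
            top = layer.iterator;
        }
        return top;
    }

    // Read every entry from the top of the stack.
    static List<String> drain(MiniIterator it) {
        List<String> out = new ArrayList<>();
        while (it.hasTop()) {
            out.add(it.getTop());
            it.next();
        }
        return out;
    }

    // Layers are listed out of order on purpose; priorities decide the stacking.
    static List<String> demo() {
        MiniIterator data = new ListSource(Arrays.asList("apple", "avocado", "banana"));
        List<Layer> layers = Arrays.asList(
                new Layer(20, new UpperCase()),
                new Layer(10, new PrefixFilter("a")));
        return drain(build(data, layers));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Neither PrefixFilter nor UpperCase knows what sits below it, so the same classes can be recombined in a different order or with other layers without changing their source.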

Thanks,
Tejay
From: Marc Parisi [mailto:marc@accumulo.net]
Sent: Wednesday, August 22, 2012 5:33 PM
To: user@accumulo.apache.org
Subject: EXTERNAL: Re: Custom Iterators

Here's a quick write up

    http://www.accumulo.net/node/1
On Wed, Aug 22, 2012 at 8:03 PM, Josh Elser <josh.elser@gmail.com> wrote:
Err, double (triple) reply:

No, you are incorrect. The wikisearch example can handle any arbitrary boolean expression
containing NOT, AND, and OR. As always, I'll preface it the same as Bill did: it *should*
be able to handle them :).

I know that cleaning-up/reworking the Wikisearch code is in the works. I'm just not positive
about the timeframe.

As far as examples, I'd push you to the write-up Eric did after benchmarking the wikisearch
example: http://accumulo.apache.org/example/wikisearch.html

He has some example queries that give the basic idea behind what's supported (minus the NOTs).

On 08/22/2012 05:27 PM, Cardon, Tejay E wrote:

Josh,

Thanks for getting back to me so quickly. I explained in my lengthy reply to William that
the comment on OrIterator.TermSource.compareTo indicates that implementations with more than
one row per tablet need to compare row key first (and that is not being done in this code).
It may be that it's not an issue and I'm simply misunderstanding something. As for the wikisearch
example, as I understood it, it could only handle searches for "anded" terms. If that's not
the case, then an example of an or search would be helpful. In any case, I'd love a deeper
dive on the wikisearch somewhere. I get the source code and a high level explanation of what's
happening, but I'd love a tutorial or something that walks through the classes and explains
how each one contributes to the functionality. Don't consider that a request (that would be
a lot more to ask than I'm willing to ask), but I would certainly find it useful if it does
exist.
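
The concern about comparing the row key first can be illustrated with a small stand-alone sketch. TermTop below is a hypothetical stand-in for the current top key of a term source; it is not the actual OrIterator.TermSource, and the row and document ID values are made up. The sketch only shows why, when a tablet holds more than one row, ordering by document ID alone would interleave rows, whereas comparing the row first keeps the merged output in key order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

class RowFirstSketch {
    // Hypothetical stand-in for the top key of one term source.
    static final class TermTop implements Comparable<TermTop> {
        final String row;
        final String docId;
        TermTop(String row, String docId) {
            this.row = row;
            this.docId = docId;
        }
        // Compare the row first; fall back to the document ID only within a row.
        @Override
        public int compareTo(TermTop other) {
            int byRow = row.compareTo(other.row);
            return byRow != 0 ? byRow : docId.compareTo(other.docId);
        }
    }

    // Pop the sources from a priority queue, as an OR-style merge would.
    static List<String> mergeOrder(List<TermTop> tops) {
        PriorityQueue<TermTop> queue = new PriorityQueue<>(tops);
        List<String> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            TermTop next = queue.poll();
            order.add(next.row + "/" + next.docId);
        }
        return order;
    }

    static List<String> demo() {
        return mergeOrder(Arrays.asList(
                new TermTop("bin2", "doc1"),
                new TermTop("bin1", "doc9"),
                new TermTop("bin1", "doc3")));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Had compareTo looked only at docId, bin2/doc1 would have been returned before the bin1 entries, which is the out-of-order result the comment in the source warns about.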

Thanks,

Tejay

From: Josh Elser [mailto:josh.elser@gmail.com]
Sent: Wednesday, August 22, 2012 2:53 PM
To: user@accumulo.apache.org
Subject: EXTERNAL: Re: Custom Iterators


What makes you say that the OrIterator cannot handle more than one row per tablet? Can you
provide details?

AFAIK, the OrIterator should work correctly in all cases (e.g. regardless of row distribution
in a tablet). Any issues in the code that prevent it from doing so would be a bug that should
be fixed.

Also, the wikisearch example supports indexing over multiple attributes (and I believe indexes
document metadata in addition to the tokenized document). Is there something unclear that
could be better documented?

On 8/22/12 4:41 PM, Cardon, Tejay E wrote:

    All,

    I'm interested in writing a custom iterator, and I've been looking
    for documentation on how to do so. Thus far, I've not been able to
    find anything beyond the java docs in SortedKeyValueIterator and a
    few other sub-classes. A few of the examples use Iterators, but
    provide no real info on how to properly implement one. Is there
    anywhere to find general guidance on the iterator stack?

    (If you're interested)

    Specifically, for those that are curious, I'm trying to implement
    something similar to the wikisearch example, but with some key
    differences. In my case, I've got a file with various attributes
    that are being indexed. So for each file there are 5 attributes, and
    each attribute has a fixed number of possible values. For example
    (totally made up):

    personID, gender, hair color, country, race, personRecord

    Row:binID; ColFam:Attribute_AttributeValue; ColQ:personID; Value:blank

    AND

    Row:binID; ColFam:"D"; ColQ:personID; Value:personRecord

    A typical query would be:

    Give me the personRecord for all people with:

    Gender: male &

    Hair color: blond or brown &

    Country: USA or England or China or Korea &

    Race: white or oriental

    The existing Iterators used in the wikisearch example are unable
    to handle the "or" clauses in each attribute.

    The OrIterator doesn't appear to handle the possibility of more than
    one row per tablet.
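
The query shape described above (OR within an attribute, AND across attributes) can be sketched over an in-memory model of a single bin. This is only an illustration of the intended semantics, not an Accumulo iterator: AttributeQuerySketch, the map-based index, and the person IDs are all hypothetical, with map keys mirroring the Attribute_AttributeValue column family and the sets mirroring the personID column qualifiers.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

class AttributeQuerySketch {
    // Build a sorted set of person IDs.
    static SortedSet<String> ids(String... personIds) {
        return new TreeSet<>(Arrays.asList(personIds));
    }

    // index:  "attribute_value" -> person IDs that carry that value (one bin).
    // wanted: attribute -> acceptable values (OR within, AND across attributes).
    static SortedSet<String> query(Map<String, SortedSet<String>> index,
                                   Map<String, List<String>> wanted) {
        SortedSet<String> result = null;
        for (Map.Entry<String, List<String>> clause : wanted.entrySet()) {
            // OR: union the IDs for every acceptable value of this attribute.
            SortedSet<String> union = new TreeSet<>();
            for (String value : clause.getValue()) {
                SortedSet<String> matches = index.get(clause.getKey() + "_" + value);
                if (matches != null) union.addAll(matches);
            }
            // AND: keep only IDs that also satisfied every earlier attribute.
            if (result == null) result = union;
            else result.retainAll(union);
        }
        return result == null ? new TreeSet<String>() : result;
    }

    static SortedSet<String> demo() {
        Map<String, SortedSet<String>> index = new HashMap<>();
        index.put("gender_male", ids("p1", "p2", "p3"));
        index.put("hair_blond", ids("p1"));
        index.put("hair_brown", ids("p3", "p4"));
        index.put("country_USA", ids("p1", "p2"));
        index.put("country_England", ids("p3"));

        Map<String, List<String>> wanted = new LinkedHashMap<>();
        wanted.put("gender", Arrays.asList("male"));
        wanted.put("hair", Arrays.asList("blond", "brown"));
        wanted.put("country", Arrays.asList("USA", "England"));
        return query(index, wanted);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The surviving person IDs would then be used to fetch the personRecord entries from the "D" column family of the same bin.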

    Thanks,

    Tejay Cardon

