accumulo-user mailing list archives

From "Sukant Hajra" <qn2b6c2...@snkmail.com>
Subject more questions about IndexedDocIterators
Date Sun, 15 Jul 2012 23:05:26 GMT
Hi all,

I have a mixed bag of questions to follow up on an earlier post inquiring about
intersecting iterators now that I've done some prototyping:


1. Do FamilyIntersectingIterators work in 1.3.4?
------------------------------------------------

Does anyone know if FamilyIntersectingIterators were useable as far back as
1.3.4?  Or am I wasting my time on them at this old version (and need to
upgrade)?

I got a prototype of IndexedDocIterators working with Accumulo 1.4.1, but
currently have a hung thread in my attempt to use a FamilyIntersectingIterator
with Cloudbase 1.3.4.  Also, I noticed the API changed somewhat to remove some
oddly designed static configuration.

If FamilyIntersectingIterators were buggy, were there sufficient work-arounds
to get some use out of them in 1.3.4?

Unfortunately, I need to jump through some political/social hoops to upgrade,
but if it's got to be done, then I'll do what I have to.


2. Is this approach reasonable?
-------------------------------

We're trying to be clever with our use of indexed docs.  We're less interested
in searching over a large corpus of data in parallel, and more interested in
doing some server-side joins in a data-local way (to reduce client burden and
network traffic).  So we're heavily "sharding" our documents (billions of
shards) and using range constraints on the iterator to home in on exactly one
shard (new Range(shardId, shardId)).

Let me give you a sense for what we're doing.  In one use case, we're using
document-indexed iterators to accommodate both per-author and by-time accesses
of a per-document commit log.  So we're sharding by document ID (and we have
billions of documents).  Then we use the author ID as terms for each commit
(one term per commit entry).  We use a reverse timestamp for the doc type, so
we get back these entries in reverse time order.  In this way, we can scan the
log for the entire document by time with plain iterators, and for a specific
author with a document-indexed iterator (with a server-side join to the commit
log entry).  Later on, we may index the log by other features with this
approach.
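
To make the reverse-timestamp idea concrete, here is a minimal sketch in plain
Java (no Accumulo classes).  It assumes timestamps are non-negative epoch
milliseconds and that the doc type is compared lexicographically; the class and
method names are hypothetical, chosen only for illustration.

```java
public class ReverseTimestamp {

    // Encode as fixed-width, zero-padded hex of (Long.MAX_VALUE - millis).
    // Larger timestamps give smaller strings, so newer entries sort first.
    static String encode(long millis) {
        if (millis < 0) {
            throw new IllegalArgumentException("timestamp must be non-negative");
        }
        return String.format("%016x", Long.MAX_VALUE - millis);
    }

    // Recover the original timestamp from its encoded form.
    static long decode(String encoded) {
        return Long.MAX_VALUE - Long.parseLong(encoded, 16);
    }

    public static void main(String[] args) {
        String older = encode(1342393526000L);
        String newer = encode(1342393527000L);
        // The newer entry compares lower, so it is returned first by a scan.
        System.out.println(newer.compareTo(older) < 0);
    }
}
```

The fixed width matters: without zero padding, strings of different lengths
would not sort in numeric order.  The output is also plain ASCII with no null
characters, so it does not collide with the \0 separators in the key layout.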

Is this strategy sane?  Is there precedent for doing it?  Is there a better
alternative?


3. Compressed reverse-timestamp using Unicode tricks?
------------------------------------------------------

I see code in Accumulo like

    // We're past the index column family, so return a term that will sort
    // lexicographically last.  The last unicode character should suffice
    return new Text("\uFFFD");

which gets me thinking that I can probably pull off an impressively compressed,
but still lexically ordered, reverse timestamp using Unicode trickery to get a
gigantic radix.  Is there any precedent for this?  I'm a little worried about
running into corner cases with Unicode encoding.  Otherwise, I think it feels
like a simple algorithm that may not eat up much CPU in translation and might
save disk space at scale.

Or is this optimizing into the noise given compression Accumulo already does
under the covers?
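
One corner case of the kind worried about above can be shown in plain Java.
The sketch below compares U+FFFD with the first supplementary character,
U+10000, two ways: as Java strings (UTF-16 code units) and as UTF-8 bytes
compared unsigned, which is how byte-sorted keys would see them.  The class and
helper names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class UnicodeOrderCheck {

    // Lexicographic comparison of byte arrays, treating bytes as unsigned.
    static int compareUnsignedBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        String bmp = "\uFFFD";
        String supplementary = new String(Character.toChars(0x10000));

        // UTF-16 order: U+10000 is a surrogate pair starting at 0xD800.
        int utf16 = Integer.signum(bmp.compareTo(supplementary));

        // UTF-8 byte order: follows code point order.
        int utf8 = Integer.signum(compareUnsignedBytes(
                bmp.getBytes(StandardCharsets.UTF_8),
                supplementary.getBytes(StandardCharsets.UTF_8)));

        System.out.println("UTF-16 order: " + utf16 + ", UTF-8 order: " + utf8);
    }
}
```

The two orders disagree for this pair, so a large-radix scheme that is tested
with String.compareTo could sort differently once stored as bytes.  It also
shows that U+FFFD is not the last code point in UTF-8 byte order.  Staying
inside the Basic Multilingual Plane and skipping the surrogate range
(U+D800 to U+DFFF) would avoid this particular problem.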


4. Response from IndexedDocIterator not reflecting documentation
----------------------------------------------------------------

I got back results in my prototype that don't line up with the documentation
for an IndexedDocIterator.  For example, here's some data I put into a test
table:

    r:"shardId", cf:"e\0docType", cq:"docId", value:"content"
    r:"shardId", cf:"i", cq:"term\0docType\0docId\0docInfo", value:[]

This is as per the documentation of IndexedDocIterator.java.  What I believe I
should have gotten back from an intersecting iteration was:

    r:"shardId", cf:"i", cq:"docType\0docId\0docInfo", value:"content"

but instead, the column qualifier I actually got was formatted differently:

    r:"shardId", cf:"i", cq:"\0docType\0docId\0", value:"content"

The document info wasn't returned at all, and the column qualifier was
suspiciously prefixed with a null character.
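
The difference is easier to see when the qualifiers are split on the null
separator.  This is only a sketch in plain Java, not Accumulo's own parsing
code; the class and method names are made up for illustration.

```java
import java.util.Arrays;

public class IndexQualifierFields {

    // Split a column qualifier on the null separator, keeping empty fields
    // (the -1 limit preserves trailing empty strings).
    static String[] fields(String columnQualifier) {
        return columnQualifier.split("\0", -1);
    }

    public static void main(String[] args) {
        // Qualifier as written to the index: term, docType, docId, docInfo.
        String[] written = fields("term\0docType\0docId\0docInfo");
        // Qualifier as actually returned by the iterator.
        String[] returned = fields("\0docType\0docId\0");

        System.out.println(written.length + " fields written");
        System.out.println(returned.length + " fields returned");
        System.out.println(Arrays.asList(returned).indexOf("docType"));
    }
}
```

Split this way, the returned qualifier has the same number of fields as the
index qualifier, with the term and docInfo slots left empty.  Whether that is
the iterator's intent is a guess from the output, not something the source
confirms.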

This isn't so horrible, because I didn't have plans to use the document info
anyway.  Actually, I was curious what people were using it for.

Based upon my read of the source code for IndexedDocIterator#parseDocID, I'm
not sure how the document info could possibly be parsed.  I feel the info part
of the index is truly discarded in code.

I can provide sample code if people doubt the integrity of my prototype.  It's
just not compact in its current form.

Mostly, I want to confirm that this behavior is not due to a user error on my
part.


5. Why not do intersecting iteration of a single term?
------------------------------------------------------

The API throws an exception if you search for only a single term.  Especially
given our strategy of using doc-indexing for server-side joining (question 2
above), it seems like supporting a single-term lookup makes sense.  Also, with
the dynamism of user interaction, you don't always know up front how many terms
a user is interested in anyway.

As a workaround, I'm putting in a dummy term with a not-flag.  But this seems
silly to me.  Am I missing the larger picture or abusing the API?


Thanks for the help,
Sukant
