accumulo-user mailing list archives

From Josh Elser
Subject Re: strategies beyond intersecting iterators?
Date Sun, 01 Jul 2012 23:27:07 GMT
I had started a response but Bill beat me to it, so let me reiterate.

The tear-down is more for assuring responsiveness when multiple scans 
are happening at one time. There's a buffer between the TabletServer(s) 
and the client; once it's filled (if memory serves), the scan session 
is a candidate to be torn down and later recreated.

To avoid duplicate work by your Accumulo iterators, the last key the 
iterators returned is maintained by Accumulo.

For example, if you started a scan with a Range:

(-inf, +inf)

Say you scanned 2000 of the 10000 keys in a table of monotonically 
increasing Keys where only the row is populated. The buffer filled, the 
iterators were torn down, and they were re-created some time later. 
Instead of getting the (-inf, +inf) range again, you would then get the 
range:

(2000, +inf)

Meaning, the initial infinite start key would be replaced with a start 
key which was the last key your previous scan returned, non-inclusive.
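
To make the re-seek concrete, here is a minimal sketch in plain Java. It 
does not use any Accumulo classes; the class name ScanResumeSketch, the 
scanAll method, and the bufferSize parameter are all made up for 
illustration, and a TreeSet of integers stands in for the table. It only 
shows the idea that each rebuilt stack starts strictly after the last key 
returned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Plain-Java simulation of the behaviour described above; not Accumulo code.
class ScanResumeSketch {
    // How many times the (hypothetical) iterator stack was built in the last scan.
    static int stackBuilds = 0;

    // Scans the whole table, tearing the "session" down each time the buffer fills.
    // bufferSize is an assumed stand-in for the server-to-client buffer and must be > 0.
    static List<Integer> scanAll(NavigableSet<Integer> table, int bufferSize) {
        stackBuilds = 0;
        List<Integer> results = new ArrayList<>();
        Integer lastReturned = null; // null plays the role of the -inf start key
        while (true) {
            stackBuilds++;
            // A rebuilt stack sees (lastReturned, +inf): the start key is non-inclusive.
            NavigableSet<Integer> remaining = (lastReturned == null)
                    ? table
                    : table.tailSet(lastReturned, false);
            int sent = 0;
            for (Integer key : remaining) {
                if (sent == bufferSize) {
                    break; // buffer is full, so this session gets torn down
                }
                results.add(key);
                lastReturned = key;
                sent++;
            }
            if (sent < bufferSize) {
                return results; // ran out of keys before filling the buffer
            }
        }
    }

    public static void main(String[] args) {
        NavigableSet<Integer> table = new TreeSet<>();
        for (int i = 1; i <= 10000; i++) {
            table.add(i);
        }
        List<Integer> out = scanAll(table, 2000);
        System.out.println(out.size() + " keys returned, stack built " + stackBuilds + " times");
    }
}
```

With keys 1 through 10000 and a buffer of 2000, the second session in this 
sketch starts at the first key after 2000, which matches the (2000, +inf) 
range above.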

In short, it's good practice to try to keep Accumulo iterators from 
holding on to state in memory, otherwise you may get stuck creating the 
same in-memory members on your iterators repeatedly. See ACCUMULO-625 
for some thoughts about trying to avoid this lost-state issue.
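
As a rough illustration of the cost, the sketch below counts how many 
source keys get read when every rebuilt stack first loads the remaining 
range into an in-memory TreeSet, compared with simply streaming the keys. 
It is plain Java, not Accumulo code; StatefulRebuildSketch, its methods, 
and bufferSize are hypothetical names, and it assumes for simplicity that 
the session ends once the remaining keys fit in one buffer.

```java
import java.util.NavigableSet;
import java.util.TreeSet;

// Plain-Java simulation of rebuilding in-memory iterator state; not Accumulo code.
class StatefulRebuildSketch {
    // Total keys read when each rebuilt stack buffers the whole remaining
    // range into a TreeSet before emitting. bufferSize must be > 0.
    static long keysReadBuffering(NavigableSet<Integer> table, int bufferSize) {
        long read = 0;
        Integer lastReturned = null; // null plays the role of the -inf start key
        while (true) {
            NavigableSet<Integer> remaining = (lastReturned == null)
                    ? table
                    : table.tailSet(lastReturned, false);
            // The in-memory state: lost on tear-down, so it is rebuilt here every time.
            TreeSet<Integer> state = new TreeSet<>(remaining);
            read += state.size();
            if (state.size() <= bufferSize) {
                return read; // everything left fits in one buffer, scan finishes
            }
            // Emit one buffer's worth of keys, then the session is torn down.
            int sent = 0;
            for (Integer key : state) {
                if (sent == bufferSize) {
                    break;
                }
                lastReturned = key;
                sent++;
            }
        }
    }

    // A stateless, streaming iterator reads each key exactly once.
    static long keysReadStreaming(NavigableSet<Integer> table) {
        return table.size();
    }

    public static void main(String[] args) {
        NavigableSet<Integer> table = new TreeSet<>();
        for (int i = 1; i <= 10000; i++) {
            table.add(i);
        }
        System.out.println("buffering: " + keysReadBuffering(table, 2000));
        System.out.println("streaming: " + keysReadStreaming(table));
    }
}
```

In this model the buffering variant re-reads the tail of the range after 
every tear-down, so the wasted work grows with the size of the range.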

- Josh

On 07/01/2012 05:18 PM, William Slacum wrote:
> By iterator stack I am referring to the Accumulo iterators. Resource 
> sharing among scan sessions is implemented by destroying a user scan 
> session and eventually recreating the iterator stack. The new stack is 
> then seek'd to the last key returned by the entire stack. If you were 
> holding some state, such as a set of keys, it would be rebuilt every 
> time the stack is created.
> On Jul 1, 2012 5:55 PM, "Sukant Hajra" wrote:
>     Excerpts from William Slacum's message of Thu Jun 28 16:04:32
>     -0500 2012:
>     >
>     > You're pretty much on the spot regarding two aspects about the
>     > current IntersectingIterator:
>     >
>     > 1- It's not really extensible (there are hooks for building doc IDs,
>     > but you still need the same `partition term: docId` key structure)
>     > 2- Its main strength is that it can do the merges of sorted lists of
>     > doc IDs based on equality expressions (ie, `author=="bob" and
>     > day=="20120627"`)
>     >
>     > Fortunately, the logic isn't very complicated for re-creating the
>     > merging stuff. Personally, I think it's easy enough to separate the
>     > logic of joining N streams of iterator results from the actual
>     > scanning. Unfortunately, this would be left up to you to do at the
>     > moment :)
>     >
>     > You could do range searches by consuming sets of values and sorting
>     > all of the docIds in that range by throwing them into a TreeSet. That
>     > would let you emit doc IDs in a globally sorted order for the given
>     > range of terms.
>     I understand everything above, I think.  Thanks for the prompt reply.
>     > This can get problematic if the range ends up being very large
>     > because your iterator stack may periodically be destroyed and rebuilt.
>     This particular statement confused me.  When you said TreeSet, you're
>     talking about a straight-forward in-memory collection from java.util or
>     similar, right?  Because I'm confused about which "iterator stack may
>     periodically be destroyed and rebuilt."  It sounds like we're talking
>     about some garbage collection specific to Accumulo.  Am I missing
>     something here?
>     -Sukant
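
For reference, the merge of sorted doc ID lists mentioned in the quoted 
message can be sketched in plain Java as below. This is not the 
IntersectingIterator source; SortedIntersectSketch and intersect are 
made-up names, and it assumes both inputs are already sorted and free of 
duplicates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of intersecting two sorted doc ID lists; not Accumulo code.
class SortedIntersectSketch {
    // Returns the doc IDs present in both lists. Both inputs must be sorted ascending.
    static List<String> intersect(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) {
                out.add(a.get(i)); // same doc ID on both sides: a match
                i++;
                j++;
            } else if (cmp < 0) {
                i++; // a is behind, advance it
            } else {
                j++; // b is behind, advance it
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical doc IDs for author=="bob" and day=="20120627".
        List<String> bob = Arrays.asList("doc1", "doc3", "doc5");
        List<String> day = Arrays.asList("doc2", "doc3", "doc5", "doc7");
        System.out.println(intersect(bob, day));
    }
}
```

Because each step only advances one of the two cursors, the merge needs no 
state beyond the current positions, which is what makes it friendly to 
being torn down and re-seeked.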
