accumulo-user mailing list archives

From Eugene Cheipesh <echeip...@gmail.com>
Subject Re: Seeking Iterator
Date Mon, 12 Jan 2015 20:23:34 GMT
That is the idea I am playing with, optimizing subsequent calls to .next vs .reseek. 

Using the VersioningIterator as an example, I was able to get reseek working; the trick turned
out to be checking getSource.hasTop a little more carefully. Thank you for that pointer.
Disappointingly enough, there does not appear to be a huge difference in performance compared
with a Filter that performs the same checking without seeking forward.

I am attempting to use Accumulo tracing to come up with an explanation, but results are spotty.
Query runtime is ~2.3s for 200k entries and I am only capturing a tiny fraction of that through
tracing. I am using AccumuloInputFormat to pull results into a Spark job. Would I get better
results if I were creating the BatchScanner directly? Otherwise, what would be the best method
to debug the iterator and query performance?

-- 
Eugene Cheipesh

From: Russ Weeks <rweeks@newbrightidea.com>
Reply: user@accumulo.apache.org <user@accumulo.apache.org>
Date: January 9, 2015 at 11:32:13 PM
To: user@accumulo.apache.org <user@accumulo.apache.org>
Subject: Re: Seeking Iterator

On Fri, Jan 9, 2015 at 7:56 PM, Christopher <ctubbsii@apache.org> wrote:
Another optimization you can try: instead of always seeking to the computed next, you can
advance internally inside your iterator by calling its source's next method a few times. If
you don't reach the next element that you would have sought to in some reasonable number of
iterations, you can then seek. This is also a strategy that is hard to optimize: do I need
to advance, on average, 3 or 20 or 10000000 keys? How many before it would have been more
efficient to just seek? There's no easy answer. Experimentation helps.
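
The "try next a few times, then seek" strategy described above can be sketched as follows. This is only an illustration, assuming a plain sorted in-memory list of integer keys in place of a real iterator source; the class NextThenSeek, its methods, and the maxNexts threshold are hypothetical names and are not part of the Accumulo API.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a sorted key source; NOT the Accumulo iterator API. */
class NextThenSeek {
    private final List<Integer> keys; // assumed sorted ascending, no duplicates
    private int pos = 0;
    int nextCalls = 0; // how many cheap single-step advances were made
    int seekCalls = 0; // how many (assumed more expensive) seeks were made

    NextThenSeek(List<Integer> sortedKeys) {
        this.keys = sortedKeys;
    }

    boolean hasTop() {
        return pos < keys.size();
    }

    int top() {
        return keys.get(pos);
    }

    void next() {
        pos++;
        nextCalls++;
    }

    /** Position on the first key >= target using a binary search. */
    void seek(int target) {
        int idx = Collections.binarySearch(keys, target);
        pos = idx >= 0 ? idx : -idx - 1;
        seekCalls++;
    }

    /** Step forward at most maxNexts times; fall back to seek if still short of target. */
    void advanceTo(int target, int maxNexts) {
        for (int i = 0; i < maxNexts && hasTop() && top() < target; i++) {
            next();
        }
        if (hasTop() && top() < target) {
            seek(target);
        }
    }
}
```

The nextCalls and seekCalls counters are there so that different maxNexts values can be compared on representative data, which is the experimentation the paragraph above recommends.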

The VersioningIterator has a good example of this approach: https://github.com/apache/accumulo/blob/901d60ef1cf72c2d55c90746fce94e108a992d3b/core/src/main/java/org/apache/accumulo/core/iterators/user/VersioningIterator.java#L95

-Russ
 


--
Christopher L Tubbs II
http://gravatar.com/ctubbsii

On Fri, Jan 9, 2015 at 6:54 PM, Eugene Cheipesh <echeipesh@gmail.com> wrote:
That would work well enough and is my next choice.

The thought was: rows are stored in increasing order, so as long as I know when I have walked
off the edge and flag the iterator as empty, it'd be good. I'm just chasing the optimal
in this case, but if it doesn't exist, oh well.
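
A minimal sketch of the "flag the iterator as empty once it walks off the edge" idea, assuming a sorted in-memory list of keys and an inclusive range end. RangeBoundedCursor and jumpTo are hypothetical names used for illustration only; this is not how Accumulo's SortedKeyValueIterator is implemented.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for one iterator instance over sorted keys; NOT the Accumulo API. */
class RangeBoundedCursor {
    private final List<Long> keys; // assumed sorted ascending, no duplicates
    private final long rangeEnd;   // inclusive end of the range this instance may read
    private int pos;
    private boolean exhausted = false;

    RangeBoundedCursor(List<Long> sortedKeys, long rangeStart, long rangeEnd) {
        this.keys = sortedKeys;
        this.rangeEnd = rangeEnd;
        this.pos = lowerBound(rangeStart);
        checkBounds();
    }

    boolean hasTop() {
        return !exhausted;
    }

    long top() {
        if (exhausted) {
            throw new IllegalStateException("no top key");
        }
        return keys.get(pos);
    }

    /** Jump forward to the first key >= candidate, or mark the cursor empty. */
    void jumpTo(long candidate) {
        if (exhausted) {
            return;
        }
        if (candidate > rangeEnd) {
            exhausted = true; // candidate is past the edge: do not seek at all
            return;
        }
        pos = Math.max(pos, lowerBound(candidate)); // never move backwards
        checkBounds();
    }

    /** After any move, verify there is still a key and that it lies inside the range. */
    private void checkBounds() {
        if (pos >= keys.size() || keys.get(pos) > rangeEnd) {
            exhausted = true;
        }
    }

    private int lowerBound(long target) {
        int idx = Collections.binarySearch(keys, target);
        return idx >= 0 ? idx : -idx - 1;
    }
}
```

Callers check hasTop() after every jumpTo() before reading top(), so running past the range end shows up as an empty cursor rather than an exception.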

Thank you for the reference link, it’s very helpful. 

-- 
Eugene Cheipesh

From: Russ Weeks <rweeks@newbrightidea.com>
Reply: user@accumulo.apache.org <user@accumulo.apache.org>
Date: January 9, 2015 at 6:48:47 PM
To: user@accumulo.apache.org <user@accumulo.apache.org>
Subject: Re: Seeking Iterator

Hi, Eugene,

I think the conventional approach is to decompose your search area (bounding box?) into a
set of scan ranges that introduce minimal extraneous curve segments, and then pass all those
scan ranges into a BatchScanner. The excellent Accumulo Recipes site has an example[1]. Does
this approach not work for you?
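
To illustrate the decomposition described above, here is a rough sketch that splits a 2D bounding box into contiguous Z-curve ranges by walking a quadtree. It is not the QuadTreeHelper code from Accumulo Recipes; the class ZRangeDecomposer, the bit layout (x in the even bit positions), and the integer grid coordinates are all assumptions made for the example. In practice each resulting [start, end] pair would be turned into one scan range for a BatchScanner.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: decomposes a box on a 2^bits x 2^bits grid into Z-curve ranges. */
class ZRangeDecomposer {
    /** Interleave the low `bits` bits of x and y; x takes the even bit positions. */
    static long zEncode(int x, int y, int bits) {
        long z = 0;
        for (int i = 0; i < bits; i++) {
            z |= ((long) ((x >> i) & 1)) << (2 * i);
            z |= ((long) ((y >> i) & 1)) << (2 * i + 1);
        }
        return z;
    }

    /** Returns inclusive [start, end] Z ranges that exactly cover [xMin..xMax] x [yMin..yMax]. */
    static List<long[]> decompose(int xMin, int yMin, int xMax, int yMax, int bits) {
        List<long[]> out = new ArrayList<>();
        split(0, 0, bits, xMin, yMin, xMax, yMax, bits, out);
        return out;
    }

    // Visits the quadtree cell with origin (cx, cy) and side length 2^level.
    private static void split(int cx, int cy, int level,
                              int xMin, int yMin, int xMax, int yMax,
                              int bits, List<long[]> out) {
        int size = 1 << level;
        int cxMax = cx + size - 1;
        int cyMax = cy + size - 1;
        if (cx > xMax || cxMax < xMin || cy > yMax || cyMax < yMin) {
            return; // cell is entirely outside the box
        }
        if (cx >= xMin && cxMax <= xMax && cy >= yMin && cyMax <= yMax) {
            // Cell is entirely inside: it maps to one contiguous run of Z values.
            long start = zEncode(cx, cy, bits);
            long end = start + (long) size * size - 1;
            if (!out.isEmpty() && out.get(out.size() - 1)[1] + 1 == start) {
                out.get(out.size() - 1)[1] = end; // merge with the adjacent previous range
            } else {
                out.add(new long[] {start, end});
            }
            return;
        }
        // Partial overlap: recurse into the four children in Z order.
        int half = size / 2;
        split(cx, cy, level - 1, xMin, yMin, xMax, yMax, bits, out);
        split(cx + half, cy, level - 1, xMin, yMin, xMax, yMax, bits, out);
        split(cx, cy + half, level - 1, xMin, yMin, xMax, yMax, bits, out);
        split(cx + half, cy + half, level - 1, xMin, yMin, xMax, yMax, bits, out);
    }
}
```

This version recurses down to single cells, so it produces no extraneous curve segments but possibly many ranges. Stopping the recursion at a coarser level and emitting the whole partially covered cell would trade fewer ranges for some entries that then have to be filtered out.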

In general, your custom iterator should never try to seek to a row id different from the current
row id, because that row could be hosted by a different tablet server.

-Russ

1: https://github.com/calrissian/accumulo-recipes/blob/master/store/geospatial-store/src/main/java/org/calrissian/accumulorecipes/geospatialstore/support/QuadTreeHelper.java#L33

On Fri, Jan 9, 2015 at 2:37 PM, Eugene Cheipesh <echeipesh@gmail.com> wrote:
Hello,

I am attempting to write an Iterator based on a Z-curve index to search through multi-dimensional
data. Essentially, given a record I have encountered that is inside the index range but outside
the multi-dimensional query range, I have a way to generate the next candidate record, potentially
far ahead of the current point.

Ideally I would be able to refine my search range with subsequent calls to seek(). It appears
that Accumulo will create an iterator for every RFile (or some other split point). The
beginning of the range argument to seek will be the record at the beginning of this split (which
is good), however all instances of the iterator have the same, global range end (which is
bad).

I need to avoid the case where I seek past the range boundary of each individual iterator
instance and throw a NullPointerException. Is there any way to get enough information to achieve
this?

Thank you,

-- 
Eugene Cheipesh



