accumulo-user mailing list archives

From Chris Bennight <>
Subject Re: Seeking Iterator
Date Sat, 10 Jan 2015 04:00:48 GMT
Russ's solution is generally the way to go.

In order to implement your original methodology, you would need to query
accumulo for information about the key ranges associated with each split
point, decompose your "quadtree" until you ended up with a quadrant that
was completely encapsulated within a particular split (i.e., within its
min/max), and then have an iterator be responsible for the further
decomposition/refinement (you would need to serialize and send a portion of
your query range to the iterator, as well as the decomposition logic).
There would then be cases where some of the quadrants cross splits, so you
would end up with duplicate decomposition work (i.e., a bounding box is
likely to have discontinuities in it when translated to a 1-D range).

In the end, all you would be saving is the up-front decomposition cost,
which is generally trivial unless you start talking about very high
precision (64-bit geohashes for 2 dimensions, etc.). In that case the
number of ranges you generate would be a bit obscene anyway, so you would
want to heuristically roll them up.
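As a rough sketch of that roll-up heuristic (the names here are hypothetical, not from Accumulo or any particular library): given the generated 1-D index ranges sorted by start, merge neighbors whose gap is below a threshold, trading a few extraneous index entries for far fewer ranges handed to the scanner.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of heuristically rolling up 1-D index ranges; names are hypothetical. */
public class RangeRollup {

    /** An inclusive 1-D index range [start, end]. */
    static class IndexRange {
        final long start, end;
        IndexRange(long start, long end) { this.start = start; this.end = end; }
    }

    /**
     * Merge adjacent ranges (sorted by start) whose gap is at most maxGap.
     * The merged ranges cover some extra index entries (which the client
     * must filter out), but far fewer ranges are sent to the scanner.
     */
    static List<IndexRange> rollUp(List<IndexRange> sorted, long maxGap) {
        List<IndexRange> merged = new ArrayList<>();
        for (IndexRange r : sorted) {
            if (!merged.isEmpty()
                    && r.start - merged.get(merged.size() - 1).end <= maxGap) {
                // Gap is small enough: extend the previous range instead of
                // emitting a new one.
                IndexRange last = merged.remove(merged.size() - 1);
                merged.add(new IndexRange(last.start, Math.max(last.end, r.end)));
            } else {
                merged.add(r);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<IndexRange> ranges = new ArrayList<>(List.of(
                new IndexRange(0, 10),    // gap of 2 to the next range: merged
                new IndexRange(12, 20),   // gap of 80 to the next: kept separate
                new IndexRange(100, 110)));
        for (IndexRange r : rollUp(ranges, 5)) {
            System.out.println(r.start + "-" + r.end);
        }
    }
}
```

The threshold trades scan-side filtering work for client/scheduler overhead; where to set it depends on how expensive extraneous entries are to skip.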

In order to optimize accumulo input splits, we tackled issues similar to
this problem (optimizing index ranges and RFile split points), and you can
see some of the logic here:


Also, assuming this is geospatial in nature, you should probably have at
least one test case for date line handling: either document that it's not
supported, or decide how you want to handle a bounding box that crosses
the dateline (typically some variation of breaking it down into two or
more bounding areas, one on either side, and combining the results).
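That split can be sketched as follows (the BBox type is hypothetical, not tied to any library): treat a query whose min longitude exceeds its max as wrapping the antimeridian, cut it into two ordinary boxes at ±180, query both, and union the results.

```java
import java.util.List;

/** Sketch of dateline handling for a lon/lat bounding box; BBox is hypothetical. */
public class DatelineSplit {

    static class BBox {
        final double minLon, minLat, maxLon, maxLat;
        BBox(double minLon, double minLat, double maxLon, double maxLat) {
            this.minLon = minLon; this.minLat = minLat;
            this.maxLon = maxLon; this.maxLat = maxLat;
        }
        @Override public String toString() {
            return "[" + minLon + "," + minLat + " .. " + maxLon + "," + maxLat + "]";
        }
    }

    /**
     * A box with minLon > maxLon is taken to wrap across the dateline;
     * split it into two ordinary boxes, one on each side. The caller
     * queries both and combines (unions) the results.
     */
    static List<BBox> splitAtDateline(BBox q) {
        if (q.minLon <= q.maxLon) {
            return List.of(q); // no wrap: query as-is
        }
        return List.of(
                new BBox(q.minLon, q.minLat, 180.0, q.maxLat),
                new BBox(-180.0, q.minLat, q.maxLon, q.maxLat));
    }

    public static void main(String[] args) {
        // 170E .. 170W crosses the dateline and becomes two boxes.
        for (BBox b : splitAtDateline(new BBox(170, -10, -170, 10))) {
            System.out.println(b);
        }
    }
}
```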

On Fri, Jan 9, 2015 at 6:54 PM, Eugene Cheipesh <> wrote:

> That would work well enough and is my next choice.
>  The thought was: rows are stored in increasing order, so as long as I
> knew when I walked off the edge and flagged the iterator as empty, it’d be
> fine. I’m just chasing the optimal in this case, but if it doesn’t exist,
> oh well.
> Thank you for the reference link, it’s very helpful.
> --
> Eugene Cheipesh
> From: Russ Weeks <>
> Reply: <>
> Date: January 9, 2015 at 6:48:47 PM
> To: <>
> Subject:  Re: Seeking Iterator
>  Hi, Eugene,
> I think the conventional approach is to decompose your search area
> (bounding box?) into a set of scan ranges that introduce minimal extraneous
> curve segments, and then pass all those scan ranges into a BatchScanner.
> The excellent Accumulo Recipes site has an example[1]. Does this approach
> not work for you?
> In general, your custom iterator should never try to seek to a row id
> different from the current row id, because that row could be hosted by a
> different tablet server.
> -Russ
> 1:
> On Fri, Jan 9, 2015 at 2:37 PM, Eugene Cheipesh <>
> wrote:
>>  Hello,
>>  I am attempting to write an Iterator based on a Z-curve index to search
>> through multi-dimensional data. Essentially, given a record I have
>> encountered that is in the index range but not in the multi-dimensional
>> query range, I have a way to generate the next candidate record,
>> potentially far ahead of the current point.
>>  Ideally I would be able to refine my search range with subsequent calls
>> to seek(). It appears that Accumulo will create an iterator for every RFile
>> (or some other split point). The beginning of the range argument to
>> seek() will be the record at the beginning of this split (which is good);
>> however, all instances of the iterator have the same, global range end
>> (which is bad).
>>  I need to avoid the case where I seek past the range boundary of an
>> individual iterator instance and trigger a NullPointerException. Is there
>> any way to get enough information to achieve this?
>>  Thank you,
>>  --
>> Eugene Cheipesh
