accumulo-user mailing list archives

From Chris Bennight <>
Subject Re: Z-Curve/Hilbert Curve
Date Fri, 25 Jul 2014 22:51:38 GMT
A couple of related issues come up when considering implementing a
dimensionality-reducing encoding -- I just want to toss those out to see what
people think the interface might look like.

There are a couple of aspects that could be brought in here, but let's keep it
simple and consider the original question: (lat/lon) -> number.

--Desired precision of the binning process
The more bits we add to the z-curve, the more precise our comparison - i.e.
a 63-bit key would have more "locations" to sort by than a 24-bit key.
Would you see a reasonable default getting picked, make this user
configurable, or both? (i.e. default to a value, with extended options via a
new constructor?)
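To make the precision question concrete, here's a rough sketch of a Z-order (Morton) encoding with a configurable bit count per dimension. All class and method names here are made up for illustration - this is not an actual Accumulo API, just one way the knob could look:

```java
// Illustrative sketch only: bin each coordinate into 2^bits buckets, then
// interleave the bits. More bits per dimension = finer spatial binning.
public class ZCurveSketch {
    /** Normalize a coordinate in [min, max] to an unsigned bucket of `bits` bits. */
    static long normalize(double value, double min, double max, int bits) {
        long buckets = 1L << bits;
        long bin = (long) (((value - min) / (max - min)) * buckets);
        return Math.min(bin, buckets - 1); // clamp the max edge into the last bucket
    }

    /** Interleave the low `bits` bits of x and y; y takes the higher bit of each pair. */
    static long interleave(long x, long y, int bits) {
        long z = 0;
        for (int i = 0; i < bits; i++) {
            z |= ((x >>> i) & 1L) << (2 * i);
            z |= ((y >>> i) & 1L) << (2 * i + 1);
        }
        return z;
    }

    /** Encode a WGS84 lon/lat pair into a single Z-value with `bits` bits per dimension. */
    static long encode(double lon, double lat, int bits) {
        long x = normalize(lon, -180.0, 180.0, bits);
        long y = normalize(lat, -90.0, 90.0, bits);
        return interleave(x, y, bits);
    }

    public static void main(String[] args) {
        // Same point at two precisions: nearby points that collide at 12 bits
        // per dimension may separate at 31 bits per dimension.
        System.out.println(encode(-77.0, 38.9, 12)); // 24-bit key
        System.out.println(encode(-77.0, 38.9, 31)); // 62-bit key
    }
}
```

A default of 31 bits per dimension (62-bit key) with a constructor overload taking the bit count would cover both the "sane default" and "user configurable" options.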

--Semantics for turning two lat/long pairs into a range
I'm extrapolating here, but the only reason I can see that locality matters
is if we want to preserve it for range searches.  The internal
implementation of the encoding/lexicoding process is going to directly
impact the implementation of the range query.
Now sure, someone could encode the lower left point, encode the upper right
point, and construct a range out of that to pass for a scan, but that's
going to be wildly inefficient in most cases: if we just lexicode the lower
left and upper right corners, we traverse across the entire curve - hitting
lots of areas that aren't actually in the original query window.
Instead, we can turn a single 2D range into a set of 1D ranges.  There is
some potential tuning here, as the algorithm has a tradeoff between the time
to compute the ranges (and the number of ranges) vs.  "slop"  (the inclusion
of ranges which aren't actually in the original query).
Would you see a static method perhaps on the z-curve lexicoder that returns
a series of ranges based on an input window?  Some other mechanism?
And in the case of "slop" - would we just document that the ranges could
include values outside the query window, or would we always fully decompose?
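One way that decomposition could work (purely illustrative - the names here are invented, not a proposed API) is a recursive quadtree walk: emit a quad's contiguous Z-range when it's fully inside the window, skip it when disjoint, and otherwise recurse until a depth budget is spent. Whatever is emitted at the budget limit is exactly the "slop":

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: decompose a 2D window into Z-value ranges. An aligned
// 2^k x 2^k quad occupies one contiguous run of 4^k Z-values, which is what
// makes this work.
public class ZRangeSketch {
    public static final class Range {
        public final long low, high;
        Range(long low, long high) { this.low = low; this.high = high; }
    }

    static long interleave(long x, long y, int bits) {
        long z = 0;
        for (int i = 0; i < bits; i++) {
            z |= ((x >>> i) & 1L) << (2 * i);
            z |= ((y >>> i) & 1L) << (2 * i + 1);
        }
        return z;
    }

    /** Decompose window [qx0..qx1] x [qy0..qy1] (inclusive cell coords) into Z-ranges. */
    public static List<Range> decompose(long qx0, long qy0, long qx1, long qy1,
                                        int bits, int maxDepth) {
        List<Range> out = new ArrayList<>();
        visit(0, 0, bits, 0, qx0, qy0, qx1, qy1, bits, maxDepth, out);
        return out;
    }

    // Quad with origin (x0, y0) and side 2^level. Recurse until fully in/out
    // or the depth budget is spent; budget-limited quads become "slop".
    private static void visit(long x0, long y0, int level, int depth,
                              long qx0, long qy0, long qx1, long qy1,
                              int bits, int maxDepth, List<Range> out) {
        long side = 1L << level;
        long x1 = x0 + side - 1, y1 = y0 + side - 1;
        if (x1 < qx0 || x0 > qx1 || y1 < qy0 || y0 > qy1) return; // disjoint: skip
        boolean contained = x0 >= qx0 && x1 <= qx1 && y0 >= qy0 && y1 <= qy1;
        if (contained || depth == maxDepth || level == 0) {
            long z = interleave(x0, y0, bits);
            out.add(new Range(z, z + side * side - 1)); // aligned quad = contiguous Z run
            return;
        }
        long half = side >>> 1;
        visit(x0, y0, level - 1, depth + 1, qx0, qy0, qx1, qy1, bits, maxDepth, out);
        visit(x0 + half, y0, level - 1, depth + 1, qx0, qy0, qx1, qy1, bits, maxDepth, out);
        visit(x0, y0 + half, level - 1, depth + 1, qx0, qy0, qx1, qy1, bits, maxDepth, out);
        visit(x0 + half, y0 + half, level - 1, depth + 1, qx0, qy0, qx1, qy1, bits, maxDepth, out);
    }

    public static void main(String[] args) {
        // On a 4x4 grid, the window (1,1)-(2,2) straddles the curve's central
        // seam, so a single [zMin, zMax] scan would sweep most of the grid.
        for (Range r : decompose(1, 1, 2, 2, 2, 8))
            System.out.println("[" + r.low + ", " + r.high + "]");
    }
}
```

A small maxDepth gives few ranges with lots of slop; a large one fully decomposes into many exact ranges - which is the tuning tradeoff mentioned above, and it maps naturally onto a static method returning scan ranges for an input window.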


The other, "making it more complicated" questions would probably revolve
around generalizing this to a multi-dimensional lexicoder.   I would expect
the WGS84 (lat/long) encoder would just be a 2D instance with -180/+180,
-90/+90 bounds.   There are probably completely different cases where it
would be useful to have a locality-sensitive hash.
But the big caveat is that this requires more methods - I now need a way
of defining a range for each of the dimensions (which before were fixed
as lat/lon), and I need an interface/class to pass those dimension
definitions to the encoder - and I should be able to use those same
definitions to decode my values on the way out.
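As a sketch of what that generalization might look like (again, every name here is hypothetical, not a proposed Accumulo interface): a small dimension-definition class carrying per-dimension bounds, used both to encode and to decode, with WGS84 falling out as a 2D instance:

```java
// Illustrative sketch: a generic multi-dimensional Z-order encoder where each
// dimension carries its own bounds. WGS84 is just a 2D instance of this.
public class MultiDimSketch {
    public static final class Dimension {
        final double min, max;
        public Dimension(double min, double max) { this.min = min; this.max = max; }

        /** Bin a value in [min, max] into one of 2^bits buckets. */
        long bin(double v, int bits) {
            long buckets = 1L << bits;
            return Math.min((long) (((v - min) / (max - min)) * buckets), buckets - 1);
        }

        /** Decode a bucket back to its midpoint - the same definition works both ways. */
        double center(long bin, int bits) {
            double width = (max - min) / (1L << bits);
            return min + (bin + 0.5) * width;
        }
    }

    /** Interleave one bit from each dimension per round, most significant first. */
    public static long encode(Dimension[] dims, double[] point, int bits) {
        long z = 0;
        for (int b = bits - 1; b >= 0; b--)
            for (int d = 0; d < dims.length; d++)
                z = (z << 1) | ((dims[d].bin(point[d], bits) >>> b) & 1L);
        return z;
    }

    public static void main(String[] args) {
        // WGS84 as a 2D instance with lon/lat bounds.
        Dimension[] wgs84 = { new Dimension(-180, 180), new Dimension(-90, 90) };
        System.out.println(encode(wgs84, new double[] {-77.0, 38.9}, 16));
    }
}
```

The same Dimension array would get passed to both the encoder and the range-decomposition method, so the definitions only live in one place.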

I'm not trying to make a mountain out of a molehill - one could definitely
pick some sane defaults, document behaviour, and put in a WGS84-specific
implementation.  I'm just thinking about the minimum viable bit that's
needed for this to be useful - and I suspect that includes the range
decomposition piece, as I suspect *everyone* would really need that (if
they cared about locality in the first place).

My main question here would be to what extent you see this going: a
one-off for WGS84, or a more generic multi-dimensional lexicoder?  And in
either case, what are the thoughts/consensus on implementing the
additional methods (and possibly classes) required?

On Fri, Jul 25, 2014 at 12:37 PM, Sean Busbey <> wrote:

> On Fri, Jul 25, 2014 at 8:45 AM, Anthony F <> wrote:
>> GeoMesa plans to update to 1.6 as soon as we get past a 1.0 RELEASE
>> (which is waiting on approval by the Eclipse Foundation's IP review process
>> and should happen in a month or so).  I think we could break out a z-curve
>> lexicoder and contribute before then though.  I'll poke around at the
>> lexicoder stuff in 1.6 and see what it would take to use that interface now.
> That sounds great Anthony. Let us know if we can help with anything!
> --
> Sean
