accumulo-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Corey Nolet <>
Subject Re: Z-Curve/Hilbert Curve
Date Mon, 28 Jul 2014 21:10:43 GMT
On Calrissian, we had once considered making a lexicoder for lat/long that
really transformed the two-dimensions (lat/lon) down into a geohash based
on the z-curve.

The reason we decided against a first class data-type for this is exactly
the same reason that Anthony brings up in his previous comment- the geohash
only makes sense as a query tool when you have a good way to structure (at
least as many as possible) the non-sequential ranges you would need in
order to find all the overlapping regions (with minimal as possible amount
of false-positives).

On Mon, Jul 28, 2014 at 2:00 PM, Jared Winick <> wrote:

> As several people have commented, a single range for a query can produce a
> lot of false positives that need to be filtered out. I had made this
> visualization a while back that lets you interactively (click-drag a
> bounding box) see this behavior.
> On Sun, Jul 27, 2014 at 2:15 PM, Anthony Fox <> wrote:
>> My first thought was just something simple for a first pass - lat/lon ->
>> a single lexicographic dimension -  as it would cover most basic use
>> cases.  Precision (number of bits in encoding) could be an arg or a config
>> variable.  For WITHIN/INTERSECTS topological predicates, we need to
>> decompose the query geometry into the (possibly/probably non-contiguous) 1D
>> ranges that cover the region in question.  GeoMesa has an algorithm to
>> quickly perform this decomposition that computes the minimum number of
>> geohash prefixes at different resolutions to fully cover the query
>> polygon.  Basically, it recursively traverses through levels of geohash
>> resolutions, prioritizing rectangles that intersect the region and
>> discarding non-intersecting rectangles at the lowest precisions, to produce
>> a sequence of ranges to scan over.  Fully contained rectangles are
>> discovered at their lowest resolution at which point the algorithm pops the
>> stack and searches the next prioritized prefix.  I think something like
>> this would definitely need to be ported over and included in a lexicoder
>> implementation to make it useful.  Also, rather than materialize the entire
>> set of ranges in memory, we can either return a lazy iterator of prefixes
>> that can be fed into a scanner in batches or we can have a short-circuit
>> config that tunes the amount of slop that's tolerable and cuts off the
>> traversal at a certain level of precision.  GeoMesa uses something like the
>> former on attribute indexes to coordinate parallel scanners on separate
>> index and record tables.
>> Thoughts?  I'm inclined to keep the implementation to the bare minimum
>> necessary for the basic use cases (lat/lon and bbox queries) though I do
>> think a general dimensionality reducing lexicoder would be very useful.
>> On Fri, Jul 25, 2014 at 6:51 PM, Chris Bennight <>
>> wrote:
>>> A couple of related issues come up when considering implementing a
>>> dimensionality reducing encoding -- just want to toss those out to see what
>>> people think the interface might look like.
>>> There's a couple of aspects that could be brought in here, but lets keep
>>> it simple and considering the original question: (lat/lon) -> number.
>>> --Desired precision of the binning process
>>> The more bits we add to the z-curve, the more precise our comparison -
>>> i.e. a 63 bit key would have more "locations" to sort by than a 24 bit key.
>>> Would you see a reasonable default getting picked, make this user
>>> configurable, or both? (i.e. default to a value, extended options with a
>>> new constructor?)
>>> --Semantics for turning two lat/long pairs into a range
>>> I'm extrapolating here, but the only reason I see that locality matters
>>> is if we want to preserve locality for range searches.  The internal
>>> implementation of the encoding/lexicoding process is going to directly
>>> impact the implementation of the range query.
>>> Now sure, someone could encode the lower left point, encode the upper
>>> right point, and construct a range out of that to pass for a scan, but
>>> that's going to be wildly inefficient in most cases. See:
>>> If we just lexicode the lower left and upper right we traverse across
>>> the entire curve - hitting lots of areas that aren't actually in the
>>> original range.
>>> Now we can turn a single 2D range into a set of 1D ranges.   There is
>>> some potential tuning here now, as the algorithm has a tradeoff on time to
>>> compute the ranges (and number of ranges) vs.  "slop"  (or inclusion of
>>> ranges which aren't actually in the original query).
>>> Would you see a static method perhaps on the z-curve lexicoder that
>>> returns a series of ranges based on an input window?  Some other mechanism?
>>> And in the case of "slop" - would we just document that the ranges could
>>> actually include values not expected - or would we always fully decompose?
>>> ------------------
>>> The other, "making it more complicated" questions would probably resolve
>>> around generalizing this to a multi-dimensional lexicoder.   I would expect
>>> the WGS84 (lat/long) encoder would just be a 2D instance with -180/+180
>>> -90/+90 bounds.   There are probably completely different cases where it
>>> would be useful to have a locality sensitive hash.
>>> But with the big caveat that this requires more methods - I now need a
>>> way of defining a range for each of the dimensions (which before were
>>> defined as lat/lon), and I need an interface/class to pass those dimensions
>>> to the encoder - and should be able to use those same definitions to decode
>>> my values on the way out.
>>> I'm not trying to make a mountain out of a molehill - one could
>>> definitely pick some sane defaults, document behaviour, and put in a WGS84
>>> specific implementation.  I'm just thinking what's the minimum viable bit
>>> that's needed for this to be useful - and I suspect it's also the range
>>> decomposition piece - as I suspect *everyone* would really need that (if
>>> they cared about locality in the first place).
>>> My main question here would be to what extent would you see this going?
>>>  A one off for WGS84, or a more generic multi-dimensional lexicoder;  and
>>> in either case, what would the thoughts/consensus be for implementation of
>>> the additional methods (and possibly classes) required?
>>> On Fri, Jul 25, 2014 at 12:37 PM, Sean Busbey <>
>>> wrote:
>>>> On Fri, Jul 25, 2014 at 8:45 AM, Anthony F <> wrote:
>>>>> GeoMesa plans to update to 1.6 as soon as we get past a 1.0 RELEASE
>>>>> (which is waiting on approval by the Eclipse Foundation's IP review process
>>>>> and should happen in a month or so).  I think we could break out a z-curve
>>>>> lexicoder and contribute before then though.  I'll poke around at the
>>>>> lexicoder stuff in 1.6 and see what it would take to use that interface
>>>> That sounds great Anthony. Let us know if we can help with anything!
>>>> --
>>>> Sean

View raw message