accumulo-user mailing list archives

From Chris Bennight <ch...@slowcar.net>
Subject Re: Z-Curve/Hilbert Curve
Date Mon, 28 Jul 2014 23:59:57 GMT
Anthony -
So that sounds reasonable - it's sounding like the consensus is a basic
WGS84 (lat/long) type (point only), a range query (lower left, upper right),
and a full-decomposition method (i.e. only intersecting ranges).  (Caveat
here: we probably want to cap the precision, as a full decomposition of,
say, a 31/31 bit z-curve might take a long time.)

What's the performance of your query like for a 24 bit/24 bit (x,y) z-curve
on a full decomposition over a significant portion of the range?  I'm
thinking, in the interest of keeping it simple/easy, it might be best just
to fix the precision and document it.  24 bits at the equator would still
give us a resolution of ~2m (the worst case), which should be good enough
for most things (basically 0.00002 degrees would be the max granularity for
longitude).

If it's fast enough I would just cap at 64 (well, maybe 63 for signed
issues) total bits to keep the key as a long - 31 bits/31 bits gives a
longitudinal granularity of ~0.00000017 degrees, or 1.9cm at the equator.  I
think a full decomposition on a 31/31 bit grid might be expensive.

(The values above assume we normalize the input values - which I think is a
given for this implementation?)

If we can fix the precision then that takes out most of the variables and
it's pretty straightforward - I think decomp performance is the main
limiting factor.
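
For concreteness, the encode step I'm picturing is just plain bit
interleaving after normalization - a rough sketch along these lines
(illustrative names only, not a proposed API):

// Sketch: interleave two 31-bit dimension values into a 62-bit Morton (z-order)
// key that sorts the same way the curve does.  Assumes WGS84 bounds and that we
// normalize inputs up front, per the discussion above.
public class ZOrderSketch {

  // Map a coordinate onto a 31-bit cell index (illustrative normalization).
  static long normalize(double value, double min, double max, int bits) {
    double scaled = (value - min) / (max - min);   // 0.0 .. 1.0
    long cells = 1L << bits;
    long cell = (long) (scaled * cells);
    return Math.min(cell, cells - 1);              // clamp the upper edge
  }

  // Bit i of x goes to bit 2i, bit i of y goes to bit 2i+1; 62 bits total,
  // so the result stays positive in a signed long.
  static long interleave(long x, long y) {
    long z = 0;
    for (int i = 0; i < 31; i++) {
      z |= ((x >>> i) & 1L) << (2 * i);
      z |= ((y >>> i) & 1L) << (2 * i + 1);
    }
    return z;
  }

  public static void main(String[] args) {
    long x = normalize(-77.05, -180.0, 180.0, 31);  // longitude
    long y = normalize(38.85, -90.0, 90.0, 31);     // latitude
    System.out.println(interleave(x, y));
  }
}

If we do make the precision configurable, the bits argument and the loop
bound would be driven by the same setting.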

Jared -
Yep, exactly that - which would be unintuitive behaviour (getting back
ranges that didn't really match).  That said, an inclusive-only range set
can be generated; you can model it as a quadtree decomposition and proceed
recursively if your current region ("quad") isn't fully covered by your
selection window - otherwise return the range of your current "quad".
 Which, I think, is essentially what Anthony described in his
implementation (though with a stack instead of recursion; probably for the
best there).
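
To make that concrete, a bare-bones version of that recursion might look
like the following (illustrative only - not GeoMesa's actual implementation):

import java.util.ArrayList;
import java.util.List;

// Sketch: quadtree-style decomposition of a query window into contiguous
// z-order ranges.  A quad is only emitted when it is fully covered by the
// query window or when we hit the finest cell level.
public class QuadDecomposeSketch {

  static class Box {
    final double minX, minY, maxX, maxY;
    Box(double minX, double minY, double maxX, double maxY) {
      this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
    }
    boolean contains(Box o) {
      return minX <= o.minX && minY <= o.minY && maxX >= o.maxX && maxY >= o.maxY;
    }
    boolean intersects(Box o) {
      return minX < o.maxX && o.minX < maxX && minY < o.maxY && o.minY < maxY;
    }
  }

  // zPrefix is the interleaved path to this quad; a quad at depth d covers a
  // contiguous block of 4^(maxDepth - d) cells on the curve.
  static void decompose(Box quad, long zPrefix, int depth, int maxDepth,
                        Box query, List<long[]> ranges) {
    if (!query.intersects(quad)) {
      return;                                      // prune non-intersecting quads
    }
    if (query.contains(quad) || depth == maxDepth) {
      int shift = 2 * (maxDepth - depth);
      long start = zPrefix << shift;
      ranges.add(new long[] { start, start + (1L << shift) - 1 });
      return;                                      // fully covered (or finest cell)
    }
    double midX = (quad.minX + quad.maxX) / 2;
    double midY = (quad.minY + quad.maxY) / 2;
    // children in z order: 00=(low x,low y), 01=(high x,low y),
    //                      10=(low x,high y), 11=(high x,high y)
    decompose(new Box(quad.minX, quad.minY, midX, midY), zPrefix << 2,     depth + 1, maxDepth, query, ranges);
    decompose(new Box(midX, quad.minY, quad.maxX, midY), zPrefix << 2 | 1, depth + 1, maxDepth, query, ranges);
    decompose(new Box(quad.minX, midY, midX, quad.maxY), zPrefix << 2 | 2, depth + 1, maxDepth, query, ranges);
    decompose(new Box(midX, midY, quad.maxX, quad.maxY), zPrefix << 2 | 3, depth + 1, maxDepth, query, ranges);
  }
}

maxDepth is the precision cap from above - the only cells that can still
contribute false positives are the boundary cells emitted at the finest
level.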


Corey -
So you think the lexicoder would make sense/be reasonable if there is a
performant range-query tool to go with it (one that results in 0 false
positives)?



On Mon, Jul 28, 2014 at 5:10 PM, Corey Nolet <cjnolet@gmail.com> wrote:

> On Calrissian, we had once considered making a lexicoder for lat/long that
> really transformed the two dimensions (lat/lon) down into a geohash based
> on the z-curve.
>
> The reason we decided against a first-class data type for this is exactly
> the same reason that Anthony brings up in his previous comment - the geohash
> only makes sense as a query tool when you have a good way to structure the
> non-sequential ranges (or at least as many of them as possible) that you
> would need in order to find all the overlapping regions (with as few false
> positives as possible).
>
>
> On Mon, Jul 28, 2014 at 2:00 PM, Jared Winick <jaredwinick@gmail.com>
> wrote:
>
>> As several people have commented, a single range for a query can produce
>> a lot of false positives that need to be filtered out. I had made this
>> visualization a while back that lets you interactively (click-drag a
>> bounding box) see this behavior.
>>
>> http://bl.ocks.org/jaredwinick/5073432
>>
>>
>>
>>
>> On Sun, Jul 27, 2014 at 2:15 PM, Anthony Fox <anthonyfox@ccri.com> wrote:
>>
>>> My first thought was just something simple for a first pass - lat/lon ->
>>> a single lexicographic dimension -  as it would cover most basic use
>>> cases.  Precision (number of bits in encoding) could be an arg or a config
>>> variable.  For WITHIN/INTERSECTS topological predicates, we need to
>>> decompose the query geometry into the (possibly/probably non-contiguous) 1D
>>> ranges that cover the region in question.  GeoMesa has an algorithm to
>>> quickly perform this decomposition that computes the minimum number of
>>> geohash prefixes at different resolutions to fully cover the query
>>> polygon.  Basically, it recursively traverses through levels of geohash
>>> resolutions, prioritizing rectangles that intersect the region and
>>> discarding non-intersecting rectangles at the lowest precisions, to produce
>>> a sequence of ranges to scan over.  Fully contained rectangles are
>>> discovered at their lowest resolution at which point the algorithm pops the
>>> stack and searches the next prioritized prefix.  I think something like
>>> this would definitely need to be ported over and included in a lexicoder
>>> implementation to make it useful.  Also, rather than materialize the entire
>>> set of ranges in memory, we can either return a lazy iterator of prefixes
>>> that can be fed into a scanner in batches or we can have a short-circuit
>>> config that tunes the amount of slop that's tolerable and cuts off the
>>> traversal at a certain level of precision.  GeoMesa uses something like the
>>> former on attribute indexes to coordinate parallel scanners on separate
>>> index and record tables.
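
(As a concrete sketch of the batches-into-a-scanner idea against the 1.6
client API - the range iterator, table name, and batch size below are
assumptions, not existing GeoMesa code:)

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.lexicoder.LongLexicoder;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class BatchedZScan {

  // zRanges is the (assumed) lazy iterator of [start, end] z-values produced
  // by the decomposition; "geo_table" and the batch size of 100 are placeholders.
  static void scan(Connector conn, Iterator<long[]> zRanges) throws Exception {
    LongLexicoder lex = new LongLexicoder();
    while (zRanges.hasNext()) {
      List<Range> batch = new ArrayList<Range>();
      while (zRanges.hasNext() && batch.size() < 100) {
        long[] r = zRanges.next();
        batch.add(new Range(new Text(lex.encode(r[0])), new Text(lex.encode(r[1]))));
      }
      BatchScanner bs = conn.createBatchScanner("geo_table", Authorizations.EMPTY, 4);
      try {
        bs.setRanges(batch);
        for (Map.Entry<Key, Value> entry : bs) {
          // exact geometry filtering / hand-off to the caller would go here
        }
      } finally {
        bs.close();
      }
    }
  }
}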
>>>
>>> Thoughts?  I'm inclined to keep the implementation to the bare minimum
>>> necessary for the basic use cases (lat/lon and bbox queries) though I do
>>> think a general dimensionality reducing lexicoder would be very useful.
>>>
>>>
>>>
>>> On Fri, Jul 25, 2014 at 6:51 PM, Chris Bennight <chris@slowcar.net>
>>> wrote:
>>>
>>>> A couple of related issues come up when considering implementing a
>>>> dimensionality reducing encoding -- just want to toss those out to see what
>>>> people think the interface might look like.
>>>>
>>>> There are a couple of aspects that could be brought in here, but let's
>>>> keep it simple and consider the original question: (lat/lon) -> number.
>>>>
>>>> --Desired precision of the binning process
>>>> The more bits we add to the z-curve, the more precise our comparison -
>>>> i.e. a 63 bit key would have more "locations" to sort by than a 24 bit key.
>>>>
>>>> Would you see a reasonable default getting picked, make this user
>>>> configurable, or both? (i.e. default to a value, extended options with a
>>>> new constructor?)
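
(For illustration, "both" could look as simple as this - hypothetical class
name, just to make the question concrete:)

// Hypothetical API sketch: a sane default plus an overloaded constructor.
public class ZCurveLexicoder {
  public static final int DEFAULT_BITS_PER_DIM = 31;  // two dims still fit a signed long
  private final int bitsPerDim;

  public ZCurveLexicoder() {
    this(DEFAULT_BITS_PER_DIM);
  }

  public ZCurveLexicoder(int bitsPerDim) {
    if (bitsPerDim < 1 || bitsPerDim > 31) {
      throw new IllegalArgumentException("bitsPerDim must be in [1, 31]");
    }
    this.bitsPerDim = bitsPerDim;
  }
}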
>>>>
>>>> --Semantics for turning two lat/long pairs into a range
>>>> I'm extrapolating here, but the only reason I see that locality matters
>>>> is if we want to preserve locality for range searches.  The internal
>>>> implementation of the encoding/lexicoding process is going to directly
>>>> impact the implementation of the range query.
>>>> Now sure, someone could encode the lower left point, encode the upper
>>>> right point, and construct a range out of that to pass for a scan, but
>>>> that's going to be wildly inefficient in most cases. See:
>>>> https://dl.dropboxusercontent.com/u/6649380/bbox.png
>>>> If we just lexicode the lower left and upper right we traverse across
>>>> the entire curve - hitting lots of areas that aren't actually in the
>>>> original range.
>>>> Now we can turn a single 2D range into a set of 1D ranges.   There is
>>>> some potential tuning here now, as the algorithm has a tradeoff on time to
>>>> compute the ranges (and number of ranges) vs.  "slop"  (or inclusion of
>>>> ranges which aren't actually in the original query).
>>>> Would you see a static method perhaps on the z-curve lexicoder that
>>>> returns a series of ranges based on an input window?  Some other mechanism?
>>>> And in the case of "slop" - would we just document that the ranges
>>>> could actually include values not expected - or would we always fully
>>>> decompose?
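
(To make the static-method question concrete, a purely hypothetical
signature - with the slop knob expressed as a cap on the number of ranges:)

import java.util.List;
import org.apache.accumulo.core.data.Range;

// Hypothetical shape for the range-decomposition entry point (names illustrative).
public class ZCurveRanges {
  // Returns scan ranges covering the query window.  maxRanges bounds the
  // decomposition effort: once the cap is hit, remaining quads would be
  // emitted at their current, coarser resolution - which is where documented
  // "slop" (rows outside the query window) would come from.
  public static List<Range> decompose(double minLon, double minLat,
                                      double maxLon, double maxLat,
                                      int maxRanges) {
    throw new UnsupportedOperationException("signature sketch only");
  }
}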
>>>>
>>>> ------------------
>>>>
>>>> The other, "making it more complicated" questions would probably
>>>> revolve around generalizing this to a multi-dimensional lexicoder.   I
>>>> would expect the WGS84 (lat/long) encoder would just be a 2D instance with
>>>> -180/+180 -90/+90 bounds.   There are probably completely different cases
>>>> where it would be useful to have a locality sensitive hash.
>>>> But with the big caveat that this requires more methods - I now need a
>>>> way of defining a range for each of the dimensions (which before were
>>>> defined as lat/lon), and I need an interface/class to pass those dimensions
>>>> to the encoder - and should be able to use those same definitions to decode
>>>> my values on the way out.
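
(Concretely, the kind of dimension definition I mean - again just a
hypothetical sketch:)

// Hypothetical sketch of a per-dimension definition for a generic N-dimensional
// z-curve lexicoder; the WGS84 case is just two of these.
public class Dimension {
  final double min, max;
  final int bits;

  public Dimension(double min, double max, int bits) {
    this.min = min;
    this.max = max;
    this.bits = bits;
  }

  // WGS84 longitude/latitude as 31-bit dimensions
  public static final Dimension LONGITUDE = new Dimension(-180.0, 180.0, 31);
  public static final Dimension LATITUDE = new Dimension(-90.0, 90.0, 31);
}

A constructor along the lines of ZNLexicoder(Dimension... dims) could then
cover both the generic case and the WGS84 case, and the same objects carry
what's needed to decode on the way out.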
>>>>
>>>> I'm not trying to make a mountain out of a molehill - one could
>>>> definitely pick some sane defaults, document behaviour, and put in a WGS84
>>>> specific implementation.  I'm just thinking what's the minimum viable bit
>>>> that's needed for this to be useful - and I suspect it's also the range
>>>> decomposition piece - as I suspect *everyone* would really need that (if
>>>> they cared about locality in the first place).
>>>>
>>>> My main question here would be to what extent would you see this going?
>>>>  A one-off for WGS84, or a more generic multi-dimensional lexicoder; and
>>>> in either case, what would the thoughts/consensus be for implementation of
>>>> the additional methods (and possibly classes) required?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Jul 25, 2014 at 12:37 PM, Sean Busbey <busbey@cloudera.com>
>>>> wrote:
>>>>
>>>>>
>>>>> On Fri, Jul 25, 2014 at 8:45 AM, Anthony F <afccri@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> GeoMesa plans to update to 1.6 as soon as we get past a 1.0 RELEASE
>>>>>> (which is waiting on approval by the Eclipse Foundation's IP review
>>>>>> process and should happen in a month or so).  I think we could break
>>>>>> out a z-curve lexicoder and contribute before then though.  I'll poke
>>>>>> around at the lexicoder stuff in 1.6 and see what it would take to use
>>>>>> that interface now.
>>>>>>
>>>>>>
>>>>>
>>>>> That sounds great Anthony. Let us know if we can help with anything!
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Sean
>>>>>
>>>>
>>>>
>>>
>>
>
