My first thought was just something simple for a first pass  lat/lon > a
single lexicographic dimension  as it would cover most basic use cases.
Precision (number of bits in encoding) could be an arg or a config
variable. For WITHIN/INTERSECTS topological predicates, we need to
decompose the query geometry into the (possibly/probably noncontiguous) 1D
ranges that cover the region in question. GeoMesa has an algorithm to
quickly perform this decomposition that computes the minimum number of
geohash prefixes at different resolutions to fully cover the query
polygon. Basically, it recursively traverses through levels of geohash
resolutions, prioritizing rectangles that intersect the region and
discarding nonintersecting rectangles at the lowest precisions, to produce
a sequence of ranges to scan over. Fully contained rectangles are
discovered at their lowest resolution at which point the algorithm pops the
stack and searches the next prioritized prefix. I think something like
this would definitely need to be ported over and included in a lexicoder
implementation to make it useful. Also, rather than materialize the entire
set of ranges in memory, we can either return a lazy iterator of prefixes
that can be fed into a scanner in batches or we can have a shortcircuit
config that tunes the amount of slop that's tolerable and cuts off the
traversal at a certain level of precision. GeoMesa uses something like the
former on attribute indexes to coordinate parallel scanners on separate
index and record tables.
Thoughts? I'm inclined to keep the implementation to the bare minimum
necessary for the basic use cases (lat/lon and bbox queries) though I do
think a general dimensionality reducing lexicoder would be very useful.
On Fri, Jul 25, 2014 at 6:51 PM, Chris Bennight <chris@slowcar.net> wrote:
> A couple of related issues come up when considering implementing a
> dimensionality reducing encoding  just want to toss those out to see what
> people think the interface might look like.
>
> There's a couple of aspects that could be brought in here, but lets keep
> it simple and considering the original question: (lat/lon) > number.
>
> Desired precision of the binning process
> The more bits we add to the zcurve, the more precise our comparison 
> i.e. a 63 bit key would have more "locations" to sort by than a 24 bit key.
>
> Would you see a reasonable default getting picked, make this user
> configurable, or both? (i.e. default to a value, extended options with a
> new constructor?)
>
> Semantics for turning two lat/long pairs into a range
> I'm extrapolating here, but the only reason I see that locality matters is
> if we want to preserve locality for range searches. The internal
> implementation of the encoding/lexicoding process is going to directly
> impact the implementation of the range query.
> Now sure, someone could encode the lower left point, encode the upper
> right point, and construct a range out of that to pass for a scan, but
> that's going to be wildly inefficient in most cases. See:
> https://dl.dropboxusercontent.com/u/6649380/bbox.png
> If we just lexicode the lower left and upper right we traverse across the
> entire curve  hitting lots of areas that aren't actually in the original
> range.
> Now we can turn a single 2D range into a set of 1D ranges. There is some
> potential tuning here now, as the algorithm has a tradeoff on time to
> compute the ranges (and number of ranges) vs. "slop" (or inclusion of
> ranges which aren't actually in the original query).
> Would you see a static method perhaps on the zcurve lexicoder that
> returns a series of ranges based on an input window? Some other mechanism?
> And in the case of "slop"  would we just document that the ranges could
> actually include values not expected  or would we always fully decompose?
>
> 
>
> The other, "making it more complicated" questions would probably resolve
> around generalizing this to a multidimensional lexicoder. I would expect
> the WGS84 (lat/long) encoder would just be a 2D instance with 180/+180
> 90/+90 bounds. There are probably completely different cases where it
> would be useful to have a locality sensitive hash.
> But with the big caveat that this requires more methods  I now need a way
> of defining a range for each of the dimensions (which before were defined
> as lat/lon), and I need an interface/class to pass those dimensions to the
> encoder  and should be able to use those same definitions to decode my
> values on the way out.
>
> I'm not trying to make a mountain out of a molehill  one could definitely
> pick some sane defaults, document behaviour, and put in a WGS84 specific
> implementation. I'm just thinking what's the minimum viable bit that's
> needed for this to be useful  and I suspect it's also the range
> decomposition piece  as I suspect *everyone* would really need that (if
> they cared about locality in the first place).
>
> My main question here would be to what extent would you see this going? A
> one off for WGS84, or a more generic multidimensional lexicoder; and in
> either case, what would the thoughts/consensus be for implementation of the
> additional methods (and possibly classes) required?
>
>
>
>
>
> On Fri, Jul 25, 2014 at 12:37 PM, Sean Busbey <busbey@cloudera.com> wrote:
>
>>
>> On Fri, Jul 25, 2014 at 8:45 AM, Anthony F <afccri@gmail.com> wrote:
>>
>>>
>>> GeoMesa plans to update to 1.6 as soon as we get past a 1.0 RELEASE
>>> (which is waiting on approval by the Eclipse Foundation's IP review process
>>> and should happen in a month or so). I think we could break out a zcurve
>>> lexicoder and contribute before then though. I'll poke around at the
>>> lexicoder stuff in 1.6 and see what it would take to use that interface now.
>>>
>>>
>>
>> That sounds great Anthony. Let us know if we can help with anything!
>>
>>
>>
>> 
>> Sean
>>
>
>
