Anthony 
So that sounds reasonable - it sounds like the consensus is a basic
WGS84 (lat/long) type (point only), a range query (lower left, upper right),
and a full decomposition method (i.e. only intersecting ranges). (The caveat
is that we probably want to cap the precision, as a full decomposition of,
say, a 31/31-bit z-curve might take a long time.)
What's the performance of your query like for a 24-bit/24-bit (x,y) z-curve
on a full decomposition over a significant portion of the range? In the
interest of keeping it simple/easy, I'm thinking it might be best just to
fix the precision and document it. 24 bits would still give us a resolution
of ~2m at the equator (which should be the worst case), good enough for most
things (basically 0.00002 degrees would be the maximum granularity for
longitude).
If it's fast enough, I would just cap at 64 (well, maybe 63 to avoid signed
issues) total bits to keep the key as a long - 31 bits/31 bits gives a
longitudinal granularity of ~0.00000017 degrees, or about 1.9cm at the
equator. I think a full decomposition on a 31/31-bit grid might be expensive.
(The values above assume we normalize the input values, which I think is a
given for this implementation?)
If we can fix the precision then that takes out most of the variables and
it's pretty straightforward - I think decomposition performance is the main
limiting factor.
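
As a rough illustration of the numbers above, here is a minimal sketch of
the normalization and bit interleaving, assuming a fixed 31/31-bit split
packed into a 62-bit z-value (hypothetical names, not a proposed API):

    public class ZCurveSketch {

        private static final int BITS = 31; // per dimension; 2 * 31 = 62 bits fits in a signed long

        // Normalize a coordinate from [min, max] to an unsigned BITS-bit integer bin.
        static long normalize(double value, double min, double max) {
            double scaled = (value - min) / (max - min);   // 0.0 .. 1.0
            long cells = 1L << BITS;
            long bin = (long) (scaled * cells);
            return Math.min(bin, cells - 1);               // clamp the max edge
        }

        // Interleave the bits of x and y (x in even positions, y in odd) into the z-value.
        static long interleave(long x, long y) {
            long z = 0;
            for (int i = 0; i < BITS; i++) {
                z |= ((x >> i) & 1L) << (2 * i);
                z |= ((y >> i) & 1L) << (2 * i + 1);
            }
            return z;
        }

        public static long encode(double lon, double lat) {
            long x = normalize(lon, -180.0, 180.0);
            long y = normalize(lat, -90.0, 90.0);
            return interleave(x, y);
        }
    }

At 31 bits per dimension a longitude cell is 360 / 2^31 ~= 0.00000017
degrees wide, which is where the ~1.9cm-at-the-equator figure comes from.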
Jared 
Yep, exactly that - which would be unintuitive behaviour (getting back
ranges that didn't really match). That said, an inclusive-only range set
can be generated: you can model it as a quadtree decomposition and recurse
whenever your current region ("quad") isn't fully covered by your
selection window - otherwise return the range of your current "quad".
Which, I think, is essentially what Anthony described in his
implementation (though collecting with a stack instead of recursing;
probably for the best there).
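
A minimal sketch of that recursive decomposition, under the same assumptions
as the encoding sketch above (hypothetical names; a real implementation
would likely use an explicit stack, map each emitted quad to its contiguous
z-value range, and cap the depth to keep the worst case bounded at the cost
of cell-resolution slop along the window boundary):

    import java.util.ArrayList;
    import java.util.List;

    public class QuadDecompose {

        // An axis-aligned box, used both for quadtree cells and the query window.
        static class Box {
            final double minX, minY, maxX, maxY;
            Box(double minX, double minY, double maxX, double maxY) {
                this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
            }
            boolean intersects(Box o) {
                return minX < o.maxX && maxX > o.minX && minY < o.maxY && maxY > o.minY;
            }
            boolean within(Box o) {
                return minX >= o.minX && maxX <= o.maxX && minY >= o.minY && maxY <= o.maxY;
            }
        }

        // Emit quads fully covered by the window; recurse on partial overlaps; stop at maxDepth.
        // Each grid-aligned quad corresponds to one contiguous range of z-values (a prefix).
        static void decompose(Box quad, Box window, int depth, int maxDepth, List<Box> out) {
            if (!quad.intersects(window)) {
                return;                       // discard non-intersecting quads early
            }
            if (quad.within(window) || depth == maxDepth) {
                out.add(quad);                // fully covered, or the precision cap was reached
                return;
            }
            double midX = (quad.minX + quad.maxX) / 2;
            double midY = (quad.minY + quad.maxY) / 2;
            decompose(new Box(quad.minX, quad.minY, midX, midY), window, depth + 1, maxDepth, out);
            decompose(new Box(midX, quad.minY, quad.maxX, midY), window, depth + 1, maxDepth, out);
            decompose(new Box(quad.minX, midY, midX, quad.maxY), window, depth + 1, maxDepth, out);
            decompose(new Box(midX, midY, quad.maxX, quad.maxY), window, depth + 1, maxDepth, out);
        }

        public static List<Box> cover(Box window, int maxDepth) {
            List<Box> out = new ArrayList<>();
            decompose(new Box(-180, -90, 180, 90), window, 0, maxDepth, out);
            return out;
        }
    }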
Corey 
So you think the lexicoder would make sense/be reasonable if there were a
performant range query tool to go with it (one that results in zero false
positives)?
On Mon, Jul 28, 2014 at 5:10 PM, Corey Nolet <cjnolet@gmail.com> wrote:
> On Calrissian, we had once considered making a lexicoder for lat/long that
> transformed the two dimensions (lat/lon) down into a geohash based on the
> z-curve.
>
> The reason we decided against a first-class datatype for this is exactly
> the same reason that Anthony brings up in his previous comment: the geohash
> only makes sense as a query tool when you have a good way to structure (at
> least as many as possible of) the non-sequential ranges you would need in
> order to find all the overlapping regions (with as few false positives as
> possible).
>
>
> On Mon, Jul 28, 2014 at 2:00 PM, Jared Winick <jaredwinick@gmail.com>
> wrote:
>
>> As several people have commented, a single range for a query can produce
>> a lot of false positives that need to be filtered out. I had made this
>> visualization a while back that lets you interactively (click-drag a
>> bounding box) see this behavior.
>>
>> http://bl.ocks.org/jaredwinick/5073432
>>
>>
>>
>>
>> On Sun, Jul 27, 2014 at 2:15 PM, Anthony Fox <anthonyfox@ccri.com> wrote:
>>
>>> My first thought was just something simple for a first pass - lat/lon ->
>>> a single lexicographic dimension - as it would cover most basic use
>>> cases. Precision (number of bits in the encoding) could be an arg or a
>>> config variable. For WITHIN/INTERSECTS topological predicates, we need to
>>> decompose the query geometry into the (possibly/probably non-contiguous)
>>> 1D ranges that cover the region in question. GeoMesa has an algorithm to
>>> quickly perform this decomposition that computes the minimum number of
>>> geohash prefixes at different resolutions to fully cover the query
>>> polygon. Basically, it recursively traverses through levels of geohash
>>> resolution, prioritizing rectangles that intersect the region and
>>> discarding non-intersecting rectangles at the lowest precisions, to
>>> produce a sequence of ranges to scan over. Fully contained rectangles are
>>> discovered at their lowest resolution, at which point the algorithm pops
>>> the stack and searches the next prioritized prefix. I think something
>>> like this would definitely need to be ported over and included in a
>>> lexicoder implementation to make it useful. Also, rather than materialize
>>> the entire set of ranges in memory, we can either return a lazy iterator
>>> of prefixes that can be fed into a scanner in batches, or we can have a
>>> short-circuit config that tunes the amount of slop that's tolerable and
>>> cuts off the traversal at a certain level of precision. GeoMesa uses
>>> something like the former on attribute indexes to coordinate parallel
>>> scanners on separate index and record tables.
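>>>
>>> To make the "fed into a scanner in batches" idea concrete, here is a
>>> minimal sketch of draining a lazily produced iterator of ranges in
>>> fixed-size batches (hypothetical table name and batch size; assumes an
>>> Accumulo Connector is already available):
>>>
>>>     import java.util.ArrayList;
>>>     import java.util.Iterator;
>>>     import java.util.List;
>>>     import java.util.Map;
>>>     import org.apache.accumulo.core.client.BatchScanner;
>>>     import org.apache.accumulo.core.client.Connector;
>>>     import org.apache.accumulo.core.data.Key;
>>>     import org.apache.accumulo.core.data.Range;
>>>     import org.apache.accumulo.core.data.Value;
>>>     import org.apache.accumulo.core.security.Authorizations;
>>>
>>>     public class BatchedGeoScan {
>>>         // Drain a lazy iterator of ranges in fixed-size batches, one scan per batch.
>>>         public static void scanInBatches(Connector conn, Iterator<Range> ranges) throws Exception {
>>>             int batchSize = 500;                          // tunable
>>>             while (ranges.hasNext()) {
>>>                 List<Range> batch = new ArrayList<>();
>>>                 while (ranges.hasNext() && batch.size() < batchSize) {
>>>                     batch.add(ranges.next());
>>>                 }
>>>                 BatchScanner scanner = conn.createBatchScanner("geo_index", new Authorizations(), 8);
>>>                 try {
>>>                     scanner.setRanges(batch);
>>>                     for (Map.Entry<Key, Value> entry : scanner) {
>>>                         // post-filter entries here if the decomposition allowed any slop
>>>                     }
>>>                 } finally {
>>>                     scanner.close();
>>>                 }
>>>             }
>>>         }
>>>     }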
>>>
>>> Thoughts? I'm inclined to keep the implementation to the bare minimum
>>> necessary for the basic use cases (lat/lon and bbox queries), though I do
>>> think a general dimensionality-reducing lexicoder would be very useful.
>>>
>>>
>>>
>>> On Fri, Jul 25, 2014 at 6:51 PM, Chris Bennight <chris@slowcar.net>
>>> wrote:
>>>
>>>> A couple of related issues come up when considering implementing a
>>>> dimensionality-reducing encoding - just want to toss those out to see
>>>> what people think the interface might look like.
>>>>
>>>> There are a couple of aspects that could be brought in here, but let's
>>>> keep it simple and consider the original question: (lat/lon) -> number.
>>>>
>>>> Desired precision of the binning process
>>>> The more bits we add to the z-curve, the more precise our comparison -
>>>> i.e. a 63-bit key would have more "locations" to sort by than a 24-bit
>>>> key.
>>>>
>>>> Would you see a reasonable default getting picked, make this user
>>>> configurable, or both? (i.e. default to a value, extended options with a
>>>> new constructor?)
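>>>>
>>>> One way the "both" option might look - a minimal sketch only, with a
>>>> hypothetical class name, not a settled API:
>>>>
>>>>     public class ZCurveLexicoder {
>>>>         public static final int DEFAULT_BITS_PER_DIMENSION = 31; // 62 bits total fits in a long
>>>>
>>>>         private final int bitsPerDimension;
>>>>
>>>>         public ZCurveLexicoder() {
>>>>             this(DEFAULT_BITS_PER_DIMENSION);             // sane default
>>>>         }
>>>>
>>>>         public ZCurveLexicoder(int bitsPerDimension) {    // user-configurable precision
>>>>             if (bitsPerDimension < 1 || bitsPerDimension > 31) {
>>>>                 throw new IllegalArgumentException("bits per dimension must be in [1, 31]");
>>>>             }
>>>>             this.bitsPerDimension = bitsPerDimension;
>>>>         }
>>>>     }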
>>>>
>>>> Semantics for turning two lat/long pairs into a range
>>>> I'm extrapolating here, but the only reason I see that locality matters
>>>> is if we want to preserve locality for range searches. The internal
>>>> implementation of the encoding/lexicoding process is going to directly
>>>> impact the implementation of the range query.
>>>> Now sure, someone could encode the lower-left point, encode the
>>>> upper-right point, and construct a range out of that to pass to a scan,
>>>> but that's going to be wildly inefficient in most cases. See:
>>>> https://dl.dropboxusercontent.com/u/6649380/bbox.png
>>>> If we just lexicode the lower left and upper right, we traverse across
>>>> the entire curve - hitting lots of areas that aren't actually in the
>>>> original range.
>>>> Instead, we can turn a single 2D range into a set of 1D ranges. There is
>>>> some potential tuning here, as the algorithm trades time to compute the
>>>> ranges (and the number of ranges produced) against "slop" (the inclusion
>>>> of ranges which aren't actually in the original query).
>>>> Would you see a static method, perhaps on the z-curve lexicoder, that
>>>> returns a series of ranges based on an input window? Some other
>>>> mechanism? And in the case of "slop" - would we just document that the
>>>> ranges could actually include unexpected values, or would we always
>>>> fully decompose?
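>>>>
>>>> As one possible shape for that static decomposition method - purely a
>>>> sketch with hypothetical names and signature, not a settled API:
>>>>
>>>>     import java.util.Collections;
>>>>     import java.util.List;
>>>>     import org.apache.accumulo.core.data.Range;
>>>>
>>>>     public final class ZCurveRanges {
>>>>         private ZCurveRanges() {}
>>>>
>>>>         // Decompose a bounding box (lower-left / upper-right, in degrees) into 1D row ranges.
>>>>         // maxRanges caps the work: once the budget is hit, remaining partially covered
>>>>         // quads are emitted as-is - exactly the "slop" tradeoff described above.
>>>>         public static List<Range> decompose(double minLon, double minLat,
>>>>                                             double maxLon, double maxLat,
>>>>                                             int maxRanges) {
>>>>             // ... quadtree traversal as sketched earlier in the thread ...
>>>>             return Collections.emptyList(); // placeholder
>>>>         }
>>>>     }
>>>>
>>>> A call like decompose(-77.2, 38.8, -76.9, 39.0, 200) would then cover
>>>> that bounding box with at most 200 ranges, with documentation noting that
>>>> some of them may contain values outside the original window.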
>>>>
>>>> 
>>>>
>>>> The other, "making it more complicated" questions would probably
>>>> revolve around generalizing this to a multidimensional lexicoder. I
>>>> would expect the WGS84 (lat/long) encoder would just be a 2D instance
>>>> with -180/+180 and -90/+90 bounds. There are probably completely
>>>> different cases where it would be useful to have a locality-sensitive
>>>> hash. But with the big caveat that this requires more methods - I now
>>>> need a way of defining a range for each of the dimensions (which before
>>>> were defined as lat/lon), and I need an interface/class to pass those
>>>> dimensions to the encoder - and I should be able to use those same
>>>> definitions to decode my values on the way out.
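>>>>
>>>> One rough way those dimension definitions might look (hypothetical
>>>> names; just a sketch of the extra surface area being described, not a
>>>> proposal):
>>>>
>>>>     import java.util.Arrays;
>>>>     import java.util.List;
>>>>
>>>>     // Describes one dimension: its bounds and how many bits it contributes to the key.
>>>>     class Dimension {
>>>>         final double min, max;
>>>>         final int bits;
>>>>         Dimension(double min, double max, int bits) {
>>>>             this.min = min; this.max = max; this.bits = bits;
>>>>         }
>>>>     }
>>>>
>>>>     // A generic N-dimensional z-order encoder is configured with those definitions
>>>>     // and reuses them on decode, so values round-trip.
>>>>     class MultiDimensionalZCurve {
>>>>         private final List<Dimension> dimensions;
>>>>         MultiDimensionalZCurve(List<Dimension> dimensions) {
>>>>             this.dimensions = dimensions;
>>>>         }
>>>>         // The WGS84 (lat/long) case is then just a 2D instance:
>>>>         static MultiDimensionalZCurve wgs84() {
>>>>             return new MultiDimensionalZCurve(Arrays.asList(
>>>>                 new Dimension(-180.0, 180.0, 31),   // longitude
>>>>                 new Dimension(-90.0, 90.0, 31)));   // latitude
>>>>         }
>>>>     }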
>>>>
>>>> I'm not trying to make a mountain out of a molehill - one could
>>>> definitely pick some sane defaults, document behaviour, and put in a
>>>> WGS84-specific implementation. I'm just thinking about what the minimum
>>>> viable bit is that's needed for this to be useful - and I suspect it's
>>>> also the range decomposition piece, as I suspect *everyone* would really
>>>> need that (if they cared about locality in the first place).
>>>>
>>>> My main question here would be: to what extent would you see this going?
>>>> A one-off for WGS84, or a more generic multidimensional lexicoder? And
>>>> in either case, what would the thoughts/consensus be for implementing
>>>> the additional methods (and possibly classes) required?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Jul 25, 2014 at 12:37 PM, Sean Busbey <busbey@cloudera.com>
>>>> wrote:
>>>>
>>>>>
>>>>> On Fri, Jul 25, 2014 at 8:45 AM, Anthony F <afccri@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> GeoMesa plans to update to 1.6 as soon as we get past a 1.0 RELEASE
>>>>>> (which is waiting on approval by the Eclipse Foundation's IP review
>>>>>> process and should happen in a month or so). I think we could break
>>>>>> out a z-curve lexicoder and contribute before then though. I'll poke
>>>>>> around at the lexicoder stuff in 1.6 and see what it would take to use
>>>>>> that interface now.
>>>>>>
>>>>>>
>>>>>
>>>>> That sounds great, Anthony. Let us know if we can help with anything!
>>>>>
>>>>>
>>>>>
>>>>> 
>>>>> Sean
>>>>>
>>>>
>>>>
>>>
>>
>
