accumulo-user mailing list archives

From Russ Weeks <rwe...@newbrightidea.com>
Subject Re: Geospatial + Partitioned Index
Date Fri, 16 Jan 2015 20:37:43 GMT
Hi, Josh,

Thanks for your response. I think I should clarify something. When I said,
"the client would just scan (-inf, +inf)", I didn't mean that the net
effect would be to read all data. I just meant that my custom Iterator
would seek() to ranges which are a function of its configuration and its
knowledge of the partitioning scheme, just like the IntersectingIterator.
Except that instead of its configuration defining a set of keyword terms,
it would define a set of disjoint intervals on a space-filling curve.

My understanding is that setting the scan range to (-inf,+inf) in this case
is just a way to tell Accumulo, "run this scan across all tablets".
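
A minimal sketch of that seek logic, in Python for brevity (a real implementation would be a Java iterator running on the tablet servers). The row layout `<partition>_<curve index>`, the helper names, and the in-memory list standing in for a tablet are all assumptions made for illustration, not the actual schema:

```python
import bisect
from typing import List, Tuple

Interval = Tuple[int, int]   # inclusive [lo, hi] on the space-filling curve
RowRange = Tuple[str, str]   # inclusive [start_row, end_row]


def row_key(partition: int, curve_index: int) -> str:
    # Hypothetical row layout: zero-padded fields, so lexicographic order
    # of rows matches numeric order of (partition, curve_index).
    return f"{partition:03d}_{curve_index:016x}"


def seek_ranges(partition: int, intervals: List[Interval]) -> List[RowRange]:
    # What the iterator would derive from its configuration (the intervals)
    # plus its knowledge of which partition it is running over.
    return [(row_key(partition, lo), row_key(partition, hi))
            for lo, hi in sorted(intervals)]


def scan_partition(sorted_rows: List[str], partition: int,
                   intervals: List[Interval]) -> List[str]:
    # sorted_rows stands in for one tablet's sorted keys.
    hits = []
    for start, end in seek_ranges(partition, intervals):
        i = bisect.bisect_left(sorted_rows, start)   # the "seek"
        while i < len(sorted_rows) and sorted_rows[i] <= end:
            hits.append(sorted_rows[i])
            i += 1
    return hits


# One fake tablet holding partition 7; only rows inside the configured
# intervals are touched, even though the "scan" covers the whole tablet.
tablet = sorted(row_key(7, c) for c in (1, 5, 10, 11, 20, 300))
matches = scan_partition(tablet, 7, [(10, 20), (0, 2)])
```

The client never sees the intervals as ranges here; it only passes them as iterator configuration, and each partition does its own seeking.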

-Russ

On Fri, Jan 16, 2015 at 12:17 PM, Josh Elser <josh.elser@gmail.com> wrote:

> Russ Weeks wrote:
>
>> Hey, all,
>>
>> I'm looking at switching my geospatial index to a partitioned index to
>> smooth out some hotspots. So for any query, I'll have a bunch of ranges
>> representing intervals on a Hilbert curve, plus a bunch of partitions,
>> each of which needs to be scanned for every range.
>>
>> The way that the (excellent!) Accumulo Recipes geospatial store
>> addresses this is to take the product of the partitions and the curve
>> intervals[1]. It seems like an alternative would be to encode the curve
>> intervals as a property of a custom iterator (I need one anyways to
>> filter out extraneous points from the search area) and then the client
>> would just scan (-inf, +inf), which I think is more typical when
>> querying a partitioned index?
>>
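
For comparison, the product approach can be sketched client-side as follows. This is an illustration in Python, not the Accumulo Recipes code; the `row_key` layout and the function names are assumptions. In Accumulo itself the resulting list would be handed to something like `BatchScanner.setRanges`:

```python
from itertools import product
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]   # inclusive [lo, hi] on the space-filling curve
RowRange = Tuple[str, str]   # inclusive [start_row, end_row]


def row_key(partition: int, curve_index: int) -> str:
    # Hypothetical row layout: zero-padded partition, then zero-padded
    # hex curve index, so string order matches numeric order.
    return f"{partition:03d}_{curve_index:016x}"


def product_ranges(partitions: Iterable[int],
                   intervals: List[Interval]) -> List[RowRange]:
    # One row range per (partition, interval) pair.
    return [(row_key(p, lo), row_key(p, hi))
            for p, (lo, hi) in product(partitions, intervals)]


# 4 partitions x 3 intervals = 12 ranges for the client to submit.
ranges = product_ranges(range(4), [(0, 2), (10, 20), (40, 41)])
print(len(ranges))
```

The range count grows as (number of partitions) x (number of intervals), which is the cost being weighed against pushing the intervals into an iterator.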
>
> I'm no expert on storing geo-spatial data, but having to scan (-inf,+inf)
> on a table for every query is typically the reason people put up with the
> pain of hot-spotting instead, although the full scan is the easiest to
> implement.
>
> If you can be "tricky" in how you're encoding your data in the row such
> that you can reduce the search space over your partitioned index, you can
> try to get the best of both worlds (avoid reading all data and still get a
> good distribution).
>
> Since that was extremely vague, here's an example: say you had a text
> index and wanted to look up the word "the" and your index had 100
> partitions, [0,99]. If you knew that it was only possible for "the" to show
> up on partitions 5, 27 and 83 (typically by use of some hashing function),
> you could drastically reduce your search space while still avoiding hot
> spotting on a single server.
>
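
One way to read the hashing example above, sketched in Python. The salted SHA-256 scheme, the choice of three candidate partitions per term, and all function names are assumptions for illustration; the thread does not specify a particular hash:

```python
import hashlib
from typing import List

NUM_PARTITIONS = 100   # partitions [0, 99], as in the example
CANDIDATES = 3         # assumed: each term may live on 3 partitions


def _hash(text: str) -> int:
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def candidate_partitions(term: str,
                         num_partitions: int = NUM_PARTITIONS,
                         candidates: int = CANDIDATES) -> List[int]:
    # Deterministically map a term to a small set of distinct partitions.
    if candidates > num_partitions:
        raise ValueError("cannot pick more candidates than partitions")
    chosen = set()
    salt = 0
    while len(chosen) < candidates:
        chosen.add(_hash(f"{salt}:{term}") % num_partitions)
        salt += 1
    return sorted(chosen)


def partition_for_write(term: str, doc_id: str) -> int:
    # Writers spread entries for one term across its candidates by doc id,
    # so a frequent term does not land on a single server.
    options = candidate_partitions(term)
    return options[_hash(doc_id) % len(options)]


# A reader looking up "the" scans only these partitions, not all 100.
to_scan = candidate_partitions("the")
```

Both writer and reader must agree on the hash and on the number of candidates, which is one form of exposing the partitioning scheme to client code.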
>> Can anybody comment on which approach is preferred? Is it common to
>> expose the number of partitions in the index and the encoding of those
>> partitions to client code? Am I needlessly worried that taking the
>> product of the curve intervals and the partitions will produce too many
>> ranges?
>>
>
> In the trivial sense, the client doesn't need to know the partitions and
> would just scan the entire index like you said earlier. You could also
> track the partitions that you have created in a separate table and the
> client could read that table to know ahead of time (if you have a reason to
> do so in your implementation).
>
> Depending on the amount of data you have, lots of ranges to check could
> take some time. YMMV
>
>
>> Thanks,
>> -Russ
>>
>> 1:
>> https://github.com/calrissian/accumulo-recipes/blob/master/store/geospatial-store/src/main/java/org/calrissian/accumulorecipes/geospatialstore/impl/AccumuloGeoSpatialStore.java#L190
>>
>
