accumulo-user mailing list archives

From Josh Elser
Subject Re: Geospatial + Partitioned Index
Date Fri, 16 Jan 2015 20:17:44 GMT
Russ Weeks wrote:
> Hey, all,
> I'm looking at switching my geospatial index to a partitioned index to
> smooth out some hotspots. So for any query, I'll have a bunch of ranges
> representing intervals on a Hilbert curve, plus a bunch of partitions,
> each of which needs to be scanned for every range.
> The way that the (excellent!) Accumulo Recipes geospatial store
> addresses this is to take the product of the partitions and the curve
> intervals[1]. It seems like an alternative would be to encode the curve
> intervals as a property of a custom iterator (I need one anyways to
> filter out extraneous points from the search area) and then the client
> would just scan (-inf, +inf), which I think is more typical when
> querying a partitioned index?

I'm no expert on storing geospatial data, but avoiding a (-inf,+inf) scan
of the table on every query is typically the reason people put up with the
pain of hot-spotting, even though the full scan is the easiest approach to
implement.

If you can be "tricky" in how you're encoding your data in the row such 
that you can reduce the search space over your partitioned index, you 
can try to get the best of both worlds (avoid reading all data and still 
get a good distribution).

Since that was extremely vague, here's an example: say you had a text 
index and wanted to look up the word "the" and your index had 100 
partitions, [0,99]. If you knew that it was only possible for "the" to 
show up on partitions 5, 27 and 83 (typically by use of some hashing 
function), you could drastically reduce your search space while still 
avoiding hot spotting on a single server.
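
A rough sketch of that idea, in Python for brevity. Everything here is an
assumption for illustration: the partition count, the fan-out of 3, the use
of MD5 as the hash, the "<partition>_<term>" row layout, and all of the
function names. It is not how any particular Accumulo library does it. The
writer picks one of a term's few candidate partitions, and the reader only
visits those candidates.

```python
import hashlib

NUM_PARTITIONS = 100  # assumed partition count, [0,99] as in the example
FANOUT = 3            # assumed number of partitions a term may land on


def _hash(value: str) -> int:
    """Stable 64-bit hash of a string (MD5 is an arbitrary choice here)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def candidate_partitions(term, num_partitions=NUM_PARTITIONS, fanout=FANOUT):
    """Return the sorted, distinct partitions on which `term` may be stored."""
    wanted = min(fanout, num_partitions)
    found = []
    salt = 0
    while len(found) < wanted:
        p = _hash(f"{term}#{salt}") % num_partitions
        if p not in found:  # skip collisions so we get distinct partitions
            found.append(p)
        salt += 1
    return sorted(found)


def partition_for_write(term, doc_id, num_partitions=NUM_PARTITIONS, fanout=FANOUT):
    """Pick one candidate partition for this (term, document) pair."""
    candidates = candidate_partitions(term, num_partitions, fanout)
    return candidates[_hash(doc_id) % len(candidates)]


def rows_for_query(term, num_partitions=NUM_PARTITIONS, fanout=FANOUT):
    """Rows a reader must visit, assuming a '<partition>_<term>' row layout."""
    width = len(str(num_partitions - 1))  # zero-pad so rows sort correctly
    return [f"{p:0{width}d}_{term}"
            for p in candidate_partitions(term, num_partitions, fanout)]
```

With these assumed defaults, a lookup of "the" touches 3 rows instead of
100, and writes for a very common term are still spread over 3 partitions
rather than 1.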

> Can anybody comment on which approach is preferred? Is it common to
> expose the number of partitions in the index and the encoding of those
> partitions to client code? Am I needlessly worried that taking the
> product of the curve intervals and the partitions will produce too many
> ranges?

In the trivial case, the client doesn't need to know the partitions and
just scans the entire index, as you described earlier. You could also
track the partitions you have created in a separate table, which the
client could read to learn them ahead of time (if your implementation has
a reason to do so).

Depending on the amount of data you have, checking lots of ranges could
take some time. YMMV.
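
To put a number on "lots of ranges": taking the product means one range
per (partition, curve interval) pair, so P partitions and I intervals give
P * I ranges. A small Python sketch of generating them, where the row
layout "<partition>_<hex curve index>", the padding widths, and the
function name are all assumptions for illustration:

```python
from itertools import product


def partitioned_ranges(partitions, intervals, width=16):
    """Yield one (start_row, end_row) pair per partition/interval pair.

    `intervals` holds inclusive (lo, hi) positions on the curve. Rows are
    assumed to look like '<2-digit partition>_<zero-padded hex index>'.
    """
    for p, (lo, hi) in product(partitions, intervals):
        yield (f"{p:02d}_{lo:0{width}x}", f"{p:02d}_{hi:0{width}x}")


# Example: 4 partitions and 2 curve intervals produce 8 ranges.
ranges = list(partitioned_ranges(range(4), [(0, 15), (32, 47)], width=4))
```

With, say, 100 partitions and a query that decomposes into 50 curve
intervals, that is 5,000 ranges for a single query.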

> Thanks,
> -Russ
> 1:
