accumulo-user mailing list archives

From Kurt Christensen <hoo...@hoodel.com>
Subject Re: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)
Date Wed, 19 Jun 2013 00:53:55 GMT

An effective optimization strategy will be largely influenced by the 
nature of your data.

You say you have point data. Are time series geographically fixed, with 
only the time dimension changing? ... or are the time series moving in 
space-time?

I was going to suggest a 3-D approach, bit-interleaving your space and 
time [modulo timespan] together (or a point-tree, octree, k-d trie, or 
r-d trie). The trick there is to pick a time span large enough 
so that any interval you query is small relative to the time span, but 
small enough so that you don't waste a bunch (up to an eighth) of your 
usable hash values with no useful time data (i.e. populate your most 
significant bits). This would work if your data were geographically 
fixed, but changing only in time. If your time span is geologic, you 
might want to use a logarithmic time scale.
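
To make the bit-interleaving idea concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration only: the function names, the 10 bits per dimension, the lon/lat bounds, and the hex encoding of the key are not taken from anyone's actual schema. It quantizes longitude, latitude, and time (modulo the chosen time span) to integers and interleaves their bits, most significant first, so nearby points in space-time tend to share a key prefix.

```python
def quantize(value, lo, hi, bits):
    """Map a value in [lo, hi] to an integer cell in [0, 2**bits - 1]."""
    cells = 1 << bits
    idx = int((value - lo) / (hi - lo) * cells)
    # Clamp so that value == hi (or slight overshoot) stays in range.
    return min(max(idx, 0), cells - 1)


def interleave3(x, y, t, bits):
    """Interleave the bits of x, y, t (most significant first, order x, y, t)."""
    key = 0
    for i in range(bits - 1, -1, -1):
        key = (key << 1) | ((x >> i) & 1)
        key = (key << 1) | ((y >> i) & 1)
        key = (key << 1) | ((t >> i) & 1)
    return key


def spacetime_key(lon, lat, epoch_s, timespan_s, bits=10):
    """Hypothetical row-key builder: hex string of the interleaved cells.

    Time is taken modulo timespan_s, so keys wrap around once per span.
    """
    x = quantize(lon, -180.0, 180.0, bits)
    y = quantize(lat, -90.0, 90.0, bits)
    t = quantize(epoch_s % timespan_s, 0, timespan_s, bits)
    width = (3 * bits + 3) // 4  # hex digits needed for 3 * bits bits
    return format(interleave3(x, y, t, bits), "0{}x".format(width))


if __name__ == "__main__":
    one_year = 365 * 24 * 3600
    print(spacetime_key(-76.99, 39.57, 1371603235, one_year))
```

Because the time bits are spread through the key rather than appended at the end, a prefix scan narrows space and time together, which is the point of choosing the time span carefully.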

If you have time series (identified by <id>) moving in space-time, then 
I would add an indirection. Use the space-time hash to determine the IDs 
intersecting your zone and then query again, using the IDs to pull out 
the time series, filtering with your iterator, perhaps using the native 
timestamp field.
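
The two-step lookup might look roughly like the sketch below. This is plain Python with in-memory dictionaries standing in for the two tables; it does not use the Accumulo API, and the names (`index`, `series`, `ids_in_zone`, `series_in_window`) and data layout are hypothetical. In a real deployment the first step would presumably be a scan over the space-time index and the second a batch lookup by ID, with the time filter applied server-side.

```python
def ids_in_zone(index, cells):
    """Step 1: collect the series IDs recorded under each space-time cell."""
    ids = set()
    for cell in cells:
        ids |= index.get(cell, set())
    return ids


def series_in_window(series, ids, t_start, t_end):
    """Step 2: fetch each series by ID and keep samples in [t_start, t_end]."""
    result = {}
    for sid in sorted(ids):
        samples = [(ts, v) for ts, v in series.get(sid, []) if t_start <= ts <= t_end]
        if samples:  # drop series with nothing inside the time window
            result[sid] = samples
    return result


if __name__ == "__main__":
    # Hypothetical index table: space-time cell -> IDs seen in that cell.
    index = {"a1": {"s1", "s2"}, "a2": {"s2"}, "b9": {"s3"}}
    # Hypothetical data table: ID -> list of (timestamp, value) samples.
    series = {
        "s1": [(100, 1.0), (200, 1.5)],
        "s2": [(150, 2.0), (900, 2.5)],
        "s3": [(120, 3.0)],
    }
    hits = ids_in_zone(index, ["a1", "a2"])
    print(series_in_window(series, hits, 100, 200))
```

The index only has to answer "which IDs touched this zone", so it stays small; the bulk of the filtering by time then happens per series rather than per tile.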

I hope that helps. Good luck.

Kurt

BTW: 50% filtering isn't really that inefficient. - kkc


On 6/18/13 12:36 AM, Jared Winick wrote:
> Have you considered a "geohash" of all 3 dimensions together and using 
> that as the RowID? I have never implemented a geohash exactly, but I 
> do know it is possible to build a z-order curve on more than 2 
> dimensions, which may be what you want considering that it sounds like 
> all your queries are in 3-dimensions.
>
>
> On Mon, Jun 17, 2013 at 7:56 PM, Iezzi, Adam [USA] <iezzi_adam@bah.com> wrote:
>
>     I’ve been asked by my client to store a dataset which contains a
>     time series and geospatial coordinates (points) in Accumulo. At
>     the moment, we have very dense data stored in Accumulo using the
>     following table schema:
>
>     Row ID: <geohash>_<reverse timestamp>
>
>     Family: <id>
>
>     Qualifier: attribute
>
>     Value: <value>
>
>     We are salting our RowIDs with a geohash to prevent hot spotting.
>     When we query the data, we use a prefix scan (center tile and
>     eight neighbors), then using an Iterator to filter out the
>     outliers (points and time). Unfortunately, we’ve noticed some
>     performance issues with this approach in that it seems as the
>     initial prefix scan brings back a ton of data, forcing the
>     iterators to filter out a significant amount of outliers. E.g.
>     more than 50% is being filtered out, which seems inefficient to
>     us. Unfortunately for us, our users will always query by space and
>     time, making them equally important for each query. Because of the
>     time series component to our data, we’re often bringing back a
>     significant amount of data for each given point. Each point can
>     have ten entries due to the time series, making our data set very
>     very dense.
>
>     The following are some options we’re considering:
>
>     1. Salt a master table with an ID rather than the geohash
>     <id>_<reverse timestamp>, and then create a spatial index table.
>     If we choose this option, I assume we would scan the index first,
>     then use a batch scanner with the ID from the first query.
>     Unfortunately, I still see us filtering out a significant amount
>     of data using this approach.
>
>     2. Keep the table design as is, and maybe a RegExFilter via a
>     custom Iterator.
>
>     3. Do something completely different, such as use a Column Family
>     and the temporal aspect of the dataset together in some way.
>
>     Any advice or guidance would be greatly appreciated.
>
>     Thank you,
>
>     Adam
>
>

-- 

Kurt Christensen
P.O. Box 811
Westminster, MD 21158-0811

------------------------------------------------------------------------
"One of the penalties for refusing to participate in politics is that 
you end up being governed by your inferiors."
--- Plato
