accumulo-user mailing list archives

From "Iezzi, Adam [USA]" <iezzi_a...@bah.com>
Subject RE: [External] Re: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)
Date Wed, 19 Jun 2013 02:14:03 GMT
All,

Thank you for all of the replies. To answer some of the questions:

Q: You say you have point data. Are time series geographically fixed, with only the time dimension
changing? ... or are the time series moving in space-time?
A: The time series will be moving in space-time; therefore, the dataset is geologic. 

Q: If you have time series (identified by <id>) moving in space-time, then I would add
an indirection.
A: Our dataset is very similar to what you describe. Each geospatial point and time stamp
is defined by an id.  Since I'm new to the Accumulo world, I'm not very familiar with this
pattern/approach in table design. But, I will look around now that I have some guidance. 

Overall, I think I need to create a space-time hash of my dataset, but the biggest question
I have is, "what time span do I use?". At the moment, I only have a year's worth of data;
therefore, my MIN_DATE = Jan 01 and MAX_DATE = Dec 31. But we obviously expect this data to
continue to grow, so we would want to account for additional data in the future.
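
One way to make the time-span question concrete is to split each timestamp into a period number and an offset within that period, so that only the offset is hashed and later data simply falls into later periods. The sketch below is illustrative only: the one-year span and the names SPAN_SECONDS and split_time are assumptions, not something proposed in this thread.

```python
# Assumed span: one year of seconds (366 days so a leap year still fits).
SPAN_SECONDS = 366 * 24 * 3600


def split_time(ts_seconds, span=SPAN_SECONDS):
    """Split an epoch timestamp into (period number, offset within period).

    The offset is the "time modulo timespan" value that would feed a
    space-time hash; the period number grows as new data arrives.
    """
    if span <= 0:
        raise ValueError("span must be positive")
    return divmod(ts_seconds, span)


if __name__ == "__main__":
    period, offset = split_time(1371608043)  # the Date header of this message
    print(period, offset)
```

With this split, a second year of data lands in the next period rather than forcing a wider MIN_DATE/MAX_DATE range. Whether the period number belongs in the row ID or elsewhere is a separate design choice.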

Thanks again for all of the guidance. I will digest some of the comments and will report back.

Adam

-----Original Message-----
From: Kurt Christensen [mailto:hoodel@hoodel.com] 
Sent: Tuesday, June 18, 2013 8:54 PM
To: user@accumulo.apache.org
Subject: [External] Re: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)


An effective optimization strategy will be largely influenced by the nature of your data.

You say you have point data. Are time series geographically fixed, with only the time dimension
changing? ... or are the time series moving in space-time?

I was going to suggest a 3-D approach, bit-interleaving your space and time [modulo timespan]
together (or point-tree, or octree, or k-d trie, or r-d trie). The trick there is to pick
a time span large enough so that any interval you query is small relative to the time span,
but small enough so that you don't waste a bunch (up to an eighth) of your usable hash values
with no useful time data (i.e. populate your most significant bits). This would work if your
data were geographically fixed, but changing only in time. If your time span is geologic,
you might want to use a logarithmic time scale.
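
A minimal sketch of the bit-interleaving idea, assuming longitude, latitude, and time modulo the time span are each quantized to the same number of bits. The function names, the 10-bit default, and the coordinate bounds are illustrative assumptions, not part of the original suggestion.

```python
def quantize(value, lo, hi, bits):
    """Map value in [lo, hi) to an integer cell in [0, 2**bits)."""
    if not lo <= value < hi:
        raise ValueError("value out of range")
    cell = int((value - lo) / (hi - lo) * (1 << bits))
    return min(cell, (1 << bits) - 1)  # guard against float rounding at the top edge


def interleave3(x, y, t, bits):
    """Interleave the bits of x, y, t, most significant bit first."""
    code = 0
    for i in range(bits - 1, -1, -1):
        code = (code << 3) | (((x >> i) & 1) << 2) | (((y >> i) & 1) << 1) | ((t >> i) & 1)
    return code


def space_time_hash(lon, lat, ts, time_span, bits=10):
    """Hash a point in space-time; time is taken modulo time_span."""
    x = quantize(lon, -180.0, 180.0, bits)
    y = quantize(lat, -90.0, 90.0, bits)
    t = quantize(ts % time_span, 0, time_span, bits)
    return interleave3(x, y, t, bits)
```

Because the most significant bits of all three dimensions come first, values that share a long prefix fall in the same space-time cell, which is what makes prefix or range scans over the hash useful.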

If you have time series (identified by <id>) moving in space-time, then I would add
an indirection. Use the space-time hash to determine the IDs intersecting your zone and then
query again, using the IDs to pull out the time series, filtering with your iterator, perhaps
using the native timestamp field.
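
The two-step lookup can be sketched with plain in-memory structures standing in for the two tables. This is not Accumulo API code: the sorted list of (hash, id) pairs plays the role of an index table, the dict plays the role of an ID-keyed series table, and all names are hypothetical.

```python
from bisect import bisect_left


def ids_in_ranges(index, ranges):
    """Step 1: collect the IDs whose space-time hash falls in any range.

    index  -- sorted list of (hash, series_id) pairs with integer hashes
    ranges -- list of inclusive (lo, hi) hash ranges covering the query zone
    """
    found = set()
    for lo, hi in ranges:
        start = bisect_left(index, (lo,))      # first entry with hash >= lo
        stop = bisect_left(index, (hi + 1,))   # first entry with hash > hi
        found.update(series_id for _, series_id in index[start:stop])
    return found


def fetch_series(series_table, ids, t_min, t_max):
    """Step 2: pull each matching series and keep points inside the time window.

    series_table -- dict mapping series_id to a list of (timestamp, lon, lat)
    """
    return {
        sid: [p for p in series_table[sid] if t_min <= p[0] <= t_max]
        for sid in sorted(ids)
    }
```

In a real deployment the first step would be a scan over the index table and the second a batch lookup by ID, with the time filter applied server-side where possible.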

I hope that helps. Good luck.

Kurt

BTW: 50% filtering isn't really that inefficient. - kkc


On 6/18/13 12:36 AM, Jared Winick wrote:
> Have you considered a "geohash" of all 3 dimensions together and using 
> that as the RowID? I have never implemented a geohash exactly, but I 
> do know it is possible to build a z-order curve on more than 2 
> dimensions, which may be what you want considering that it sounds like 
> all your queries are in 3-dimensions.
>
>
> On Mon, Jun 17, 2013 at 7:56 PM, Iezzi, Adam [USA] <iezzi_adam@bah.com 
> <mailto:iezzi_adam@bah.com>> wrote:
>
>     I've been asked by my client to store a dataset which contains a
>     time series and geospatial coordinates (points) in Accumulo. At
>     the moment, we have very dense data stored in Accumulo using the
>     following table schema:
>
>     Row ID: <geohash>_<reverse timestamp>
>
>     Family: <id>
>
>     Qualifier: attribute
>
>     Value: <value>
>
>     We are salting our RowIDs with a geohash to prevent hot spotting.
>     When we query the data, we use a prefix scan (center tile and
>     eight neighbors), then using an Iterator to filter out the
>     outliers (points and time). Unfortunately, we've noticed some
>     performance issues with this approach in that it seems as though the
>     initial prefix scan brings back a ton of data, forcing the
>     iterators to filter out a significant amount of outliers. E.g.
>     more than 50% is being filtered out, which seems inefficient to
>     us. Unfortunately for us, our users will always query by space and
>     time, making them equally important for each query. Because of the
>     time series component to our data, we're often bringing back a
>     significant amount of data for each given point. Each point can
>     have ten entries due to the time series, making our data set very
>     very dense.
>
>     The following are some options we're considering:
>
>     1. Salt a master table with an ID rather than the geohash
>     <id>_<reverse timestamp>, and then create a spatial index table.
>     If we choose this option, I assume we would scan the index first,
>     then use a batch scanner with the ID from the first query.
>     Unfortunately, I still see us filtering out a significant amount
>     of data using this approach.
>
>     2. Keep the table design as is, and maybe apply a RegExFilter via a
>     custom Iterator.
>
>     3. Do something completely different, such as use a Column Family
>     and the temporal aspect of the dataset together in some way.
>
>     Any advice or guidance would be greatly appreciated.
>
>     Thank you,
>
>     Adam
>
>
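
For reference, the <geohash>_<reverse timestamp> row ID in the quoted schema can be sketched as below. The choice of Java's Long.MAX_VALUE as the subtraction base and the zero-padded decimal encoding are assumptions; the thread does not say how the reverse timestamp is actually encoded.

```python
MAX_TS = 2**63 - 1         # assumed base: Java's Long.MAX_VALUE
WIDTH = len(str(MAX_TS))   # pad to a fixed width so string order matches numeric order


def row_id(geohash, ts_millis):
    """Build '<geohash>_<reverse timestamp>' so newer entries sort first."""
    if not 0 <= ts_millis <= MAX_TS:
        raise ValueError("timestamp out of range")
    reverse = MAX_TS - ts_millis
    return f"{geohash}_{reverse:0{WIDTH}d}"


if __name__ == "__main__":
    keys = [row_id("dqcjq", ts) for ts in (1000, 3000, 2000)]
    for key in sorted(keys):   # lexicographic order, as in a sorted key-value store
        print(key)
```

Within one geohash prefix, sorting these keys lexicographically returns the most recent timestamp first, which is the usual reason for storing a reversed timestamp.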

-- 

Kurt Christensen
P.O. Box 811
Westminster, MD 21158-0811

------------------------------------------------------------------------
"One of the penalties for refusing to participate in politics is that you end up being governed
by your inferiors."
--- Plato
