accumulo-user mailing list archives

From "Iezzi, Adam [USA]" <iezzi_a...@bah.com>
Subject Storing, Indexing, and Querying data in Accumulo (geo + timeseries)
Date Tue, 18 Jun 2013 01:56:52 GMT
I've been asked by my client to store a dataset that contains a time series and geospatial
coordinates (points) in Accumulo. At the moment, we have very dense data stored in Accumulo
using the following table schema:

Row ID:     <geohash>_<reverse timestamp>
Family:     <id>
Qualifier:  attribute
Value:      <value>
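
Concretely, we build row IDs roughly like the sketch below (plain Java, no Accumulo
dependency; the 19-digit zero-padding of the reverse timestamp is just how we keep
lexicographic order matching numeric order, and the exact width is illustrative):

```java
public class RowKey {
    // Build a row ID of the form <geohash>_<reverse timestamp>.
    // Subtracting from Long.MAX_VALUE makes newer timestamps sort first;
    // zero-padding keeps string order consistent with numeric order.
    static String rowId(String geohash, long epochMillis) {
        long reverse = Long.MAX_VALUE - epochMillis;
        return geohash + "_" + String.format("%019d", reverse);
    }

    public static void main(String[] args) {
        String older = rowId("dqcjq", 1371427200000L);
        String newer = rowId("dqcjq", 1371427260000L);
        // Within one geohash prefix, newer entries sort first.
        System.out.println(newer.compareTo(older) < 0); // true
    }
}
```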

We are prefixing our Row IDs with a geohash to spread load and prevent hotspotting. When we
query the data, we use a prefix scan (center tile and eight neighbors), then use an iterator
to filter out the outliers (points outside the query box and times outside the window).
Unfortunately, we've noticed performance issues with this approach: the initial prefix scan
brings back a great deal of data, forcing the iterators to discard a significant fraction of
it. More than 50% is being filtered out, which seems inefficient to us. Our users will always
query by space and time, making the two equally important for every query. Because of the time
series component of our data, we're often bringing back a significant amount of data for each
given point. Each point can have ten entries due to the time series, making our dataset very
dense.
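
For background on why the prefix scan works at all: a standard geohash encoder produces keys
where nearby points share a common prefix, so truncating to the tile precision gives a
contiguous row range. A self-contained sketch of the standard encoding (no Accumulo
dependency):

```java
public class Geohash {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    // Standard geohash encoding: alternately bisect longitude and
    // latitude, interleave the bits, and emit 5 bits per base-32 char.
    static String encode(double lat, double lon, int precision) {
        double latMin = -90, latMax = 90, lonMin = -180, lonMax = 180;
        StringBuilder sb = new StringBuilder();
        boolean evenBit = true; // geohash starts with a longitude bit
        int bit = 0, ch = 0;
        while (sb.length() < precision) {
            if (evenBit) {
                double mid = (lonMin + lonMax) / 2;
                if (lon >= mid) { ch = (ch << 1) | 1; lonMin = mid; }
                else            { ch = ch << 1;       lonMax = mid; }
            } else {
                double mid = (latMin + latMax) / 2;
                if (lat >= mid) { ch = (ch << 1) | 1; latMin = mid; }
                else            { ch = ch << 1;       latMax = mid; }
            }
            evenBit = !evenBit;
            if (++bit == 5) {
                sb.append(BASE32.charAt(ch));
                bit = 0;
                ch = 0;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Nearby points share a long common prefix, which is what the
        // center-tile-plus-neighbors prefix scan relies on.
        System.out.println(Geohash.encode(57.64911, 10.40744, 11)); // u4pruydqqvj
        System.out.println(Geohash.encode(57.64912, 10.40745, 11));
    }
}
```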

The following are some options we're considering:

1. Salt a master table with an ID rather than the geohash, i.e. Row ID = <id>_<reverse timestamp>,
and create a separate spatial index table. If we choose this option, I assume we would scan the
index first, then use a batch scanner with the IDs from the first query. Unfortunately, I still
see us filtering out a significant amount of data with this approach.

2. Keep the table design as is, and perhaps add a RegExFilter via a custom iterator.

3. Do something completely different, such as using the column family and the temporal aspect
of the dataset together in some way.
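
To make option 3 concrete, here is one way the key layout could look if we moved a coarse
time bucket into the column family. The TreeMap below merely simulates Accumulo's sorted
(row, family) key space; the day-bucket granularity, the separator characters, and the
helper names are all illustrative assumptions, not a real Accumulo API:

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class TimeBucketLayout {
    // Row = geohash tile; family = <reverse day bucket>_<id>, so all
    // entries for one tile-and-time-window sit contiguously and can be
    // fetched with a range seek instead of a post-scan filter.
    static String key(String geohash, long dayBucket, String id) {
        long reverseBucket = Long.MAX_VALUE - dayBucket;
        return geohash + "\u0000" + String.format("%019d", reverseBucket) + "_" + id;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put(key("dqcjq", 15873, "a1"), "v1"); // newer day, target tile
        table.put(key("dqcjq", 15872, "a2"), "v2"); // older day, target tile
        table.put(key("dqcjr", 15873, "a3"), "v3"); // different tile

        // Seek only the newest day within tile dqcjq:
        String lo = "dqcjq\u0000" + String.format("%019d", Long.MAX_VALUE - 15873);
        String hi = lo + "\uffff";
        SortedMap<String, String> hit = table.subMap(lo, hi);
        System.out.println(hit.values()); // [v1]
    }
}
```

The point is that the tserver can seek directly to the time window rather than stream every
timestamp for the tile through a filtering iterator.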

Any advice or guidance would be greatly appreciated.

Thank you,

Adam
