# accumulo-user mailing list archives

From Jim Klucar <klu...@gmail.com>
Subject Re: [External] Re: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)
Date Mon, 24 Jun 2013 14:47:27 GMT
```
Adam,

Usually with geo-queries points of interest are pretty dense (as you've
stated is your case). The indexing typically used (geohash or z-order) is
efficient for points spread evenly across the earth, which isn't the
typical case (think population density). One method I've heard (never
actually tried myself) is to store points as distances from known
locations. You can then find points close to each other by finding similar
distances to 2 or 3 known locations. The known locations can then be
created and distributed based on your expected point density allowing even
dense areas to be spread evenly across a cluster.

There's plenty of math, table design, and query design work to get it all
working, but I think it's feasible.

Jim

On Wed, Jun 19, 2013 at 1:07 AM, Kurt Christensen <hoodel@hoodel.com> wrote:

>
> To clarify: By 'geologic', I was referring to time-scale (like 100s of
> millions of years, with more detail near present, suggesting a log scale).
>
> Your use of id is surprising. Maybe I don't understand what you're trying
> to do.
> From what I was thinking, since you made reference to time-series, no
> efficiency is gained through this id. If, instead the id were for a whole
> time-series, and not each individual point then for each timestamp, you
> would have X(id, timestamp), Y(id, timestamp) and whatever else (id,
> timestamp) already organized as time series. ... all with the same row id.
> bithash+id, INDEX, id, ... - (query to get a list of IDs intersecting your
> space-time region)
> id, POSITION, XY, vis, TIMESTAMP, (x,y) - (use iterators to filter these
> points)
> id, MEAS, name, vis, TIMESTAMP, named_measurement
>
> Alternately, if you wanted rich points, and not individual values:
> bithash+id, INDEX, id, ... - (query to get a list of IDs intersecting your
> space-time region)
> id, SAMPLE, (x,y), vis, TIMESTAMP, sampleObject(JSON?) - (all in one
> column)
>
> If this is way off base from what you are trying to do, please ignore.
>
> Kurt
>
> -----
>
>
> On 6/18/13 10:14 PM, Iezzi, Adam [USA] wrote:
>
>> All,
>>
>> Thank you for all of the replies. To answer some of the questions:
>>
>> Q: You say you have point data. Are time series geographically fixed,
>> with only the time dimension changing? ... or are the time series moving in
>> space-time?
>> A: The time series will be moving in space-time; therefore, the dataset
>> is geologic.
>>
>> Q: If you have time series (identified by <id>) moving in space-time, then
>> I would add an indirection.
>> A: Our dataset is very similar to what you describe. Each geospatial
>> point and time stamp is defined by an id.  Since I'm new to the Accumulo
>> world, I'm not very familiar with this pattern/approach in table design.
>> But, I will look around now that I have some guidance.
>>
>> Overall, I think I need to create a space-time hash of my dataset, but
>> the biggest question I have is, "what time span do I use?". At the moment,
>> I only have a year's worth of data; therefore, my MIN_DATE = Jan 01 and
>> MAX_DATE = Dec 31. But we obviously expect this data to continue to grow;
>> therefore, would want to account for additional data in the future.
>>
>> Thanks again for all of the guidance. I will digest some of the comments
>> and will report back.
>>
>> Adam
>>
>> -----Original Message-----
>> From: Kurt Christensen [mailto:hoodel@hoodel.com]
>> Sent: Tuesday, June 18, 2013 8:54 PM
>> To: user@accumulo.apache.org
>> Subject: [External] Re: Storing, Indexing, and Querying data in Accumulo
>> (geo + timeseries)
>>
>>
>> An effective optimization strategy will be largely influenced by the
>> nature of your data.
>>
>> You say you have point data. Are time series geographically fixed, with
>> only the time dimension changing? ... or are the time series moving in
>> space-time?
>>
>> I was going to suggest a 3-D approach, bit-interleaving your space and
>> time [modulo timespan] together (or point-tree, or octree, or k-d trie,
>> or r-d trie). The trick there is to pick a time span large enough so that
>> any interval you query is small relative to the time span, but small enough
>> so that you don't waste a bunch (up to an eighth) of your usable hash
>> values with no useful time data (i.e. populate your most significant bits).
>> This would work if your data were geographically fixed, but changing only
>> in time. If your time span is geologic, you might want to use a logarithmic
>> time scale.
>>
>> If you have time series (identified by <id>) moving in space-time, then I
>> would add an indirection. Use the space-time hash to determine the IDs
>> intersecting your zone and then query again, using the IDs to pull out the
>> time series, filtering with your iterator, perhaps using the native
>> timestamp field.
>>
>> I hope that helps. Good luck.
>>
>> Kurt
>>
>> BTW: 50% filtering isn't really that inefficient. - kkc
>>
>>
>> On 6/18/13 12:36 AM, Jared Winick wrote:
>>
>>
>>> Have you considered a "geohash" of all 3 dimensions together and using
>>> that as the RowID? I have never implemented a geohash exactly, but I
>>> do know it is possible to build a z-order curve on more than 2
>>> dimensions, which may be what you want considering that it sounds like
>>> all your queries are in 3-dimensions.
>>>
>>>
>>> On Mon, Jun 17, 2013 at 7:56 PM, Iezzi, Adam [USA]<iezzi_adam@bah.com
>>> <mailto:iezzi_adam@bah.com>>  wrote:
>>>
>>>      I've been asked by my client to store a dataset which contains a
>>>      time series and geospatial coordinates (points) in Accumulo. At
>>>      the moment, we have very dense data stored in Accumulo using the
>>>      following table schema:
>>>
>>>      Row ID: <geohash>_<reverse timestamp>
>>>
>>>      Family: <id>
>>>
>>>      Qualifier: attribute
>>>
>>>      Value: <value>
>>>
>>>      We are salting our RowIDs with a geohash to prevent hot spotting.
>>>      When we query the data, we use a prefix scan (center tile and
>>>      eight neighbors), then use an Iterator to filter out the
>>>      outliers (points and time). Unfortunately, we've noticed some
>>>      performance issues with this approach: the
>>>      initial prefix scan seems to bring back a ton of data, forcing the
>>>      iterators to filter out a significant number of outliers. E.g.
>>>      more than 50% is being filtered out, which seems inefficient to
>>>      us. Unfortunately for us, our users will always query by space and
>>>      time, making them equally important for each query. Because of the
>>>      time series component to our data, we're often bringing back a
>>>      significant amount of data for each given point. Each point can
>>>      have ten entries due to the time series, making our data set very
>>>      very dense.
>>>
>>>      The following are some options we're considering:
>>>
>>>      1. Salt a master table with an ID rather than the geohash
>>>      <id>_<reverse timestamp>, and then create a spatial index table.
>>>      If we choose this option, I assume we would scan the index first,
>>>      then use a batch scanner with the ID from the first query.
>>>      Unfortunately, I still see us filtering out a significant amount
>>>      of data using this approach.
>>>
>>>      2. Keep the table design as is, and maybe apply a RegExFilter via a
>>>      custom Iterator.
>>>
>>>      3. Do something completely different, such as use a Column Family
>>>      and the temporal aspect of the dataset together in some way.
>>>
>>>      Any advice or guidance would be greatly appreciated.
>>>
>>>      Thank you,
>>>
>>>      Adam
>>>
>>>
>>>
>>>
>>
>>
>
> --
>
> Kurt Christensen
> P.O. Box 811
> Westminster, MD 21158-0811
>
> ------------------------------------------------------------------------
> "One of the penalties for refusing to participate in politics is that you
> end up being governed by your inferiors."
> --- Plato
>

```
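The 3-D approach suggested in the thread (bit-interleaving space and time, with time taken modulo a chosen span) could be sketched roughly as follows. This is only an illustration, not anyone's actual implementation and not an Accumulo API: the function names, the 21 bits per dimension, the fixed-width hex encoding, and the `span_start` / `span_seconds` parameters are all assumptions made for the example.

```python
BITS = 21  # bits per dimension (assumed); 3 * 21 = 63 bits fits a signed long


def quantize(value, lo, hi, bits=BITS):
    """Map a value in [lo, hi] to an integer cell index in [0, 2**bits)."""
    if not lo <= value <= hi:
        raise ValueError(f"{value} is outside [{lo}, {hi}]")
    cells = 1 << bits
    idx = int((value - lo) / (hi - lo) * cells)
    return min(idx, cells - 1)  # clamp value == hi into the last cell


def interleave3(a, b, c, bits=BITS):
    """Interleave the bits of a, b, c (most significant first) into one z-order key."""
    key = 0
    for i in range(bits - 1, -1, -1):
        key = (key << 3) | (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1)
    return key


def deinterleave3(key, bits=BITS):
    """Inverse of interleave3: recover the three cell indices from a key."""
    a = b = c = 0
    for i in range(bits):
        c |= ((key >> (3 * i)) & 1) << i
        b |= ((key >> (3 * i + 1)) & 1) << i
        a |= ((key >> (3 * i + 2)) & 1) << i
    return a, b, c


def space_time_key(lon, lat, epoch_seconds, span_start, span_seconds):
    """Build a fixed-width hex row-key prefix from longitude, latitude and time.

    Time is folded into [0, span_seconds) with a modulo, as in the
    "[modulo timespan]" suggestion, so data outside the span wraps around.
    """
    t = (epoch_seconds - span_start) % span_seconds
    x = quantize(lon, -180.0, 180.0)
    y = quantize(lat, -90.0, 90.0)
    z = quantize(t, 0, span_seconds)
    # Zero-padded hex sorts lexicographically in the same order as the integer key.
    return format(interleave3(x, y, z), "016x")
```

A key such as `space_time_key(-77.0, 39.5, 1371500000, 1356998400, 365 * 86400)` could then serve as the row ID (or the index-table row ID in the indirection design). Querying a space-time box would still require decomposing the box into ranges of keys, which this sketch does not attempt.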