accumulo-user mailing list archives

From Jamie Stephens <...@morphism.com>
Subject Re: [External] Re: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)
Date Tue, 25 Jun 2013 13:08:01 GMT
Adam & Co.,

Sorry to chime in late here.

One of our projects has similar requirements: queries based on time-space
constraints. (Tracking a particular entity through time and space is a
different requirement.)

We've used the following scheme with decent results.

Our basic approach is to use a 3D quadtree based on lat, lon, and time.
Longitude and time are first transformed so that a quadtree key prefix
represents (approximately) a cube.  Alternatively, roll your
own quadtree algorithm to give similar results.  So some number of prefix
bytes of a quadtree key represents an approximate time-space cube of
dimensions 1km x 1km x 1day.  Pick your time unit.  Another variation: use
a 3D geohash instead of a quadtree.
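
Roughly, the key construction might look like the following (the time span,
bit depth, and class name here are just illustrative, not exactly what we run):

// Build a 3D space-time key by normalizing lat, lon, and time to [0,1)
// and interleaving their bits (a 3D z-order / Morton code). A prefix of
// the resulting key identifies an approximate space-time cube.
public final class SpaceTimeKey {

    private static final long TIME_SPAN_MS = 365L * 24 * 60 * 60 * 1000; // illustrative span
    private static final int BITS_PER_DIM = 21;                          // 3 * 21 = 63 bits

    /** Interleave lat, lon, and time into a single hex-encoded key. */
    public static String encode(double lat, double lon, long timestampMs) {
        long x = quantize((lon + 180.0) / 360.0);                        // longitude -> [0,1)
        long y = quantize((lat + 90.0) / 180.0);                         // latitude  -> [0,1)
        long t = quantize((timestampMs % TIME_SPAN_MS) / (double) TIME_SPAN_MS);

        long key = 0L;
        for (int i = BITS_PER_DIM - 1; i >= 0; i--) {                    // most significant bits first
            key = (key << 1) | ((x >>> i) & 1L);
            key = (key << 1) | ((y >>> i) & 1L);
            key = (key << 1) | ((t >>> i) & 1L);
        }
        return String.format("%016x", key);
    }

    private static long quantize(double v) {
        double clamped = Math.min(Math.max(v, 0.0), 0.9999999);
        return (long) (clamped * (1L << BITS_PER_DIM));
    }
}

Two points that share a long key prefix are close in both space and time, which
is what makes the prefix usable as a row ID.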

Then use the first N bytes of the key as the row ID and the remaining bytes
for the column qualifier.  Rationale: Sometimes there is virtue in keeping
points in a cube on the same tablet server.  (Or you might want to, say,
use only spatial key prefixes as row IDs.  Lots of flavors to consider.)
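
Concretely, a write under that split might look like this, using the
SpaceTimeKey sketch above and the 1.5-style BatchWriter API (the table name
and N are illustrative):

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class SpaceTimeWriter {
    /** Write one point, splitting the space-time key at position n. */
    static void writePoint(Connector conn, String table, int n,
                           double lat, double lon, long ts, byte[] payload) throws Exception {
        String key = SpaceTimeKey.encode(lat, lon, ts);
        String rowId = key.substring(0, n);    // keeps each space-time cube on one tablet
        String colQual = key.substring(n);     // remaining precision within the cube

        Mutation m = new Mutation(new Text(rowId));
        m.put(new Text("point"), new Text(colQual), new Value(payload));

        BatchWriter writer = conn.createBatchWriter(table, new BatchWriterConfig());
        writer.addMutation(m);
        writer.close();
    }
}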

Disadvantages: You have to pick N and the time unit up front.  N and the
time unit are the basic index tuning parameters.  In our applications,
setting those parameters isn't too hard because we understand the data and
its uses pretty well.  However, as you've suggested, hotspots due to
concentrations can still be a problem.  We try turning up N to compensate.

Variation: Use the military grid reference system (MGRS) grid zone
designator and square identifier as row ID and a quadtree-code numerical
location for the column qualifier.  Etc.

I'll see if I can get an example on github.

--Jamie


On Mon, Jun 24, 2013 at 9:47 AM, Jim Klucar <klucar@gmail.com> wrote:

> Adam,
>
> Usually with geo-queries points of interest are pretty dense (as you've
> stated is your case). The indexing typically used (geohash or z-order) is
> efficient for points spread evenly across the earth, which isn't the
> typical case (think population density). One method I've heard (never
> actually tried myself) is to store points as distances from known
> locations. You can then find points close to each other by finding similar
> distances to 2 or 3 known locations. The known locations can then be
> created and distributed based on your expected point density allowing even
> dense areas to be spread evenly across a cluster.
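
A rough sketch of that idea in Java (the reference points and the bucket size
are made up; points with similar bucketed distances to all landmarks are near
each other):

// Index each point by its bucketed distances to a few known reference locations.
public final class LandmarkIndex {

    private static final double[][] LANDMARKS = {     // {lat, lon} of known locations
        {38.90, -77.04}, {40.71, -74.01}, {34.05, -118.24}
    };
    private static final double BUCKET_KM = 10.0;

    /** Build an index key such as "0042_0367_2310" from bucketed distances. */
    public static String key(double lat, double lon) {
        StringBuilder sb = new StringBuilder();
        for (double[] lm : LANDMARKS) {
            long bucket = (long) (haversineKm(lat, lon, lm[0], lm[1]) / BUCKET_KM);
            if (sb.length() > 0) sb.append('_');
            sb.append(String.format("%04d", bucket));
        }
        return sb.toString();
    }

    private static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double r = 6371.0;
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * r * Math.asin(Math.sqrt(a));
    }
}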
>
> There's plenty of math, table design, and query design work to get it all
> working, but I think it's feasible.
>
> Jim
>
>
> On Wed, Jun 19, 2013 at 1:07 AM, Kurt Christensen <hoodel@hoodel.com> wrote:
>
>>
>> To clarify: By 'geologic', I was referring to time-scale (like 100s of
>> millions of years, with more detail near present, suggesting a log scale).
>>
>> Your use of id is surprising. Maybe I don't understand what you're trying
>> to do.
>> From what I was thinking, since you made reference to time-series, no
>> efficiency is gained through this id. If, instead, the id were for a whole
>> time-series and not each individual point, then for each timestamp you
>> would have X(id, timestamp), Y(id, timestamp), and whatever else (id,
>> timestamp) already organized as a time series ... all with the same row id.
>> bithash+id, INDEX, id, ... - (query to get a list of IDs intersecting
>> your space-time region)
>> id, POSITION, XY, vis, TIMESTAMP, (x,y) - (use iterators to filter these
>> points)
>> id, MEAS, name, vis, TIMESTAMP, named_measurement
>>
>> Alternately, if you wanted rich points, and not individual values:
>> bithash+id, INDEX, id, ... - (query to get a list of IDs intersecting
>> your space-time region)
>> id, SAMPLE, (x,y), vis, TIMESTAMP, sampleObject(JSON?) - (all in one
>> column)
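
A rough sketch of those two mutations with the Accumulo Java API (the helper
arguments and the visibility label are placeholders for whatever the bithash
and labels actually are):

import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;
import org.apache.hadoop.io.Text;

public class RichPointSchema {
    /** Index entry: row = bithash+id, family = INDEX, qualifier = id. */
    static Mutation indexEntry(String bithash, String id, String vis, long ts) {
        Mutation m = new Mutation(new Text(bithash + id));
        m.put(new Text("INDEX"), new Text(id), new ColumnVisibility(vis), ts,
              new Value(new byte[0]));
        return m;
    }

    /** Sample entry: row = id, family = SAMPLE, qualifier = "(x,y)", value = JSON blob. */
    static Mutation sampleEntry(String id, double x, double y, String vis, long ts,
                                byte[] json) {
        Mutation m = new Mutation(new Text(id));
        m.put(new Text("SAMPLE"), new Text("(" + x + "," + y + ")"),
              new ColumnVisibility(vis), ts, new Value(json));
        return m;
    }
}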
>>
>> If this is way off base from what you are trying to do, please ignore.
>>
>> Kurt
>>
>> -----
>>
>>
>> On 6/18/13 10:14 PM, Iezzi, Adam [USA] wrote:
>>
>>> All,
>>>
>>> Thank you for all of the replies. To answer some of the questions:
>>>
>>> Q: You say you have point data. Are time series geographically fixed,
>>> with only the time dimension changing? ... or are the time series moving in
>>> space-time?
>>> A: The time series will be moving in space-time; therefore, the dataset
>>> is geologic.
>>>
>>> Q: If you have time series (identified by <id>) moving in space-time,
>>> then I would add an indirection.
>>> A: Our dataset is very similar to what you describe. Each geospatial
>>> point and time stamp is defined by an id.  Since I'm new to the Accumulo
>>> world, I'm not very familiar with this pattern/approach in table design.
>>> But, I will look around now that I have some guidance.
>>>
>>> Overall, I think I need to create a space-time hash of my dataset, but
>>> the biggest question I have is, "what time span do I use?". At the moment,
>>> I only have a year's worth of data; therefore, my MIN_DATE = Jan 01 and
>>> MAX_DATE = Dec 31. But we obviously expect this data to continue to grow;
>>> therefore, we would want to account for additional data in the future.
>>>
>>> Thanks again for all of the guidance. I will digest some of the comments
>>> and will report back.
>>>
>>> Adam
>>>
>>> -----Original Message-----
>>> From: Kurt Christensen [mailto:hoodel@hoodel.com]
>>> Sent: Tuesday, June 18, 2013 8:54 PM
>>> To: user@accumulo.apache.org
>>> Subject: [External] Re: Storing, Indexing, and Querying data in Accumulo
>>> (geo + timeseries)
>>>
>>>
>>> An effective optimization strategy will be largely influenced by the
>>> nature of your data.
>>>
>>> You say you have point data. Are time series geographically fixed, with
>>> only the time dimension changing? ... or are the time series moving in
>>> space-time?
>>>
>>> I was going to suggest a 3-D approach, bit-interleaving your space and
>>> time [modulo timespan] together (or point-tree, or octree, or k-d trie,
>>> or r-d trie). The trick there is to pick a time span large enough so that
>>> any interval you query is small relative to the time span, but small enough
>>> so that you don't waste a bunch (up to an eighth) of your usable hash
>>> values with no useful time data (i.e. populate your most significant bits).
>>> This would work if your data were geographically fixed, but changing only
>>> in time. If your time span is geologic, you might want to use a logarithmic
>>> time scale.
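
A small sketch of the two time mappings mentioned here, a fixed span and a
logarithmic scale for geologic time (the constants are illustrative):

public final class TimeScale {

    private static final long SPAN_MS = 10L * 365 * 24 * 60 * 60 * 1000;  // ten years

    /** Linear: position of ts within a fixed span, in [0,1). */
    public static double linear(long tsMs) {
        return (tsMs % SPAN_MS) / (double) SPAN_MS;
    }

    /** Logarithmic: age before "now", compressed so recent times get more resolution. */
    public static double logarithmic(long tsMs, long nowMs, long maxAgeMs) {
        long age = Math.max(1L, nowMs - tsMs);
        double v = 1.0 - Math.log(age) / Math.log(maxAgeMs);
        return Math.min(1.0, Math.max(0.0, v));   // 1.0 ~ now, 0.0 ~ maxAge ago
    }
}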
>>>
>>> If you have time series (identified by <id>) moving in space-time, then I
>>> would add an indirection. Use the space-time hash to determine the IDs
>>> intersecting your zone and then query again, using the IDs to pull out the
>>> time series, filtering with your iterator, perhaps using the native
>>> timestamp field.
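
A rough sketch of that two-step query (table names, the thread count, and the
assumption that the index stores the series ID in the column qualifier are
illustrative):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class TwoStepQuery {
    /** Step 1: scan the index by space-time hash prefix; step 2: fetch the series by ID. */
    static void query(Connector conn, String indexTable, String dataTable,
                      String hashPrefix) throws Exception {
        Scanner index = conn.createScanner(indexTable, Authorizations.EMPTY);
        index.setRange(Range.prefix(hashPrefix));

        List<Range> idRanges = new ArrayList<Range>();
        for (Map.Entry<Key, Value> e : index) {
            // the index entry carries the series ID in the column qualifier
            idRanges.add(new Range(e.getKey().getColumnQualifier().toString()));
        }
        if (idRanges.isEmpty()) {
            return;
        }

        BatchScanner data = conn.createBatchScanner(dataTable, Authorizations.EMPTY, 4);
        data.setRanges(idRanges);
        for (Map.Entry<Key, Value> e : data) {
            // filter to the exact time window here, e.g. with an iterator or client side
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
        data.close();
    }
}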
>>>
>>> I hope that helps. Good luck.
>>>
>>> Kurt
>>>
>>> BTW: 50% filtering isn't really that inefficient. - kkc
>>>
>>>
>>> On 6/18/13 12:36 AM, Jared Winick wrote:
>>>
>>>
>>>> Have you considered a "geohash" of all 3 dimensions together and using
>>>> that as the RowID? I have never implemented a geohash exactly, but I
>>>> do know it is possible to build a z-order curve on more than 2
>>>> dimensions, which may be what you want considering that it sounds like
>>>> all your queries are in 3 dimensions.
>>>>
>>>>
>>>> On Mon, Jun 17, 2013 at 7:56 PM, Iezzi, Adam [USA] <iezzi_adam@bah.com> wrote:
>>>>
>>>>      I've been asked by my client to store a dataset which contains a
>>>>      time series and geospatial coordinates (points) in Accumulo. At
>>>>      the moment, we have very dense data stored in Accumulo using the
>>>>      following table schema:
>>>>
>>>>      Row ID: <geohash>_<reverse timestamp>
>>>>
>>>>      Family: <id>
>>>>
>>>>      Qualifier: attribute
>>>>
>>>>      Value: <value>
>>>>
>>>>      We are salting our RowIDs with a geohash to prevent hot spotting.
>>>>      When we query the data, we use a prefix scan (center tile and
>>>>      eight neighbors), then use an Iterator to filter out the
>>>>      outliers (points and time). Unfortunately, we've noticed some
>>>>      performance issues with this approach, in that it seems the
>>>>      initial prefix scan brings back a ton of data, forcing the
>>>>      iterators to filter out a significant number of outliers, e.g.
>>>>      more than 50% is being filtered out, which seems inefficient to
>>>>      us. Unfortunately for us, our users will always query by space
>>>>      and time, making them equally important for each query. Because
>>>>      of the time series component to our data, we're often bringing
>>>>      back a significant amount of data for each given point. Each
>>>>      point can have ten entries due to the time series, making our
>>>>      data set very, very dense.
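
For reference, a row ID of that shape is usually built along these lines; the
geohash(lat, lon, precision) helper is hypothetical (e.g. from a geohash
library), and subtracting from Long.MAX_VALUE makes newer entries sort first
within a tile:

static String rowId(double lat, double lon, long timestampMs) {
    String geohash = geohash(lat, lon, 7);                   // hypothetical helper
    long reverseTs = Long.MAX_VALUE - timestampMs;           // newest first
    return geohash + "_" + String.format("%019d", reverseTs);
}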
>>>>
>>>>      The following are some options we're considering:
>>>>
>>>>      1. Salt a master table with an ID rather than the geohash (i.e.
>>>>      <id>_<reverse timestamp>), and then create a spatial index table.
>>>>      If we choose this option, I assume we would scan the index first,
>>>>      then use a batch scanner with the ID from the first query.
>>>>      Unfortunately, I still see us filtering out a significant amount
>>>>      of data using this approach.
>>>>
>>>>      2. Keep the table design as is, and maybe apply a RegExFilter via
>>>>      a custom Iterator (see the sketch after this list).
>>>>
>>>>      3. Do something completely different, such as use a Column Family
>>>>      and the temporal aspect of the dataset together in some way.
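
For option 2, attaching Accumulo's RegExFilter as a scan-time iterator might
look like this (the priority, iterator name, and example geohashes are
illustrative):

// Uses org.apache.accumulo.core.client.IteratorSetting and
// org.apache.accumulo.core.iterators.user.RegExFilter.
IteratorSetting regex = new IteratorSetting(30, "geoTimeFilter", RegExFilter.class);
RegExFilter.setRegexs(regex,
    "^(dqcjq|dqcjr|dqcjw).*",   // row regex: the center tile and neighbor geohashes
    null, null, null,           // no column family / qualifier / value constraints
    true);                      // true = match if any supplied term matches
scanner.addScanIterator(regex); // scanner is an existing Scanner or BatchScanner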
>>>>
>>>>      Any advice or guidance would be greatly appreciated.
>>>>
>>>>      Thank you,
>>>>
>>>>      Adam
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>> --
>>
>> Kurt Christensen
>> P.O. Box 811
>> Westminster, MD 21158-0811
>>
>> ------------------------------------------------------------------------
>> "One of the penalties for refusing to participate in politics is that you
>> end up being governed by your inferiors."
>> --- Plato
>>
>
>
