accumulo-user mailing list archives

From Kurt Christensen <hoo...@hoodel.com>
Subject Re: [External] Re: Storing, Indexing, and Querying data in Accumulo (geo + timeseries)
Date Tue, 25 Jun 2013 14:52:49 GMT

I thought I might chime in late, too. I think we're talking about the 
same thing, perhaps with different encodings.

Yes. In the bit-interleaving scheme I mentioned, each 3 bits of the hash 
correspond to one level of an octree (a "3D quadtree"). ... and yes, 
there is a trick to picking the time scales right.
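
Concretely, the encoding might look like this (a minimal sketch; the 
21-bit coordinate width and the names are illustrative, not from the 
thread):

    // Interleave three 21-bit coordinates (x, y, t) into one 63-bit
    // key. Each 3-bit group is one octree level: one bit from x, one
    // from y, one from t, most significant level first.
    static long interleave3(long x, long y, long t) {
        long key = 0L;
        for (int i = 20; i >= 0; i--) {
            key = (key << 3)
                | (((x >>> i) & 1L) << 2)
                | (((y >>> i) & 1L) << 1)
                |  ((t >>> i) & 1L);
        }
        return key;
    }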

-- Kurt


On 6/25/13 9:08 AM, Jamie Stephens wrote:
> Adam & Co.,
>
> Sorry to chime in late here.
>
> One of our projects has similar requirements: queries based on 
> time-space constraints. (Tracking a particular entity through time and 
> space is a different requirement.)
>
> We've used the following scheme with decent results.
>
> Our basic approach is to use a 3D quadtree based on lat, lon, and 
> time.  Longitude and time are first transformed so that a quadtree key 
> prefix represents a cube (approximately).  Alternatively, roll your 
> own quadtree algorithm to similar effect.  So some number of 
> prefix bytes of a quadtree key represents an approximate time-space 
> cube of dimensions 1 km x 1 km x 1 day.  Pick your time unit.  Another 
> variation: use a 3D geohash instead of a quadtree.
>
> Then use the first N bytes of the key as the row ID and the remaining 
> bytes for the column qualifier.  Rationale: Sometimes there is virtue 
> in keeping points in a cube on the same tablet server.  (Or you might 
> want to, say, use only spatial key prefixes as row IDs.  Lots of 
> flavors to consider.)
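>
> A minimal sketch of that split, assuming the interleave3() helper 
> sketched near the top of this page, 21-bit coordinates, and N = 4 
> (names and constants are illustrative):
>
>     import java.nio.ByteBuffer;
>     import java.util.Arrays;
>
>     static final int N = 4;                // prefix bytes kept in the row ID
>     static final long EPOCH = 0L;          // start of the fixed time window (ms)
>     static final long SPAN = 365L * 24 * 3600 * 1000;  // window length (ms)
>
>     // Normalize a fraction in [0,1) to a 21-bit integer coordinate.
>     static long quantize(double unit) {
>         return Math.min((long) (unit * (1 << 21)), (1 << 21) - 1);
>     }
>
>     static byte[][] rowAndQualifier(double lat, double lon, long millis) {
>         long x = quantize((lat + 90.0) / 180.0);
>         long y = quantize((lon + 180.0) / 360.0);
>         long t = quantize((millis - EPOCH) / (double) SPAN);
>         byte[] key = ByteBuffer.allocate(8).putLong(interleave3(x, y, t)).array();
>         return new byte[][] {
>             Arrays.copyOfRange(key, 0, N),          // row ID: the ~cube prefix
>             Arrays.copyOfRange(key, N, key.length)  // column qualifier: the rest
>         };
>     }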
>
> Disadvantages: You have to pick N and the time unit up front.  N and 
> the time unit are the basic index tuning parameters.  In our 
> applications, setting those parameters isn't too hard because we 
> understand the data and its uses pretty well.  However, as you've 
> suggested, hotspots due to concentrations can still be a problem; we 
> turn up N to compensate.
>
> Variation: Use the military grid reference system (MGRS) grid zone 
> designator and square identifier as the row ID and a quadtree-coded 
> numeric location for the column qualifier.  Etc.
>
> I'll see if I can get an example on github.
>
> --Jamie
>
>
> On Mon, Jun 24, 2013 at 9:47 AM, Jim Klucar <klucar@gmail.com> wrote:
>
>     Adam,
>
>     Usually with geo-queries points of interest are pretty dense (as
>     you've stated is your case). The indexing typically used (geohash
>     or z-order) is efficient for points spread evenly across the
>     earth, which isn't the typical case (think population density).
>     One method I've heard (never actually tried myself) is to store
>     points as distances from known locations. You can then find points
>     close to each other by finding similar distances to 2 or 3 known
>     locations. The known locations can then be created and distributed
>     based on your expected point density allowing even dense areas to
>     be spread evenly across a cluster.
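>
>     A rough sketch of that idea (the reference points, 10 km bucket
>     size, and helper names are all illustrative):
>
>         // Key a point by its quantized distances to a few fixed
>         // reference locations; nearby points get similar distance
>         // tuples, and the reference points can be placed to match
>         // the expected density.
>         static String distanceKey(double lat, double lon, double[][] refs) {
>             StringBuilder sb = new StringBuilder();
>             for (double[] ref : refs) {
>                 double dKm = haversineKm(lat, lon, ref[0], ref[1]);
>                 sb.append(String.format("%05d_", (int) (dKm / 10.0))); // 10 km buckets
>             }
>             return sb.toString();
>         }
>
>         static double haversineKm(double la1, double lo1, double la2, double lo2) {
>             double p = Math.PI / 180.0;
>             double a = 0.5 - Math.cos((la2 - la1) * p) / 2
>                      + Math.cos(la1 * p) * Math.cos(la2 * p)
>                        * (1 - Math.cos((lo2 - lo1) * p)) / 2;
>             return 12742.0 * Math.asin(Math.sqrt(a)); // 2 * mean Earth radius (km)
>         }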
>
>     There's plenty of math, table design, and query design work to get
>     it all working, but I think it's feasible.
>
>     Jim
>
>
>     On Wed, Jun 19, 2013 at 1:07 AM, Kurt Christensen
>     <hoodel@hoodel.com> wrote:
>
>
>         To clarify: By 'geologic', I was referring to time-scale (like
>         100s of millions of years, with more detail near present,
>         suggesting a log scale).
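>
>         For example, a minimal log-scale bucketing sketch (the base
>         and the unit are illustrative):
>
>             // Bucket "years before present" logarithmically: recent
>             // time gets fine buckets, deep time gets coarse ones.
>             static int logTimeBucket(double yearsBeforePresent) {
>                 return (int) Math.floor(Math.log1p(yearsBeforePresent)
>                                         / Math.log(2.0));
>             }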
>
>         Your use of id is surprising. Maybe I don't understand what
>         you're trying to do.
>         From what I was thinking, since you made reference to
>         time series, no efficiency is gained through this id. If,
>         instead, the id were for a whole time series, and not each
>         individual point, then for each timestamp you would have X(id,
>         timestamp), Y(id, timestamp), and whatever else (id, timestamp)
>         already organized as time series ... all with the same row id.
>         Each line below is (row, family, qualifier, visibility,
>         timestamp, value):
>
>         bithash+id, INDEX, id, ... - (query to get a list of IDs
>         intersecting your space-time region)
>         id, POSITION, XY, vis, TIMESTAMP, (x,y) - (use iterators to
>         filter these points)
>         id, MEAS, name, vis, TIMESTAMP, named_measurement
>
>         Alternatively, if you wanted rich points rather than individual values:
>         bithash+id, INDEX, id, ... - (query to get a list of IDs
>         intersecting your space-time region)
>         id, SAMPLE, (x,y), vis, TIMESTAMP, sampleObject(JSON?) - (all
>         in one column)
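>
>         In Accumulo API terms, that two-step read might look roughly
>         like this (table names, auths, and the conn Connector are
>         placeholders):
>
>             Scanner index = conn.createScanner("geoIndex", auths);
>             index.setRange(Range.prefix(new Text(bithash)));  // space-time region
>             List<Range> ids = new ArrayList<Range>();
>             for (Map.Entry<Key,Value> e : index) {
>                 ids.add(Range.exact(e.getKey().getColumnQualifier()));
>             }
>             BatchScanner series = conn.createBatchScanner("series", auths, 10);
>             series.setRanges(ids);
>             // attach the timestamp-filtering iterator here, then read
>             for (Map.Entry<Key,Value> kv : series) { /* consume points */ }
>             series.close();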
>
>         If this is way off base from what you are trying to do, please
>         ignore.
>
>         Kurt
>
>         -----
>
>
>         On 6/18/13 10:14 PM, Iezzi, Adam [USA] wrote:
>
>             All,
>
>             Thank you for all of the replies. To answer some of the
>             questions:
>
>             Q: You say you have point data. Are time series
>             geographically fixed, with only the time dimension
>             changing? ... or are the time series moving in space-time?
>             A: The time series will be moving in space-time;
>             therefore, the dataset is geologic.
>
>             Q: If you have time series (identified by <id>) moving in
>             space-time, then I would add an indirection.
>             A: Our dataset is very similar to what you describe. Each
>             geospatial point and time stamp is defined by an id.
>              Since I'm new to the Accumulo world, I'm not very
>             familiar with this pattern/approach to table design, but
>             I will look around now that I have some guidance.
>
>             Overall, I think I need to create a space-time hash of my
>             dataset, but the biggest question I have is, "what time
>             span do I use?". At the moment, I only have a year's worth
>             of data; therefore, my MIN_DATE = Jan 01 and MAX_DATE =
>             Dec 31. But we obviously expect this data to continue to
>             grow and would therefore want to account for additional
>             data in the future.
>
>             Thanks again for all of the guidance. I will digest some
>             of the comments and will report back.
>
>             Adam
>
>             -----Original Message-----
>             From: Kurt Christensen [mailto:hoodel@hoodel.com]
>             Sent: Tuesday, June 18, 2013 8:54 PM
>             To: user@accumulo.apache.org
>             Subject: [External] Re: Storing, Indexing, and Querying
>             data in Accumulo (geo + timeseries)
>
>
>             An effective optimization strategy will be largely
>             influenced by the nature of your data.
>
>             You say you have point data. Are time series
>             geographically fixed, with only the time dimension
>             changing? ... or are the time series moving in space-time?
>
>             I was going to suggest a 3-D approach, bit-interleaving
>             your space and time [modulo timespan] together (or a
>             point-tree, octree, k-d trie, or r-d trie). The trick
>             there is to pick a time span large enough that any
>             interval you query is small relative to the time span,
>             but small enough that you don't waste a bunch (up to an
>             eighth) of your usable hash values on ranges with no
>             useful time data (i.e. populate your most significant
>             bits). This would work if your data were geographically
>             fixed, but changing only in time. If your time span is
>             geologic, you might want to use a logarithmic time scale.
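>
>             A minimal sketch of that [modulo timespan] step (SPAN_MS
>             and the 21-bit width are illustrative):
>
>                 // Map a timestamp into the fixed window, scaled so
>                 // the high bits of the time coordinate are populated.
>                 static final long SPAN_MS = 50L * 365 * 24 * 3600 * 1000; // ~50 yr
>
>                 static long timeCoordinate(long millis) {
>                     long offset = millis % SPAN_MS;   // [modulo timespan]
>                     return (offset << 21) / SPAN_MS;  // 21-bit coordinate
>                 }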
>
>             If you have time series (identified by <id>) moving in
>             space-time, then I would add an indirection. Use the
>             space-time hash to determine the IDs intersecting your
>             zone, and then query again, using the IDs to pull out the
>             time series, filtering with your iterator, perhaps using
>             the native timestamp field.
>
>             I hope that helps. Good luck.
>
>             Kurt
>
>             BTW: 50% filtering isn't really that inefficient. - kkc
>
>
>             On 6/18/13 12:36 AM, Jared Winick wrote:
>
>                 Have you considered a "geohash" of all 3 dimensions
>                 together and using that as the RowID? I have never
>                 implemented a geohash exactly, but I do know it is
>                 possible to build a z-order curve on more than 2
>                 dimensions, which may be what you want, considering
>                 that it sounds like all your queries are in 3
>                 dimensions.
>
>
>                 On Mon, Jun 17, 2013 at 7:56 PM, Iezzi, Adam
>                 [USA] <iezzi_adam@bah.com> wrote:
>
>                      I've been asked by my client to store a dataset
>                      which contains a time series and geospatial
>                      coordinates (points) in Accumulo. At the moment,
>                      we have very dense data stored in Accumulo using
>                      the following table schema:
>
>                      Row ID: <geohash>_<reverse timestamp>
>
>                      Family: <id>
>
>                      Qualifier: attribute
>
>                      Value: <value>
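>
>                      For reference, a sketch of how we build such a
>                      row ID (geohash() stands in for the geohash
>                      library in use; names are illustrative):
>
>                          String rowId = geohash(lat, lon, precision) + "_"
>                              + String.format("%019d", Long.MAX_VALUE - ts);
>                          // zero-padding keeps lexicographic order equal
>                          // to reverse chronological order (newest first)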
>
>                      We are salting our Row IDs with a geohash to
>                      prevent hot spotting. When we query the data, we
>                      use a prefix scan (center tile and eight
>                      neighbors), then an Iterator to filter out the
>                      outliers (in both space and time). Unfortunately,
>                      we've noticed some performance issues with this
>                      approach: the initial prefix scan brings back a
>                      ton of data, forcing the iterators to filter out
>                      a significant number of outliers. E.g., more than
>                      50% is being filtered out, which seems
>                      inefficient to us. Unfortunately for us, our
>                      users will always query by space and time, making
>                      them equally important for each query. Because of
>                      the time series component of our data, we're
>                      often bringing back a significant amount of data
>                      for each given point. Each point can have ten
>                      entries due to the time series, making our data
>                      set very, very dense.
>
>                      The following are some options we're considering:
>
>                      1. Salt a master table with an ID rather than
>                      the geohash (<id>_<reverse timestamp>), and then
>                      create a spatial index table. If we choose this
>                      option, I assume we would scan the index first,
>                      then use a batch scanner with the IDs from the
>                      first query. Unfortunately, I still see us
>                      filtering out a significant amount of data with
>                      this approach.
>
>                      2. Keep the table design as is, and maybe add a
>                      RegExFilter via a custom Iterator (see the
>                      sketch after this list).
>
>                      3. Do something completely different, such as
>                      using a Column Family and the temporal aspect of
>                      the dataset together in some way.
>
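>                      A sketch of option 2 using the built-in
>                      RegExFilter (the regex, iterator name, and
>                      priority are placeholders):
>
>                          IteratorSetting re = new IteratorSetting(
>                              30, "reFilter", RegExFilter.class);
>                          // match on the row only; cf/cq/value regexes null
>                          RegExFilter.setRegexs(re, "^dr5r.*", null,
>                              null, null, false);
>                          scanner.addScanIterator(re);
>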
>                      Any advice or guidance would be greatly appreciated.
>
>                      Thank you,
>
>                      Adam
>
>

-- 

Kurt Christensen
P.O. Box 811
Westminster, MD 21158-0811

------------------------------------------------------------------------
"One of the penalties for refusing to participate in politics is that 
you end up being governed by your inferiors."
--- Plato
