couchdb-user mailing list archives

From "Paul Davis" <paul.joseph.da...@gmail.com>
Subject Re: Proper Database Design and View Collation
Date Thu, 04 Sep 2008 01:46:02 GMT
Brian,

Well. I thought I had an answer. I've managed to thoroughly confuse
myself though. The original idea is basically emit one doc per sensor
reading and then use group=true on a reduced view and calculate the
data you need when all the passed in keys are equal. But looking at
logging output from the view I'm getting very confused by how this
beast works. So anyway, I'll throw this out there. It might work,
might not.

I'm especially concerned about what happens during a rereduce. Right
now I haven't got enough brain power left to load a database with a
shitload of data to try and force it to happen (or if that would even
force it). Anyway, on to the brain dump:

1 Doc per sensor that contains everything except the readings attribute.
1 Doc per reading.

And the two docs would get type attributes to distinguish them.
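
As a rough sketch of that split, using the fields from your example
document. The "sensor" type value is an assumption on my part; only
"reading" is used by the map function below.

```javascript
// Hypothetical example documents. The type value "sensor" is an assumption;
// the remaining fields are copied from the original single-document example.
var sensorDoc = {
    type: "sensor",
    sensor_id: "SENSOR01123",
    coordinate: [46.209722, -122.192778],
    last_attempt: 1220480444,
    last_update: 1217887865
    // no "readings" list here any more
};

// One of these per reading, pointing back at its sensor via sensor_id.
var readingDoc = {
    type: "reading",
    sensor_id: "SENSOR01123",
    time_gathered: 1217023706,
    temp: 18,
    pressure: 102.311,
    humidity: 99
};
```

The sensor doc stays small no matter how many readings pile up, since
each reading only carries the sensor_id it belongs to.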

Your first two views are fine as they don't require accessing the
reads attribute.

Tackling your fourth view, last_reading:

The map function would look something like:

function(doc)
{
    if(doc.type != "reading") return ;
    emit(doc.sensor_id, [doc.time_gathered, doc.temp, doc.pressure, ...etc]) ;
}

And the reduce:
function(keys,values,rereduce)
{
    // Not sure why I'm getting null for keys. This was the beginning of the doubt.
    if(keys == null) return null ;

    for(var i = 1 ; i < keys.length ; i++)
    {
        if(keys[i][0] != keys[i-1][0])
        {
            return null ;
        }
    }

    var max_time = 0 ;
    for(var i = 1 ; i < values.length ; i++)
    {
        if(values[i][0] > values[max_time][0])
        {
            max_time = i ;
        }
    }

    return values[max_time] ;
}

And then the query is done like such:
http://localhost:5984/sensors/_view/readings/last?group=true

That pattern could be trivially adapted for doing stddev calculations.

I mentioned being worried about rereduce. I have no idea what gets
passed in or what not so I haven't the slightest how that would work.
For finding the last reading, i *think* it might work as is.
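
For what it's worth, CouchDB passes keys as null when rereduce is true,
and values then holds the results of earlier reduce calls rather than
mapped rows. Here is a sketch of a last-reading reduce that tolerates
both cases. It assumes each value is [time_gathered, temp, pressure, ...]
as in the map function, and it drops the key comparison on the
assumption that group=true takes care of grouping by sensor_id. The
name lastReading is only for illustration.

```javascript
// Sketch only. Because the reduce output has the same shape as a mapped
// value, the same loop works for the first pass and for a rereduce.
function lastReading(keys, values, rereduce) {
    var latest = values[0];
    for (var i = 1; i < values.length; i++) {
        // index 0 of each value is time_gathered
        if (values[i][0] > latest[0]) {
            latest = values[i];
        }
    }
    return latest;
}
```

In a real view the function would be anonymous; the keys and rereduce
arguments are simply unused here.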

The standard deviation stuff would be more tricky though. Basically
you'd have to use a single pass stddev algorithm which is pretty
simple, and then instead of returning just a stddev, you'd return a
structure that had the appropriate state information (num_samples,
mean, variance). Refer to [1] for a Java implementation of single pass
stddev calculation if you haven't seen it before. And if you detected
a rereduce, you'd just combine the set of passed structures which
should theoretically be possible, but there'd be some trickery
involved.
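
A minimal sketch of that idea, not taken from [1]: it uses Welford's
single pass update for the first reduce and the standard pairwise merge
formula for the rereduce. The helper names (summarize, combine, stddev,
reduceTemp) are hypothetical, and the state keeps m2, the sum of squared
deviations from the mean, instead of the variance itself because it
merges more cleanly. It assumes temp sits at index 1 of each value, as
in the map function above.

```javascript
// State: {n: sample count, mean: running mean, m2: sum of squared deviations}

// Welford's single pass update over an array of numbers.
function summarize(samples) {
    var s = {n: 0, mean: 0, m2: 0};
    for (var i = 0; i < samples.length; i++) {
        s.n += 1;
        var delta = samples[i] - s.mean;
        s.mean += delta / s.n;
        s.m2 += delta * (samples[i] - s.mean);
    }
    return s;
}

// Merge two partial states into the state of the combined sample set.
function combine(a, b) {
    if (a.n === 0) return b;
    if (b.n === 0) return a;
    var n = a.n + b.n;
    var delta = b.mean - a.mean;
    return {
        n: n,
        mean: a.mean + delta * b.n / n,
        m2: a.m2 + b.m2 + delta * delta * a.n * b.n / n
    };
}

// Population standard deviation from a state.
function stddev(s) {
    return s.n > 0 ? Math.sqrt(s.m2 / s.n) : 0;
}

// Reduce for the temp field: summarize mapped rows on the first pass,
// merge earlier states on a rereduce.
function reduceTemp(keys, values, rereduce) {
    if (rereduce) {
        return values.reduce(function (acc, s) { return combine(acc, s); },
                             {n: 0, mean: 0, m2: 0});
    }
    return summarize(values.map(function (v) { return v[1]; }));
}
```

The view would return the state object, and the caller would compute
stddev from it and compare against the acceptable deviation.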

[1] http://www.slamb.org/svn/repos/trunk/projects/common/src/java/org/slamb/common/stats/Sample.java


On Wed, Sep 3, 2008 at 8:50 PM, Brian Troutwine
<goofyheadedpunk@gmail.com> wrote:
> I'm currently using CouchDB to store time-series data, but am having
> difficulty conceptualizing a proper database design. In this email I
> will outline the system I would like to develop, summarize my current
> approach and give what I see to be its current defects. I would
> appreciate any comments and suggestions toward improving my
> implementation.
>
> As I said, I'm gathering data from a number of meteorological sensors.
> These devices take readings of various factors (ambient temperature,
> atmospheric pressure and relative humidity) on a fixed interval, say
> one per minute, and store them until I am able to retrieve them.
> Currently an attempt is made once per hour, though the connection to
> an individual device is tenuous at best, so information concerning the
> last attempted retrieval and the last successful retrieval must be
> stored. Additionally, each sensor has a number of static attributes
> which I also store in CouchDB, such as the sensor's unique ID and GPS
> coordinates.
>
> I represent each sensor as a single document in CouchDB, storing the
> readings as documents, with timestamps, in a list. Here's an example:
>
>  {"sensor_id" : "SENSOR01123",
>  "coordinate" : [46.209722, -122.192778],
>  "last_attempt" : 1220480444,
>  "last_update" : 1217887865,
>  "readings" : [{"time_gathered" : 1217023706,
>                 "temp" : 18,
>                 "pressure" : 102.311,
>                 "humidity" : 99},
>                ...,
>               ],
>  }
>
> I have four views: get_new_attempts, find_unresponsive,
> find_malfunctioning and last_reading. The first two are simple: they
> compare the last_attempt and last_update fields, respectively, to the
> current date, emitting sensor_id and coordinate. The third requires
> computing the standard deviation of the temperature, pressure and
> humidity measurements of all readings and emits the sensor_id of that
> sensor which has more than a fixed, acceptable deviation. As all the
> readings are stored in the sensor document this computation is,
> currently, a straightforward iteration. The last emits the
> sensor_id as key and the data reading with the largest time_gathered
> as value.
>
> The main problem with this approach is the eventual size of the sensor
> document becomes quite large. I will exhaust my machine's ability to
> fit more than a few documents in memory in less than a month. Also,
> though I have read cmlenz's CouchDB "Joins", I do not see how I might
> go about writing the find_malfunctioning and last_reading views if I
> were to store readings as separate documents without modifying the
> return value of the views (I am loath to do that).
>
> Is it possible to store readings in separate documents and still
> maintain the functionality outlined above? If so, how might I go about
> doing that?
>
> Thanks,
> Brian
>
>
> --
> Brian
>
