incubator-couchdb-user mailing list archives

From Blair Nilsson <blair.nils...@gmail.com>
Subject Re: using couchdb for access log analysis
Date Wed, 17 Jun 2009 22:48:00 GMT
We are doing pretty much the same thing here: aggregating data (we are
only after the max values, but the same pattern works for pretty much
anything) for a bunch of solar water heating units.
Overall it works just fine :)

Here are the map and reduce functions; you may want to adapt them or
write your own.

The map:

function(doc) {
  var displayDate = function(date) {
       var year = date.getFullYear();
       var month = date.getMonth()+1;
       month = ((month < 10) ? "0" : "") + month
       var day = date.getDate();
       day = ((day < 10) ? "0" : "") + day
       var hours = date.getHours()
       hours = ((hours < 10) ? "0" : "") + hours
       var minutes = date.getMinutes()
       minutes=((minutes < 10) ? "0" : "") + minutes
       var seconds = date.getSeconds()
       seconds=((seconds < 10) ? "0" : "") + seconds
       return year+"/"+month+"/"+day+" "+hours+":" + minutes+":"+seconds
  }

  // Compare with ===; a single = would assign and match every document.
  if (doc.type === "Solar Reading") {
       var date = new Date(doc.date)
       // Level 1: truncate to the minute.
       date.setSeconds(0)
       emit([doc.site, 1, displayDate(date)], [doc.panel, doc.inlet, doc.outlet])
       // Level 2: truncate to the ten-minute block.
       date.setMinutes(Math.floor(date.getMinutes()/10) * 10)
       emit([doc.site, 2, displayDate(date)], [doc.panel, doc.inlet, doc.outlet])
       // Level 3: truncate to the hour.
       date.setMinutes(0)
       emit([doc.site, 3, displayDate(date)], [doc.panel, doc.inlet, doc.outlet])
       // Level 4: truncate to the day.
       date.setHours(0)
       emit([doc.site, 4, displayDate(date)], [doc.panel, doc.inlet, doc.outlet])
       // Level 5: truncate to the first day of the month.
       date.setDate(1)
       emit([doc.site, 5, displayDate(date)], [doc.panel, doc.inlet, doc.outlet])
  }
}
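
To read one aggregation level back out, you query a key range that pins
the site and the level and varies only the date string. A rough sketch of
building such a query follows. The database name "solar", the design
document "readings" and the view name "max_by_period" are made-up
placeholders, and so is the site id; only the key layout comes from the
map above.

```javascript
// Builds a view query for one site and one aggregation level.
// The path below uses hypothetical names; adjust to your own database.
function buildViewUrl(site, level, from, to) {
  var base = "/solar/_design/readings/_view/max_by_period";
  var params = {
    // Keys must be JSON-encoded, then URL-encoded.
    startkey: JSON.stringify([site, level, from]),
    endkey: JSON.stringify([site, level, to]),
    // group=true gives one reduced row per distinct key.
    group: "true"
  };
  var parts = [];
  for (var name in params) {
    if (params.hasOwnProperty(name)) {
      parts.push(name + "=" + encodeURIComponent(params[name]));
    }
  }
  return base + "?" + parts.join("&");
}

// Hourly maxima (level 3) for a placeholder site on 17 June 2009.
var url = buildViewUrl("site-1", 3, "2009/06/17 00:00:00", "2009/06/17 23:59:59");
```

The date strings have to use the same "YYYY/MM/DD HH:MM:SS" layout that
displayDate produces, since the keys are compared as plain strings.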


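The reduce. Note that you end up in rereduce land pretty quickly, as we
found. Happily, that is just fine here: the reduce returns values in the
same [panel, inlet, outlet] shape it receives, so it can run on its own
output.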

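function(keys, values, rereduce) {
	// Start from the first row, then keep the largest value seen per column.
	var panel = values[0][0]
	var inlet = values[0][1]
	var outlet = values[0][2]
	// Indexed loop with a declared counter; the original for-in leaked a global i.
	for (var i = 1; i < values.length; i++) {
		if (values[i][0] > panel) {
			panel = values[i][0]
		}

		if (values[i][1] > inlet) {
			inlet = values[i][1]
		}

		if (values[i][2] > outlet) {
			outlet = values[i][2]
		}
	}
	return [panel, inlet, outlet]
}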
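
If you want to convince yourself that rereduce is safe for this kind of
function, a quick check outside CouchDB is to reduce some rows directly,
then reduce them in chunks and reduce the partial results again. The
sketch below does that with a compact restatement of the same max logic;
the function name and the readings are made up for illustration.

```javascript
// Same per-column max as the reduce above, repeated so this runs on its own.
function maxReduce(keys, values, rereduce) {
  var result = values[0].slice();
  for (var i = 1; i < values.length; i++) {
    for (var j = 0; j < result.length; j++) {
      if (values[i][j] > result[j]) {
        result[j] = values[i][j];
      }
    }
  }
  return result;
}

// Made-up [panel, inlet, outlet] readings.
var readings = [[61, 20, 35], [74, 22, 41], [58, 25, 39], [80, 21, 44]];

// One pass over everything.
var direct = maxReduce(null, readings, false);

// Two partial reduces, then a rereduce over their outputs.
var partials = [
  maxReduce(null, readings.slice(0, 2), false),
  maxReduce(null, readings.slice(2), false)
];
var combined = maxReduce(null, partials, true);
```

Both direct and combined come out as [80, 25, 44], which is what you
need from any reduce that CouchDB may rereduce.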




On Thu, Jun 18, 2009 at 5:34 AM, Seth Falcon<seth@userprimary.net> wrote:
> Hi all,
>
> I've been exploring using couchdb to store aggregated access log
> data.  I would really appreciate a bit of feedback on the approach
> I've taken.
>
> First, the problem I'm trying to solve.  The input data consists of web
> server access logs.  I want to generate a report of how many times
> each page was viewed over different time windows.  For example, I'd
> like to be able to request the report for the day 2009-06-16 as well
> as a particular hour in a given day.
>
> After reading the wiki page about stats aggregation, I started by
> pre-reducing into one minute chunks.  Each one minutes worth of log
> data results in a list of pages and view counts over that minute.  For
> each page/count pair I insert a document like the following into
> couchdb:
>
>     {
>        "_id" : "2009-06-16T13h30_abcdefg",
>        "url" : "/foo/bar/baz",
>        "view_count" : 13
>     }
>
> The ID is the timestamp representing the minute chunk prepended to the
> md5 digest of the url.
>
> One idea for a view that will allow querying different time units is
> to emit keys like the following for the above example doc:
>
>    ["hour",  "2009-06-16T13", "abcdefg"]
>    ["day",   "2009-06-16",    "abcdefg"]
>    ["month", "2009-06",       "abcdefg"]
>
> Then one can query with startkey=["hour", "2009-06-16T13", true] and
> endkey=["hour", "2009-06-16T13", {}] to get a particular hour using
> group=true and a reduce function that sums the view_count.
>
> Does this seem like a reasonable approach?  Would it be better to
> create separate views for hour, day, and month and avoid the
> array-valued keys?  In a previous post, I goofed up in understanding
> how startkey/endkey queries work.  Am I making a similar error in
> thinking with the above array-valued key approach?  I'm thinking this
> is a different case because the time unit is always an exact match in
> startkey and endkey.
>
> Anyhow, I'd really appreciate any suggestions for improvement or
> confirmation that this should kinda-sorta work.
>
> Cheers,
>
> + seth
>
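
For comparison, the key scheme from the quoted message could be sketched
roughly as below. This is only an illustration under assumptions: it
takes the time parts from ids shaped like the example
"2009-06-16T13h30_abcdefg", the function names are made up, and emit is
passed in as a parameter purely so the snippet runs outside CouchDB
(inside a view you would use the global emit).

```javascript
// Emits one key per time unit for a pre-reduced one-minute document.
function mapByPeriod(doc, emit) {
  // Assumes "<timestamp>_<digest>" ids, as in the quoted example.
  var sep = doc._id.indexOf("_");
  if (sep < 0) {
    return;
  }
  var stamp = doc._id.slice(0, sep);    // e.g. "2009-06-16T13h30"
  var digest = doc._id.slice(sep + 1);  // e.g. "abcdefg"
  emit(["hour", stamp.slice(0, 13), digest], doc.view_count);
  emit(["day", stamp.slice(0, 10), digest], doc.view_count);
  emit(["month", stamp.slice(0, 7), digest], doc.view_count);
}

// Sums view counts; summing partial sums is also correct on rereduce.
function sumReduce(keys, values, rereduce) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i];
  }
  return total;
}

// Run the map over the example document and collect what it emits.
var rows = [];
mapByPeriod(
  { _id: "2009-06-16T13h30_abcdefg", url: "/foo/bar/baz", view_count: 13 },
  function (key, value) { rows.push({ key: key, value: value }); }
);
```

Because the time unit is always an exact match in both startkey and
endkey, a range query only ever spans keys within that one unit.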

I'm not sure if what we are doing is better or worse, but here is the
approach we are using.
