couchdb-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aroj George <aro...@gmail.com>
Subject How best to workaround time filtering + aggregation
Date Thu, 16 Jun 2011 20:38:35 GMT
Hi All,

We have this requirement where we want to do aggregations over a set of time
series data after we have filtered it by time, in response to the time range
values input by a user.

Size: 10 million documents.
Aggregations to be applied: Sum, Count, Average, Max, Min, Latest etc..
Grouped By: Clinic Id,Field Name
*
Doc Structure:*
{ clinic id: 10, data :  { beds : 10, meds: 20, doctors : 30 }, timestamp :
T1 }

*View Structure:*
key: timestamp, clinic id, field name ( T1, 10, beds )
value: field value (10)

The above view will allow us to filter by a time range, but not group by
clinic id + field name, because of the timestamp being the first part of the
key.

So the only option it seems is to do the time filtering in Couch and do the
aggregations on the filtered set outside of Couch.

*Options:*

1.  *Python*
Load the filtered set in the client, (python) and use pythons's built in
reduce functions like sum, count etc.

Concerns: Loading a huge set of data into the python client may not be
efficient.

2. *List functions*
Use the Couch view to do time filter + do the aggregation in a list
function.
The key advantage here could be that everything is being done in Couch. This
not what a list function is meant for,
but given that aggregation via map/reduce is not possible in this case,
could this be a reasonable solution?

Concerns:
Will the javascript list function be slower than even doing the aggregation
in python?
Can I switch to an erlang list function to speed things up?

3. *SQL*
Load the data into a SQL table at regular intervals or using the _changes
view, with the below columns,
and use SQL's group by and sum,count functions to return the query results.

Entity ID, Field Name, Field Value, Timestamp

Select sum(field value) group by entity id, field name where timestamp > T1
and timestamp < T2

4. *Lucene*
I am not sure, but can something like Elastic Search (
http://www.elasticsearch.org/) or Solr help here?

5. *Multiple Queries*
Another option I could think of is, to change the view key to

*[clinic id, field name, timestamp]* ie add the timestamp at the end.

Now do a query for each clinic and field, like start_key = [clinic id, field
name, T1] and end_key = [clinic id, field name, T2] and group = 2.
This will give me the result, but I will have to do a query for (number of
clinics * number of fields) times.
Which again will be too inefficient because of that many HTTP calls. Unless
I can make all multiple such queries in a single HTTP GET?
Can I optimise this somehow?

What do you guys think, which of the above is the best option? How best
would you solve this use case?

The question essentially being, what's the most efficient  and scalable way
to perform real time aggregations on a set of data with time as filter
criteria?
The primary impact of the time filter being that we can't really do the
aggregations using the map reduce views.

Rgds,
Aroj

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message