incubator-couchdb-user mailing list archives

From Dave Cottlehuber <...@jsonified.com>
Subject Re: CouchDB document design for facebook insights
Date Fri, 07 Dec 2012 09:08:44 GMT
On 7 December 2012 09:32, Dmitriy Fot <dmitriy.fot@gmail.com> wrote:
> Hi All,
>
> Like many others, I am new to CouchDB, and therefore not sure about the
> proper usage of this technology, especially when it comes to the
> design of a document and views.

Welcome & great questions :-)

> I am going to use CouchDB for analytical information based on Facebook
> insights and other sources. We are going to collect the analytical
> information over time and keep it forever, then, of course, we would
> like to build analytical reports based on this information.

Excellent - this situation where you have a dataset that may no longer
fit in memory is a great fit for Couch. "Throw it in, figure it out
later".

A pro tip is to consider turning some of the live views into documents
(kinda like a pre-cooked partial reduce). e.g. the total number of X per
month could easily be stored as an aggregate doc, with the aggregates
kept in a separate db from the large full archive db.
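
A minimal sketch of that idea, assuming the document layout proposed
further down in this thread. The helper name, the "agg:" _id scheme and
the "type" and "totals" fields are made up for illustration, and only
plain numeric metrics are summed:

```javascript
// Hypothetical helper: fold raw metric documents into one aggregate
// document per customer, source and month. Nested metric objects such as
// page_wall_posts_source_unique are skipped in this sketch.
function buildMonthlyAggregates(docs) {
  const pad = (n) => String(n).padStart(2, "0");
  const aggregates = new Map();

  for (const doc of docs) {
    const [year, month] = doc.timestamp; // [year, month, day, hour, min, sec]
    for (const metric of doc.metrics) {
      // Assumed _id scheme: agg:<customerId>:<sourceId>:<YYYY-MM>
      const id = `agg:${doc.customerId}:${metric.sourceId}:${year}-${pad(month)}`;
      let agg = aggregates.get(id);
      if (!agg) {
        agg = {
          _id: id,
          type: "monthly_aggregate",
          customerId: doc.customerId,
          sourceId: metric.sourceId,
          year: year,
          month: month,
          totals: {}
        };
        aggregates.set(id, agg);
      }
      for (const [name, value] of Object.entries(metric.values)) {
        if (typeof value === "number") {
          agg.totals[name] = (agg.totals[name] || 0) + value;
        }
      }
    }
  }
  // One plain object per aggregate, ready to be written to the smaller db
  return Array.from(aggregates.values());
}
```

The resulting docs could then be saved to the separate aggregates db, so
that reports read the small db instead of the full archive.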

> My main concern is a proper design of a document as we are going to
> have millions of them. And, If possible, I would like more experienced
> CouchDB users to see it and warn me if I am about to make a big
> mistake.
>
> The proposed design of a document:
>
> {
>    "_id": "0b69a33807d4cb63680dbebc16000af5",

If you can, make the id something meaningful. There's nothing wrong
with using the built-in id generator, but the id is the only enforced
unique identifier (within a single node), so make the most of it!
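
As a rough sketch, one option is to build the id from fields the
document already has. This assumes one document per customer per
collection time and a UTC timestamp with a 1-based month; the function
name and the id layout are only an illustration:

```javascript
// Hypothetical helper: derive a meaningful _id from customerId and the
// [year, month, day, hour, minute, second] timestamp array.
function makeDocId(customerId, timestamp) {
  const pad = (n) => String(n).padStart(2, "0");
  const [y, mo, d, h, mi, s] = timestamp;
  // The zero-padded ISO-like suffix makes ids for one customer sort by time
  return `${customerId}:${y}-${pad(mo)}-${pad(d)}T${pad(h)}:${pad(mi)}:${pad(s)}Z`;
}

const exampleId = makeDocId(
  "71ff942f-9283-4916-ab84-4927bce09117",
  [2012, 10, 15, 10, 0, 0]
);
// exampleId is "71ff942f-9283-4916-ab84-4927bce09117:2012-10-15T10:00:00Z"
```

With ids like this, a startkey/endkey range on _all_docs can fetch one
customer's documents for a time window without any view at all.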

>    "_rev": "1-7c9916592c377e32cf83acf746a8647c",
>    "metrics": [       //array of metrics, one element per facebook
> page, around 10 pages per document
>        {
>            "sourceId": "210627525692699", //facebook page ID
>            "source": "facebook",
>            "values": {
>                "page_likes": 53
>                //many more other metrics, around 100
>            }
>        },
>        {
>            "sourceId": "354413697924499", // //facebook page ID
>            "source": "facebook",
>            "values": {
>                "page_wall_posts_source_unique": {other: 0, composer: 1},
>                "page_likes": 12
>                //many more other metrics, around 100
>            }
>        }
>    ],
>    "timestamp": [
>        2012,
>        10,
>        15,
>        10,
>        0,
>        0
>    ],

No reason why you can't store the timestamp in multiple formats. The
expanded form is very easy to use for group levels; otherwise I'd
probably just skip it and use either seconds since the epoch (space
efficient and fast comparisons) or an ISO 8601 string (human
readable and sorts the way you'd expect). I believe date calculations
& parsing in JS are relatively slow, but that will only be relevant if
a lot of your views change.
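
To make the options concrete, here is a sketch of the two alternative
formats plus a map function that uses the expanded form as a key. It
assumes the array is [year, month, day, hour, minute, second] in UTC
with a 1-based month. All function names are made up, and the local
emit() only stands in for the one CouchDB provides inside a view:

```javascript
// Seconds since the epoch from the expanded array (assumed UTC, 1-based month)
function toEpochSeconds(ts) {
  const [y, mo, d, h, mi, s] = ts;
  return Date.UTC(y, mo - 1, d, h, mi, s) / 1000;
}

// ISO 8601 string from the same array
function toIso8601(ts) {
  return new Date(toEpochSeconds(ts) * 1000).toISOString();
}

// Stand-in for CouchDB's emit(), so the map function can be tried locally
const emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

// Map function sketch: one row per numeric metric, keyed so that group
// levels roll up by customer, page, metric, year, month, day and hour.
// Nested metric objects are skipped here and would need their own view.
function mapMetricTotals(doc) {
  if (!doc.metrics || !doc.timestamp) return;
  var ts = doc.timestamp;
  doc.metrics.forEach(function (metric) {
    Object.keys(metric.values).forEach(function (name) {
      var value = metric.values[name];
      if (typeof value === "number") {
        emit([doc.customerId, metric.sourceId, name, ts[0], ts[1], ts[2], ts[3]], value);
      }
    });
  });
}
```

Paired with the built-in _sum reduce, querying such a view with
group_level=5 would give monthly totals per customer, page and metric,
and group_level=6 daily ones.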

>    "customerId": "71ff942f-9283-4916-ab84-4927bce09117"
> }
>
> Expected number of documents: +10 000 every hour, +240 000 every day.

BigCouch. In the near future we plan to merge these so the distinction
will not be important.

> Expected requests to the documents:
> - sum of values per customer, per sourceId, per metric in a given time period
> - specialized views for more complex metrics
>
>
> Questions:
> - In order to get analytics for some complex metrics (like
> page_wall_posts_source_unique) we will need to build specialized
> views, probably many of them, should I expect problems with view
> update time?

This is a function of size and type of hardware, number of shards,

With some testing you should be able to find a reasonable compromise.

Refer to the note at the top about keeping aggregate stats in a more
"live" db, which avoids very long view build times.

> - Is it the right decision to use an array for the timestamp, or is it
> better to use a long?
> - Should I use one design document or put every view in a new one?

There are some space optimisations on the resulting view(s) if you
have a single ddoc or split them. I think you're likely to do a bit of
both.

The main catch is that if you update view A in ddoc A, then all views
in ddoc A need to be rebuilt, even if the other views don't actually
change.

> any comment is appreciated, thank you

A+
Dave
