From commits-return-7361-apmail-couchdb-commits-archive=couchdb.apache.org@couchdb.apache.org Wed Nov 23 12:44:55 2011
Mailing-List: contact commits-help@couchdb.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@couchdb.apache.org
Delivered-To: mailing list commits@couchdb.apache.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
From: Apache Wiki
To: Apache Wiki
Date: Wed, 23 Nov 2011 12:44:27 -0000
Message-ID: <20111123124427.65770.91750@eos.apache.org>
Subject: [Couchdb Wiki] Update of "View_Snippets" by RobertNewson
Auto-Submitted: auto-generated

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Couchdb Wiki" for change notification.

The "View_Snippets" page has been changed by RobertNewson:
http://wiki.apache.org/couchdb/View_Snippets?action=diff&rev1=42&rev2=43

Comment:
Remove misleading or incomplete examples
If you then want to know the total count for each parent, you can use the ''group_level'' view parameter:

''startkey=["thisparent"]&endkey=["thisparent",{}]&inclusive_end=false&group_level=1''

- 
- == Retrieve the top N tags ==
- 
- This snippet assumes your docs have a top-level tags element that is an array of strings. In theory it would work with an array of anything, but it hasn't been tested as such.
- 
- Use a standard counting emit function:
- 
- {{{#!highlight javascript
- function(doc) {
-   for(var idx in doc.tags) {
-     emit(doc.tags[idx], 1);
-   }
- }
- }}}
- 
- Note that `MAX` in the reduce function below is the number of tags to return. Technically this snippet relies on an implementation artifact: CouchDB sends keys to the reduce functions in sorted order, so it would break subtly if that stopped being true. Buyer beware!
- 
- {{{#!highlight javascript
- function(keys, values, rereduce)
- {
-   var MAX = 3;
- 
-   /*
-     Basically we're just kind of faking a priority queue. We
-     do have one caveat in that we may process a single key
-     across reduce calls. I'm reasonably certain that even so
-     we'll still be processing keys in collation order, in
-     which case we can just keep the last key from the previous
-     non-rereduce in our return value. Should work itself out
-     in the rereduces, though, when we don't keep the extras
-     around.
-   */
- 
-   var tags = {};
-   var lastkey = null;
-   if(!rereduce)
-   {
-     /*
-       I could probably alter the view output to produce
-       a slightly different output so that this code
-       could get pushed into the same code as below, but
-       I figure that the view output might be used for
-       other reduce functions.
- 
-       This just creates an object {tag1: N, tag2: M, ...}
-     */
- 
-     for(var k in keys)
-     {
-       if(tags[keys[k][0]]) tags[keys[k][0]] += values[k];
-       else tags[keys[k][0]] = values[k];
-     }
-     lastkey = keys[keys.length-1][0];
-   }
-   else
-   {
-     /*
-       This just takes a collection of objects that have
-       (tag, count) key/value pairs and merges them into a
-       single object.
-     */
-     tags = values[0];
-     for(var v = 1; v < values.length; v++)
-     {
-       for(var t in values[v])
-       {
-         if(tags[t]) tags[t] += values[v][t];
-         else tags[t] = values[v][t];
-       }
-     }
-   }
- 
-   /*
-     This code just removes the tags that are out of
-     the top N tags. When rereduce is false we may
-     keep the last key passed to us, because it's
-     possible that we only processed part of its
-     data.
-   */
-   var top = [];
-   for(var t in tags){ top[top.length] = [t, tags[t]]; }
-   function sort_tags(a, b) { return b[1] - a[1]; }
-   top.sort(sort_tags);
-   for(var n = MAX; n < top.length; n++)
-   {
-     if(top[n][0] != lastkey) tags[top[n][0]] = undefined;
-   }
- 
-   // And done.
-   return tags;
- }
- }}}
- =
- There's probably a more efficient method to get the priority queue stuff,=
but I was going for simplicity over speed.
- =
- When querying this reduce you should not use the `group` or `group_level`=
query string parameters. The returned reduce value will be an object with =
the top `MAX` tag: count pairs.
- =
- =
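The reduce/rereduce contract above can be exercised outside CouchDB. The following is a condensed sketch of the same top-N idea in plain Node, not the wiki's exact function; the sample rows, tag names, and `MAX = 2` are invented for illustration (CouchDB passes keys as `[key, docid]` pairs, which the sketch mimics):

```javascript
// Condensed top-N tag reduce: first reduce() over map rows, then
// rereduce() over the partial results. All data here is made up.
var MAX = 2;

function reduce(keys, values, rereduce) {
  var tags = {};
  if (!rereduce) {
    // keys are [key, docid] pairs; accumulate a count per tag
    for (var k = 0; k < keys.length; k++) {
      tags[keys[k][0]] = (tags[keys[k][0]] || 0) + values[k];
    }
  } else {
    // merge partial {tag: count} objects
    for (var v = 0; v < values.length; v++) {
      for (var t in values[v]) {
        tags[t] = (tags[t] || 0) + values[v][t];
      }
    }
  }
  // keep only the MAX most frequent tags
  var top = Object.keys(tags).sort(function (a, b) { return tags[b] - tags[a]; });
  var out = {};
  for (var i = 0; i < Math.min(MAX, top.length); i++) out[top[i]] = tags[top[i]];
  return out;
}

// Two "map output" chunks, as [[key, docid], ...] plus parallel values.
var chunk1 = reduce([["js", "d1"], ["js", "d2"], ["couch", "d1"]], [1, 1, 1], false);
var chunk2 = reduce([["couch", "d3"], ["erlang", "d3"]], [1, 1], false);
var total = reduce(null, [chunk1, chunk2], true);
console.log(total); // { js: 2, couch: 2 }
```

Note how `erlang`'s partial count survives into the rereduce but is then cut by `MAX`; the full snippet's `lastkey` bookkeeping exists to soften exactly this kind of chunk-boundary effect.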
- == Joining an aggregate sum along with related data ==
- 
- Here is a modified example from the [[View_collation|View collation]] page. Note that `group_level` needs to be set to `1` for it to return a meaningful `customer_details`.
- 
- {{{#!highlight javascript
- // Map function
- function(doc) {
-   if (doc.Type == "customer") {
-     emit([doc._id, 0], doc);
-   } else if (doc.Type == "order") {
-     emit([doc.customer_id, 1], doc);
-   }
- }
- 
- // Reduce function
- // Only produces meaningful output.customer_details if group_level >= 1
- function(keys, values, rereduce) {
-   var output = {total: 0};
-   if (rereduce) {
-     for (var idx in values) {
-       if (values[idx].total !== undefined) {
-         output.total += values[idx].total;
-       }
-       if (values[idx].customer_details !== undefined) {
-         output.customer_details = values[idx].customer_details;
-       }
-     }
-   } else {
-     for (var idx in values) {
-       if (values[idx].Type == "customer") output.customer_details = values[idx];
-       else if (values[idx].Type == "order") output.total += 1;
-     }
-   }
-   return output;
- }
- }}}
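To see how the map rows and the reduce fit together at `group_level=1`, here is a rough local simulation in plain Node; the sample docs and the tiny driver are invented for illustration (CouchDB itself does the real grouping and reduce scheduling):

```javascript
// Simulate the customer/order join locally: run the map over sample
// docs, then reduce all rows for one customer group (group_level=1).
function map(doc, rows) {
  if (doc.Type == "customer") rows.push({ key: [doc._id, 0], value: doc });
  else if (doc.Type == "order") rows.push({ key: [doc.customer_id, 1], value: doc });
}

function reduce(keys, values, rereduce) {
  var output = { total: 0 };
  for (var idx in values) {
    if (rereduce) {
      output.total += values[idx].total;
      if (values[idx].customer_details) output.customer_details = values[idx].customer_details;
    } else {
      if (values[idx].Type == "customer") output.customer_details = values[idx];
      else if (values[idx].Type == "order") output.total += 1;
    }
  }
  return output;
}

var docs = [
  { _id: "c1", Type: "customer", name: "Ada" },
  { Type: "order", customer_id: "c1" },
  { Type: "order", customer_id: "c1" }
];
var rows = [];
docs.forEach(function (d) { map(d, rows); });
// group_level=1: all three rows share the group key ["c1"]
var result = reduce(rows.map(function (r) { return r.key; }),
                    rows.map(function (r) { return r.value; }), false);
console.log(result.total, result.customer_details.name); // prints: 2 Ada
```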
- 
- == Computing the standard deviation ==
- This example is from the CouchDB test suite. It is '''much''' simpler than the following example, although it does not calculate min, max, and mean (but that should be an easy exercise).
- 
- {{{#!highlight javascript
- // Map
- function (doc) {
-   emit(doc.val, doc.val)
- }
- }}}
- 
- {{{#!highlight javascript
- // Reduce
- function (keys, values, rereduce) {
-   // This computes the standard deviation of the mapped results
-   var stdDeviation = 0.0;
-   var count = 0;
-   var total = 0.0;
-   var sqrTotal = 0.0;
- 
-   if (!rereduce) {
-     // This is the reduce phase, we are reducing over emitted values
-     // from the map functions.
-     for(var i in values) {
-       total = total + values[i];
-       sqrTotal = sqrTotal + (values[i] * values[i]);
-     }
-     count = values.length;
-   }
-   else {
-     // This is the rereduce phase, we are re-reducing previously
-     // reduced values.
-     for(var i in values) {
-       count = count + values[i].count;
-       total = total + values[i].total;
-       sqrTotal = sqrTotal + values[i].sqrTotal;
-     }
-   }
- 
-   var variance = (sqrTotal - ((total * total)/count)) / count;
-   stdDeviation = Math.sqrt(variance);
- 
-   // the reduce result. It contains enough information to be rereduced
-   // with other reduce results.
-   return {"stdDeviation":stdDeviation, "count":count,
-           "total":total, "sqrTotal":sqrTotal};
- }
- }}}
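Because the reduce returns its running sufficient statistics (`count`, `total`, `sqrTotal`) alongside the answer, rereducing two partial results must match a single pass over all the values. A quick standalone check in plain Node, with invented sample data:

```javascript
// Standalone copy of the reduce above; verify that reducing [1..4] in
// one pass matches rereducing two half-chunks.
function reduce(keys, values, rereduce) {
  var count = 0, total = 0.0, sqrTotal = 0.0;
  if (!rereduce) {
    for (var i in values) { total += values[i]; sqrTotal += values[i] * values[i]; }
    count = values.length;
  } else {
    for (var i in values) {
      count += values[i].count;
      total += values[i].total;
      sqrTotal += values[i].sqrTotal;
    }
  }
  var variance = (sqrTotal - (total * total) / count) / count;
  return { stdDeviation: Math.sqrt(variance), count: count, total: total, sqrTotal: sqrTotal };
}

var whole = reduce(null, [1, 2, 3, 4], false);
var split = reduce(null, [reduce(null, [1, 2], false), reduce(null, [3, 4], false)], true);
console.log(whole.stdDeviation, split.stdDeviation); // both ≈ 1.118
```

Both calls report the population standard deviation of `[1, 2, 3, 4]`, which is √1.25 ≈ 1.118.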
- 
- == Computing simple summary statistics (min, max, mean, standard deviation) ==
- 
- The implementation of standard deviation below is more complex than the algorithm above, which Chan, Golub, and LeVeque call the "textbook one-pass algorithm". While the textbook algorithm is mathematically equivalent to the standard two-pass computation of standard deviation, it can be numerically unstable under certain conditions. Specifically, if the square of the sum and the sum of the squares are both large, then each is computed with some rounding error. If the variance of the data set is small, then subtracting those two large (slightly rounded) numbers can wipe out the computation of the variance. See http://www.jstor.org/stable/2683386, http://people.xiph.org/~tterribe/notes/homs.html, and the Wikipedia description of Knuth's algorithm, http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance.
- 
- The implementation below in JavaScript is by MarcaJames. Any mistakes in the JS coding are my fault. The algorithms are from others (all smarter than I), as noted in the comments in the code. To the best of my knowledge the algorithms are public domain, and my implementation is freely available to all.
- 
- Note that the view is specialized to my dataset, but the reduce function is written to be fairly generic. I kept the view as is because I'm too lazy to write up a generic view, and also because when I wrote it I wasn't sure one could use Date, Math, and RegExp in CouchDB JavaScript.
- 
- {{{#!highlight javascript
- // Map function
- function(doc) {
-   var risk_exponent =
-     -3.194 +
-     doc.CV_VOLOCC_1    * 1.080 +
-     doc.CV_VOLOCC_M    * 0.627 +
-     doc.CV_VOLOCC_R    * 0.553 +
-     doc.CORR_VOLOCC_1M * 1.439 +
-     doc.CORR_VOLOCC_MR * 0.658 +
-     doc.LAG1_OCC_M     * 0.412 +
-     doc.LAG1_OCC_R     * 1.424 +
-     doc.MU_VOL_1       * 0.038 +
-     doc.MU_VOL_M       * 0.100 +
-     doc["CORR_OCC_1M X MU_VOL_M"]   * -0.168 +
-     doc["CORR_OCC_1M X SD_VOL_R"]   *  0.479 +
-     doc["CORR_OCC_1M X LAG1_OCC_R"] * -1.462;
- 
-   var risk = Math.exp(risk_exponent);
- 
-   // parse the date and "chunk" it up
-   var pattern = new RegExp("(.*)-0?(.*)-0?(.*)T0?(.*):0?(.*):0?(.*)(-0800)");
-   var result = pattern.exec(doc.EstimateTime);
-   if(result){
-     // new Date(year, month, day, hours, minutes, seconds, ms);
-     // force rounding to 5 minutes, 0 seconds, for aggregation into 5 minute chunks
-     var fivemin = 5 * Math.floor(result[5]/5);
-     var day = new Date(result[1], result[2]-1, result[3], result[4], fivemin, 0);
-     var weekdays = ["Sun","Mon","Tue","Wed","Thu","Fri","Sat"];
-     // only emit when the timestamp parsed; otherwise day would be undefined
-     emit([weekdays[day.getDay()], day.toLocaleTimeString()], {'risk':risk});
-   }
- }
- 
- // Reduce function
- function (keys, values, rereduce) {
- 
-   // algorithm for on-line computation of moments from
-   //
-   // Tony F. Chan, Gene H. Golub, and Randall J. LeVeque: "Updating
-   // Formulae and a Pairwise Algorithm for Computing Sample
-   // Variances." Technical Report STAN-CS-79-773, Department of
-   // Computer Science, Stanford University, November 1979. url:
-   // ftp://reports.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf
- 
-   // so there is some weirdness in that the original was Fortran, indexed
-   // from 1, with lots of arrays (no lists, no hash tables)
- 
-   // also consulted http://people.xiph.org/~tterribe/notes/homs.html
-   // and http://www.jstor.org/stable/2683386
-   // and (ick!) the wikipedia description of Knuth's algorithm
-   // to clarify what was going on with http://www.slamb.org/svn/repos/trunk/projects/common/src/java/org/slamb/common/stats/Sample.java
- 
-   /*
-     Combine the variance estimates for two partitions, A and B.
-     partitionA and partitionB should both contain
-     { S   : the current estimate of the second moment
-       Sum : the sum of observed values
-       M   : the number of observations used in the partition to calculate S and Sum
-     }
- 
-     The output will be an identical object, containing the S, Sum and
-     M for the combination of partitions A and B.
- 
-     This routine is derived from the original Fortran code in Chan et
-     al. (1979).
- 
-     But it is easily derived by recognizing that all you're doing is
-     multiplying each partition's S and Sum by its respective count M,
-     and then dividing by the new count Ma + Mb. The arrangement of
-     the diff etc. is just rearranging terms to make it look nice.
- 
-     And then summing up the sums, and summing up the counts.
-   */
-   function combine_S(partitionA, partitionB){
-     var NewS = partitionA.S;
-     var NewSum = partitionA.Sum;
-     var min = partitionA.min;
-     var max = partitionA.max;
-     var M = partitionB.M;
-     if(!M){ M = 0; }
-     if(M){
-       var diff = ((partitionA.M * partitionB.Sum / partitionB.M) - partitionA.Sum);
- 
-       NewS += partitionB.S + partitionB.M*diff*diff/(partitionA.M * (partitionA.M + partitionB.M));
-       NewSum += partitionB.Sum;
- 
-       min = Math.min(partitionB.min, min);
-       max = Math.max(partitionB.max, max);
-     }
-     return {'S':NewS, 'Sum':NewSum, 'M': partitionA.M + M, 'min':min, 'max':max};
-   }
- 
-   /*
-     This routine is derived from the original Fortran code in Chan et
-     al. (1979), with the combination step split out above so that it
-     can be called independently in the rereduce step.
- 
-     Arguments:
- 
-     The first argument (values) is an array of objects. The
-     assumption is that the key to the variable of interest is 'risk'.
-     If this is not the case, the seventh argument should be the correct
-     key to use. More complicated data structures are not supported.
- 
-     The second, third, and fourth arguments are in case this is a
-     running tally. You can pass in existing values for M (the number
-     of observations already processed), Sum (the running sum of those
-     M observations) and S (the current estimate of variance for those
-     M observations). Totally optional, defaulting to zero.
- 
-     The fifth parameter is the running min, and the sixth the max.
- 
-     Pass "null" for parameters 2 through 6 if you need to pass a key
-     in the seventh slot.
- 
-     Some notes on the algorithm. There is a precious bit of trickery
-     with stack pointers, etc. that makes for a minimal amount of
-     temporary storage. All this was included in the original
-     algorithm. I can't see that it makes much sense to include all
-     that effort given that I've got gobs of RAM and am instead most
-     likely processor bound, but it reminded me of programming in
-     assembly so I kept it in.
- 
-     If you watch the progress of this algorithm in a debugger or
-     Firebug, you'll see that the size of the stack stays pretty small,
-     with the bottom (0) entry staying at zero, the [1] entry
-     containing a power of two (2, 4, 8, 16, etc.), the [2] entry
-     containing the next power of two down from [1], and so on. As the
-     slots of the stack get filled up, they get cascaded together by
-     the inner loop.
- 
-     You could skip all that, and just pairwise process repeatedly
-     until the list of intermediate values is empty, but whatever. And
-     there seems to be some super small gain in efficiency in using
-     identically sized support for the two groups being combined, in
-     that you don't have to consider different Ma and Mb in the
-     computation. (One less divide, I guess.)
-   */
-   function pairwise_update(values, M, Sum, S, min, max, key){
-     if(!key){ key = 'risk'; }
-     if(!Sum || !S || !M){ Sum = 0; S = 0; M = 0; }
-     if(!min){ min = Infinity; }
-     if(!max){ max = -Infinity; }
-     var T = 0;
-     var stack_ptr = 1;
-     var N = values.length;
-     var half = Math.floor(N/2);
-     var NewSum = 0;
-     var NewS = 0;
-     var SumA = [];
-     var SA = [];
-     var Terms = [];
-     Terms[0] = 0;
-     if(N == 1){
-       T = 1;
-       NewSum = values[0][key];
-       NewS = 0; // the variance of a single observation is zero
-       min = Math.min(values[0][key], min);
-       max = Math.max(values[0][key], max);
-     }else if(N > 1){
-       // loop over the data pairwise
-       for(var i = 0; i < half; i++){
-         // check min and max
-         if(values[2*i+1][key] < values[2*i][key]){
-           min = Math.min(values[2*i+1][key], min);
-           max = Math.max(values[2*i][key], max);
-         }else{
-           min = Math.min(values[2*i][key], min);
-           max = Math.max(values[2*i+1][key], max);
-         }
-         SumA[stack_ptr] = values[2*i+1][key] + values[2*i][key];
-         var diff = values[2*i+1][key] - values[2*i][key];
-         SA[stack_ptr] = (diff * diff) / 2;
-         Terms[stack_ptr] = 2;
-         while(Terms[stack_ptr] == Terms[stack_ptr-1]){
-           // combine the top two elements in storage, as
-           // they have equal numbers of support terms. This
-           // should happen for powers of two (2, 4, 8, etc.).
-           // Everything else gets cleaned up below.
-           stack_ptr--;
-           Terms[stack_ptr] *= 2;
-           // compare this diff with the one below. Here
-           // there is no multiplication and division of the
-           // first sum (SumA[stack_ptr]) because it is the
-           // same size as the other.
-           var diff2 = SumA[stack_ptr] - SumA[stack_ptr+1];
-           SA[stack_ptr] = SA[stack_ptr] + SA[stack_ptr+1] +
-             (diff2 * diff2)/Terms[stack_ptr];
-           SumA[stack_ptr] += SumA[stack_ptr+1];
-         } // repeat as needed
-         stack_ptr++;
-       }
-       stack_ptr--;
-       // check if N is odd
-       if(N % 2 != 0){
-         // handle the dangling entry
-         stack_ptr++;
-         Terms[stack_ptr] = 1;
-         SumA[stack_ptr] = values[N-1][key];
-         SA[stack_ptr] = 0; // the variance of a single observation is zero!
-         min = Math.min(values[N-1][key], min);
-         max = Math.max(values[N-1][key], max);
-       }
-       T = Terms[stack_ptr];
-       NewSum = SumA[stack_ptr];
-       NewS = SA[stack_ptr];
-       if(stack_ptr > 1){
-         // values.length is not a power of two, so not
-         // everything has been scooped up in the inner loop
-         // above. Handle the remainders here.
-         for(var i = stack_ptr-1; i >= 1; i--){
-           // compare this diff with the one above---one
-           // more multiply and divide on the current sum,
-           // because the sizes of the sets (SumA[i] and NewSum)
-           // are different.
-           var diff = Terms[i]*NewSum/T - SumA[i];
-           NewS = NewS + SA[i] +
-             (T * diff * diff)/
-             (Terms[i] * (Terms[i] + T));
-           NewSum += SumA[i];
-           T += Terms[i];
-         }
-       }
-     }
-     // finally, combine NewS and NewSum with S and Sum
-     return combine_S(
-       {'S':NewS, 'Sum':NewSum, 'M': T, 'min':min, 'max':max},
-       {'S':S,    'Sum':Sum,    'M': M, 'min':min, 'max':max});
-   }
- 
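The merge rule in `combine_S` can be sanity-checked on its own. Below is a standalone copy of the routine in plain Node, fed invented partition statistics for the data sets {1, 2} and {3, 4}:

```javascript
// Standalone combine_S: merge the sufficient statistics of two
// partitions of observations (Chan/Golub/LeVeque pairwise update).
function combine_S(a, b) {
  var NewS = a.S, NewSum = a.Sum, min = a.min, max = a.max, M = b.M || 0;
  if (M) {
    var diff = (a.M * b.Sum / b.M) - a.Sum;
    NewS += b.S + b.M * diff * diff / (a.M * (a.M + b.M));
    NewSum += b.Sum;
    min = Math.min(b.min, min);
    max = Math.max(b.max, max);
  }
  return { S: NewS, Sum: NewSum, M: a.M + M, min: min, max: max };
}

// Partition stats: S is the sum of squared deviations from the partition mean.
var A = { S: 0.5, Sum: 3, M: 2, min: 1, max: 2 };   // data {1, 2}
var B = { S: 0.5, Sum: 7, M: 2, min: 3, max: 4 };   // data {3, 4}
var AB = combine_S(A, B);
console.log(AB); // { S: 5, Sum: 10, M: 4, min: 1, max: 4 }
```

`S` here is the sum of squared deviations from the mean, and the merged `S = 5` matches a direct computation of Σ(x − 2.5)² over 1..4.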
- 
-   /*
-     This function is attributed to Knuth, The Art of Computer
-     Programming. Donald Knuth is a math god, so I am sure that it is
-     numerically stable, but I haven't read the source so who knows.
- 
-     The first parameter is again values, a list of objects with the
-     expectation that the variable of interest is contained under the
-     key 'risk'. If this is not the case, pass the correct key in the
-     7th field.
- 
-     Parameters 2 through 6 are all optional. Pass nulls if you need
-     to pass a key in slot 7.
- 
-     In order they are
- 
-       M2:   the current estimate of the second moment (variance)
-       n:    the count of observations used in the current estimate
-       mean: the current mean value estimate
-       min:  the current min value observed
-       max:  the current max value observed
-   */
-   function KnuthianOnLineVariance(values, M2, n, mean, min, max, key){
-     if(!M2){ M2 = 0; }
-     if(!n){ n = 0; }
-     if(!mean){ mean = 0; }
-     if(!min){ min = Infinity; }
-     if(!max){ max = -Infinity; }
-     if(!key){ key = 'risk'; }
- 
-     // this algorithm is apparently a special case of the above
-     // pairwise algorithm, in which you just apply one more value
-     // to the running total. I don't know why, but Chan et al.
-     // (1979), and again in their later paper, claim that using M
-     // greater than 1 is always better than not.
- 
-     // but this code is certainly cleaner! code based on Scott
-     // Lamb's Java found at
-     // http://www.slamb.org/svn/repos/trunk/projects/common/src/java/org/slamb/common/stats/Sample.java
-     // but modified a bit
- 
-     // the loop below follows the standard Knuth update (the original
-     // body was truncated in this notification)
-     for(var i = 0; i < values.length; i++){
-       var x = values[i][key];
-       n++;
-       var delta = x - mean;
-       mean += delta / n;
-       M2 += delta * (x - mean);
-       min = Math.min(x, min);
-       max = Math.max(x, max);
-     }
-     return {'mean':mean, 'M2':M2, 'n':n, 'min':min, 'max':max};
-   }
- }}}
- == Interactive CouchDB Tutorial ==
- See [[http://labs.mudynamics.com/2009/04/03/interactive-couchdb/|this blog post]], which presents a CouchDB emulator (in JavaScript) that explains the basics of map/reduce, view collation, and querying CouchDB RESTfully.
- 
- <>
- == Retrieving documents without a certain field ==
- Sometimes you might need to get a list of documents that '''don't''' have a certain field. You can do this quite easily by emitting keys that fit the "undefined" condition:
- 
- {{{#!highlight javascript
- function(doc) {
-   if (doc.field === void 0) {
-     emit(doc._id, null);
-   }
- }
- }}}
- 
- However, if you have more than just a few fields that need to be tested for absence, you can use another approach instead of creating a view for each negation:
- 
- {{{#!highlight javascript
- function (doc) {
-   // List of fields to test for absence in documents; fields specified
-   // here will be emitted as keys
-   var fields = [ "type", "role", "etc" ];
- 
-   // Loop through our fields
-   for (var idx in fields)
-   {
-     // Does the current field exist?
-     if (!doc.hasOwnProperty(fields[idx]))
-     {
-       // It doesn't, so emit the field name as the key
-       emit(fields[idx], 1);
-     }
-   }
- }
- }}}
- 
- For example: you can now query your view and retrieve all documents that do not contain the field `role` (view/NAME/?key="role").
- 
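The second map function can be smoke-tested in plain Node by stubbing `emit`; the two sample docs below are invented:

```javascript
// Stub emit() so the map function can run outside CouchDB, then feed
// it two sample docs and see which "missing field" rows come out.
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

function map(doc) {
  var fields = ["type", "role", "etc"];
  for (var idx in fields) {
    if (!doc.hasOwnProperty(fields[idx])) emit(fields[idx], 1);
  }
}

map({ _id: "a", type: "user", role: "admin" });  // missing "etc"
map({ _id: "b", type: "user" });                 // missing "role" and "etc"
var missingRole = rows.filter(function (r) { return r.key === "role"; });
console.log(missingRole.length); // 1
```

Each row's key is the name of the missing field, so `?key="role"` selects exactly the docs lacking `role`; emit `doc._id` as the value instead of `1` if you need the ids directly.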
- 
- == Using views to sort documents geographically ==
- 
- If you use latitude/longitude information in your documents, it's not very easy to sort on proximity to a given point using the normal approach (a key of [latitude, longitude]). This is because the two values are on different axes, which doesn't map well onto CouchDB's treatment of the index sorting -- which is a linear sort. However, using a [[http://en.wikipedia.org/wiki/Geohash|geohash]] may solve this, by letting you convert the coordinates of a location into a string that sorts well (e.g., locations that are close together share a common prefix).
- 
- (Note that I haven't actually used this approach, but this came up in IRC and geohashes are conceptually a good match. Please reword/refactor this entry if I've stated the problem or solution poorly.)
- 
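To make the common-prefix idea concrete, here is a minimal geohash encoder sketch (the standard bisection algorithm; the function name and test coordinates are our own, and a real design doc would more likely pull in an existing geohash library):

```javascript
// Encode (lat, lon) as a geohash: alternately bisect the longitude and
// latitude ranges, pack the bisection bits five at a time into base32.
function geohash(lat, lon, precision) {
  var BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";
  var latRange = [-90, 90], lonRange = [-180, 180];
  var hash = "", bit = 0, ch = 0, evenBit = true;
  while (hash.length < precision) {
    var range = evenBit ? lonRange : latRange;
    var val = evenBit ? lon : lat;
    var mid = (range[0] + range[1]) / 2;
    if (val >= mid) { ch = (ch << 1) | 1; range[0] = mid; }
    else { ch = ch << 1; range[1] = mid; }
    evenBit = !evenBit;
    if (++bit === 5) { hash += BASE32.charAt(ch); bit = 0; ch = 0; }
  }
  return hash;
}

console.log(geohash(57.64911, 10.40744, 11)); // "u4pruydqqvj"
```

Emitting something like `geohash(doc.lat, doc.lon, 9)` as the view key makes nearby documents adjacent in the index, so a `startkey`/`endkey` prefix range pulls out a rough neighborhood.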
== Get contents of an object with specific attributes, e.g. doc.objects[0].attribute ==

{{{#!highlight javascript