Return-Path: Delivered-To: apmail-couchdb-user-archive@www.apache.org Received: (qmail 5874 invoked from network); 28 Jun 2010 14:53:28 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 28 Jun 2010 14:53:28 -0000 Received: (qmail 27940 invoked by uid 500); 28 Jun 2010 14:53:27 -0000 Delivered-To: apmail-couchdb-user-archive@couchdb.apache.org Received: (qmail 27880 invoked by uid 500); 28 Jun 2010 14:53:27 -0000 Mailing-List: contact user-help@couchdb.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@couchdb.apache.org Delivered-To: mailing list user@couchdb.apache.org Received: (qmail 27872 invoked by uid 99); 28 Jun 2010 14:53:26 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 28 Jun 2010 14:53:26 +0000 X-ASF-Spam-Status: No, hits=0.0 required=10.0 tests=FREEMAIL_FROM,RCVD_IN_DNSWL_NONE,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of robert.newson@gmail.com designates 74.125.82.180 as permitted sender) Received: from [74.125.82.180] (HELO mail-wy0-f180.google.com) (74.125.82.180) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 28 Jun 2010 14:53:20 +0000 Received: by wyj26 with SMTP id 26so1409105wyj.11 for ; Mon, 28 Jun 2010 07:53:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:received:in-reply-to :references:date:message-id:subject:from:to:content-type :content-transfer-encoding; bh=mlK5ucxFGoFxlwltzV1hL/nmT9lXBj04Qgjifd47R40=; b=FYxO+HCoQgq2X/sLF2316QvZBG0KiQkFmkj5ZGJgDm+Fdt2wg6nJr+WObJCEMYP+2H e7Qd8wzWOVwyWxPyaXt2NyDQ3a2Zo886IyuOi3orbTK4xqiXHotURDhyGoGvvflQge0E GId2pw3SmHE0c06PjxcQZOXg8EKw2CCM6qDsM= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type:content-transfer-encoding; b=FIxs+xgC+kO0JogItV/+SZcuJkBvw2gCuCl+yCJWXAKFDl4e6NWiO4ewlQyDNUCpDV AqMbToSJwG9HT+wOds6VVBgPLNshUlwnLg5omopDqNwg4kUZGbV0v68uxexDkPCuFhm/ Xwl45hC6H9Rf863u7dIVfK2QDfoupKl+PHqu8= MIME-Version: 1.0 Received: by 10.216.90.138 with SMTP id e10mr3823896wef.51.1277736779874; Mon, 28 Jun 2010 07:52:59 -0700 (PDT) Received: by 10.216.15.83 with HTTP; Mon, 28 Jun 2010 07:52:59 -0700 (PDT) In-Reply-To: <1227907965.1708391277735107370.JavaMail.root@zimbra5-e1.priv.proxad.net> References: <346340496.1707341277734732494.JavaMail.root@zimbra5-e1.priv.proxad.net> <1227907965.1708391277735107370.JavaMail.root@zimbra5-e1.priv.proxad.net> Date: Mon, 28 Jun 2010 15:52:59 +0100 Message-ID: Subject: Re: selecting a random subset of a view From: Robert Newson To: user@couchdb.apache.org Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Virus-Checked: Checked by ClamAV on apache.org If your _id's are 'aa' + sha1, then you can generate a random key as your startkey parameter to the _all_docs view. The next N hits are randomly selected. It should be clarified that map/reduce methods have to be deterministic, so approaches where you'd generate random numbers to control calls to emit() are not going to work. B. On Mon, Jun 28, 2010 at 3:25 PM, wrote: > Thanks for all your answers > > I forgot to tell I try to avoid the "skip" param for performances reasons= . > > The first method Ian suggest imply to read all my doc ids... In this case= I can easily use the client random function (client is PHP). But reading a= ll doc ids is not really performance friendly. > > The second method Ian speak about, is the one that others (Sebastian, Rob= ert) are proposing : some random parameter in the document. But as Ian tell= s us, "If you use the MD5 or SHA function, then the order will be repeatabl= e if > the data is not changed."... not so random :-) > > My app is a music jukebox, I want to provide a way to fill the playlist w= ith random songs. Each song document have an _id composed of 'aa'+ a random= sha1 . > > I scratched my head already but I think the random feature should be a co= uchdb server feature, and can't be implemented client-side, but by reading = all documents ids and using client-side random function, which is, once aga= in, not really performance-friendly. > > Any idea welcome... > > Mickael > > ----- Mail Original ----- > De: "Ian Hobson" > =C0: user@couchdb.apache.org > Envoy=E9: Lundi 28 Juin 2010 16h04:09 GMT +01:00 Amsterdam / Berlin / Ber= ne / Rome / Stockholm / Vienne > Objet: Re: selecting a random subset of a view > > On 28/06/2010 14:29, mickael.bailly@free.fr wrote: >> Hello couchers, >> >> how would you do to select a random subset of a view result (a simple vi= ew with map only). >> >> > Hi Mickael, > > You want a random sample of predetermined size, from a list that you can > only (best) access randomly. To do this you must know the number of > records in the database. > > Note - I know the stats side MUCH better than the couchdb so this might > not be implementable. > > Here are two methods. > > Method 1. > > Say, for example, you want 3 from 11. > > Take the first index with a probability of 3/11 =A0by computing a random > number in range 0 to 1, and taking the record if rand < 3/11 (using real > math, not integer). > > If you take that record, adjust the number required down by 1. > Reduce the number remaining by 1. > > Take the next record with a probability or 2/10 or 3/10 > > Continue in like manner until you either > > a) Take the last record with a probability of 1/1 or > b) Have all you want, and take the remaining records with a probability > of 0/n > > To do this with couchdb I would read all the IDs into the client and > filter them there, and then read each records separately. > > Method 2 - > > Allocate a random number to each record (from a large range - we don't > want duplicates). This could be a sha or MD5 of the actual data. > Sort by the random number allocated. > Read the first N records that you need. > > I think this sort of index could be set up on the server, so the client > needs only create the index, and read the first N records and remove the > index. The work on the server will be much greater that method 1 though. > > If you use the MD5 or SHA function, then the order will be repeatable if > the data is not changed. > > Regards > > Ian > >