couchdb-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From mickael.bai...@free.fr
Subject Re: selecting a random subset of a view
Date Mon, 28 Jun 2010 16:15:29 GMT
shouldn't list functions be deterministics ?


----- Mail Original -----
De: "Randall Leeds" <randall.leeds@gmail.com>
À: user@couchdb.apache.org
Envoyé: Lundi 28 Juin 2010 17h44:36 GMT +01:00 Amsterdam / Berlin / Berne / Rome / Stockholm
/ Vienne
Objet: Re: selecting a random subset of a view

Ian's Method One, implemented as a list function, would probably do the
trick.

On Jun 28, 2010 9:04 AM, "Ian Hobson" <ian@ianhobson.co.uk> wrote:

On 28/06/2010 14:29, mickael.bailly@free.fr wrote:
>
> Hello couchers,
>
> how would you do to selec...
Hi Mickael,

You want a random sample of predetermined size, from a list that you can
only (best) access randomly. To do this you must know the number of records
in the database.

Note - I know the stats side MUCH better than the couchdb so this might not
be implementable.

Here are two methods.

Method 1.

Say, for example, you want 3 from 11.

Take the first index with a probability of 3/11  by computing a random
number in range 0 to 1, and taking the record if rand < 3/11 (using real
math, not integer).

If you take that record, adjust the number required down by 1.
Reduce the number remaining by 1.

Take the next record with a probability or 2/10 or 3/10

Continue in like manner until you either

a) Take the last record with a probability of 1/1 or
b) Have all you want, and take the remaining records with a probability of
0/n

To do this with couchdb I would read all the IDs into the client and filter
them there, and then read each records separately.

Method 2 -

Allocate a random number to each record (from a large range - we don't want
duplicates). This could be a sha or MD5 of the actual data.
Sort by the random number allocated.
Read the first N records that you need.

I think this sort of index could be set up on the server, so the client
needs only create the index, and read the first N records and remove the
index. The work on the server will be much greater that method 1 though.

If you use the MD5 or SHA function, then the order will be repeatable if the
data is not changed.

Regards

Ian

Mime
View raw message