couchdb-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From J Chris Anderson <jch...@gmail.com>
Subject Re: selecting a random subset of a view
Date Mon, 28 Jun 2010 15:16:44 GMT

On Jun 28, 2010, at 7:52 AM, Robert Newson wrote:

> If your _id's are 'aa' + sha1, then you can generate a random key as
> your startkey parameter to the _all_docs view. The next N hits are
> randomly selected.
> 
> It should be clarified that map/reduce methods have to be
> deterministic, so approaches where you'd generate random numbers to
> control calls to emit() are not going to work.
> 

Those approaches work just fine (it's just that you get a randomly sorted static view). So
the initial order is random, but it is always the same in subsequent queries. This is fine
if you have an evenly distributed keyspace, as you can use a random key generator + limit
to get random sets.

> B.
> 
> On Mon, Jun 28, 2010 at 3:25 PM,  <mickael.bailly@free.fr> wrote:
>> Thanks for all your answers
>> 
>> I forgot to tell I try to avoid the "skip" param for performances reasons.
>> 
>> The first method Ian suggest imply to read all my doc ids... In this case I can easily
use the client random function (client is PHP). But reading all doc ids is not really performance
friendly.
>> 
>> The second method Ian speak about, is the one that others (Sebastian, Robert) are
proposing : some random parameter in the document. But as Ian tells us, "If you use the MD5
or SHA function, then the order will be repeatable if
>> the data is not changed."... not so random :-)
>> 
>> My app is a music jukebox, I want to provide a way to fill the playlist with random
songs. Each song document have an _id composed of 'aa'+ a random sha1 .
>> 
>> I scratched my head already but I think the random feature should be a couchdb server
feature, and can't be implemented client-side, but by reading all documents ids and using
client-side random function, which is, once again, not really performance-friendly.
>> 
>> Any idea welcome...
>> 
>> Mickael
>> 
>> ----- Mail Original -----
>> De: "Ian Hobson" <ian@ianhobson.co.uk>
>> À: user@couchdb.apache.org
>> Envoyé: Lundi 28 Juin 2010 16h04:09 GMT +01:00 Amsterdam / Berlin / Berne / Rome
/ Stockholm / Vienne
>> Objet: Re: selecting a random subset of a view
>> 
>> On 28/06/2010 14:29, mickael.bailly@free.fr wrote:
>>> Hello couchers,
>>> 
>>> how would you do to select a random subset of a view result (a simple view with
map only).
>>> 
>>> 
>> Hi Mickael,
>> 
>> You want a random sample of predetermined size, from a list that you can
>> only (best) access randomly. To do this you must know the number of
>> records in the database.
>> 
>> Note - I know the stats side MUCH better than the couchdb so this might
>> not be implementable.
>> 
>> Here are two methods.
>> 
>> Method 1.
>> 
>> Say, for example, you want 3 from 11.
>> 
>> Take the first index with a probability of 3/11  by computing a random
>> number in range 0 to 1, and taking the record if rand < 3/11 (using real
>> math, not integer).
>> 
>> If you take that record, adjust the number required down by 1.
>> Reduce the number remaining by 1.
>> 
>> Take the next record with a probability or 2/10 or 3/10
>> 
>> Continue in like manner until you either
>> 
>> a) Take the last record with a probability of 1/1 or
>> b) Have all you want, and take the remaining records with a probability
>> of 0/n
>> 
>> To do this with couchdb I would read all the IDs into the client and
>> filter them there, and then read each records separately.
>> 
>> Method 2 -
>> 
>> Allocate a random number to each record (from a large range - we don't
>> want duplicates). This could be a sha or MD5 of the actual data.
>> Sort by the random number allocated.
>> Read the first N records that you need.
>> 
>> I think this sort of index could be set up on the server, so the client
>> needs only create the index, and read the first N records and remove the
>> index. The work on the server will be much greater that method 1 though.
>> 
>> If you use the MD5 or SHA function, then the order will be repeatable if
>> the data is not changed.
>> 
>> Regards
>> 
>> Ian
>> 
>> 


Mime
View raw message