incubator-couchdb-user mailing list archives

From David Caylor <dcay...@fh.org>
Subject Re: selecting a random subset of a view
Date Mon, 28 Jun 2010 17:01:30 GMT
I think it would be great if views supported random queries, maybe by adding
a query argument, but truly random results are rarely necessary in practice.
For example, when generating a play list, what you really need is enough
unpredictability that the play lists stay satisfying over repeated listening.

You could get this by using Sabastian's suggestion and requesting the
documents one at a time.  On the client side, generate a random hash and use
it as your startkey.  If you get no result, fetch the first document in the
view instead, so that your results are spread evenly over the whole list
(loop the list).  This works well if you need one document at a time.
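
For instance, a rough client-side sketch (the database name, view name, and
key format below are placeholders of my own, not anything from this thread):

import json
import uuid

import requests

# Placeholder URL; adjust to your own database and view.
VIEW_URL = "http://localhost:5984/music/_design/songs/_view/by_sort_key"

def random_doc():
    # Random hex string as startkey; CouchDB expects the key JSON-encoded.
    params = {"startkey": json.dumps(uuid.uuid4().hex),
              "limit": 1, "include_docs": "true"}
    rows = requests.get(VIEW_URL, params=params).json()["rows"]
    if not rows:
        # Fell off the end of the key range: wrap around to the first row.
        del params["startkey"]
        rows = requests.get(VIEW_URL, params=params).json()["rows"]
    return rows[0]["doc"] if rows else None

Strictly speaking this favors documents that sit after large gaps in the key
space, but for a play list that bias is usually invisible.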

If you want to retrieve several documents at a time, I'd consider
periodically regenerating a UUID that you use for sorting.  For a play list,
I'd probably do this whenever I update the document with statistics: when a
song is played, I'd update its record with a "last played" timestamp, a play
count, and anything else of interest that says this song was played, and at
the same time write a newly generated UUID string into a sort field, which
I'd use as the key in the view.  Then I'd generate a random hash client-side
as the startkey.  With this approach you could fetch three documents at a
time and still get enough randomness to pick music in a satisfyingly
unpredictable way.  Songs would be unlikely to be played in the same order
twice (unless your list is small).
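
Something like this, as a rough sketch (the database, field, and view names
are again placeholders of my own):

import json
import time
import uuid

import requests

DB = "http://localhost:5984/music"   # placeholder database name

# View keyed on the randomly assigned sort field; PUT this design document
# to the database once.
DESIGN_DOC = {
    "_id": "_design/songs",
    "views": {
        "by_sort_key": {
            "map": "function(doc) { if (doc.type == 'song') emit(doc.sort_key, null); }"
        }
    },
}

def record_play(doc_id):
    # Update the play statistics and re-randomize the sort key in one write.
    doc = requests.get(f"{DB}/{doc_id}").json()
    doc["last_played"] = int(time.time())
    doc["play_count"] = doc.get("play_count", 0) + 1
    doc["sort_key"] = uuid.uuid4().hex   # new position in the shuffled order
    requests.put(f"{DB}/{doc_id}", json=doc)

def next_songs(n=3):
    # Start at a random point in the key space and take the next n rows.
    params = {"startkey": json.dumps(uuid.uuid4().hex),
              "limit": n, "include_docs": "true"}
    url = f"{DB}/_design/songs/_view/by_sort_key"
    return [row["doc"] for row in requests.get(url, params=params).json()["rows"]]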

If your situation supports it, you could update the sort field every time
you retrieve a document.  Whether that's practical depends on your
environment, the list size, the document size, and so on.  You just have to
weigh how "random" you want the list against the impact on the user's
experience of performance.  If the update is non-blocking it won't hurt the
user's perceived performance, but it will make play lists look more random.
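
A minimal sketch of that non-blocking variant (same placeholder names as
above):

import threading
import uuid

import requests

DB = "http://localhost:5984/music"   # placeholder database name

def reshuffle(doc):
    # Give the document a fresh sort key so it lands elsewhere in the view.
    doc["sort_key"] = uuid.uuid4().hex
    requests.put(f"{DB}/{doc['_id']}", json=doc)

def serve_song(doc):
    # Fire-and-forget update: the listener never waits on the write.
    threading.Thread(target=reshuffle, args=(dict(doc),), daemon=True).start()
    return doc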

The occasions when truly random values really matter are uncommon, such as
situations where security requires the next item in a sequence to be
unguessable.  Using CouchDB is already a rebalancing of priorities for those
of us who have built careers out of data-driven applications.  Sometimes the
best answer lies in rethinking our requirements pragmatically.

David Caylor



On Mon, Jun 28, 2010 at 9:15 AM, <mickael.bailly@free.fr> wrote:

> shouldn't list functions be deterministic?
>
>
> ----- Original Message -----
> From: "Randall Leeds" <randall.leeds@gmail.com>
> To: user@couchdb.apache.org
> Sent: Monday, 28 June 2010 17:44:36 GMT +01:00 Amsterdam / Berlin / Bern /
> Rome / Stockholm / Vienna
> Subject: Re: selecting a random subset of a view
>
> Ian's Method One, implemented as a list function, would probably do the
> trick.
>
> On Jun 28, 2010 9:04 AM, "Ian Hobson" <ian@ianhobson.co.uk> wrote:
>
> On 28/06/2010 14:29, mickael.bailly@free.fr wrote:
> >
> > Hello couchers,
> >
> > how would you do to selec...
> Hi Mickael,
>
> You want a random sample of predetermined size, from a list that you can
> only (best) access randomly. To do this you must know the number of records
> in the database.
>
> Note - I know the stats side MUCH better than the CouchDB side, so this
> might not be implementable.
>
> Here are two methods.
>
> Method 1.
>
> Say, for example, you want 3 from 11.
>
> Take the first record with a probability of 3/11 by computing a random
> number in the range 0 to 1 and taking the record if rand < 3/11 (using real
> math, not integer).
>
> If you take that record, reduce the number required by 1.
> Either way, reduce the number remaining by 1.
>
> Take the next record with a probability of 2/10 or 3/10, depending on
> whether you took the first one.
>
> Continue in like manner until you either
>
> a) Take the last record with a probability of 1/1 or
> b) Have all you want, and take the remaining records with a probability of
> 0/n
>
> To do this with CouchDB I would read all the IDs into the client, filter
> them there, and then read each record separately.
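
A rough client-side sketch of that (the database name and sample size are
placeholders, not from this thread):

import random

import requests

DB = "http://localhost:5984/music"   # placeholder database name

def random_sample(n=3):
    # Read all the IDs into the client...
    ids = [row["id"] for row in requests.get(f"{DB}/_all_docs").json()["rows"]]
    # ...filter them there: keep each ID with probability needed/remaining,
    # which yields exactly n IDs, each equally likely to be picked...
    chosen, needed, remaining = [], n, len(ids)
    for doc_id in ids:
        if needed > 0 and random.random() < needed / remaining:
            chosen.append(doc_id)
            needed -= 1
        remaining -= 1
    # ...then read each chosen record separately.
    return [requests.get(f"{DB}/{doc_id}").json() for doc_id in chosen]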
>
> Method 2 -
>
> Allocate a random number to each record (from a large range - we don't want
> duplicates). This could be a SHA or MD5 hash of the actual data.
> Sort by the random number allocated.
> Read the first N records that you need.
>
> I think this sort of index could be set up on the server, so the client
> only needs to create the index, read the first N records, and remove the
> index. The work on the server will be much greater than with method 1,
> though.
>
> If you use the MD5 or SHA function, then the order will be repeatable if
> the data is not changed.
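
A sketch of that key assignment done client-side (field and database names
are placeholders of my own):

import hashlib
import json

import requests

DB = "http://localhost:5984/music"   # placeholder database name

def assign_hash_keys():
    # Key each record on an MD5 of its own data: effectively random, yet
    # repeatable as long as the data does not change.
    url = f"{DB}/_all_docs"
    for row in requests.get(url, params={"include_docs": "true"}).json()["rows"]:
        doc = row["doc"]
        data = {k: v for k, v in doc.items() if not k.startswith("_")}
        doc["hash_key"] = hashlib.md5(
            json.dumps(data, sort_keys=True).encode()).hexdigest()
        requests.put(f"{DB}/{doc['_id']}", json=doc)

A view whose map function emits doc.hash_key then sorts by that key, and
reading it with ?limit=N returns the first N records.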
>
> Regards
>
> Ian
>
