lucene-dev mailing list archives

From "Greg Bowyer (JIRA)" <>
Subject [jira] [Commented] (SOLR-3673) Random variate functions
Date Wed, 25 Jul 2012 03:05:35 GMT


Greg Bowyer commented on SOLR-3673:

> Greg: for non-math folks like me, can you explain the utility of this? i.e., what is an example
> use case that this helps solve?

Let me get back to you on that one. I will try to find out what the data
scientist who asked me for this has in mind.

> One thing that jumps out at me is that the usage of the Random generators seems completely
> non-deterministic – which may seem desirable in code dealing with random numbers, but in
> the case of a Solr function I don't think so.
> In particular, it looks like the values returned for each doc by the intVal/floatVal/etc...
> methods on the anonymous FunctionValues instance returned by your RandomFunction class are
> dependent on the order in which they are called, and won't return consistent values if they are
> called multiple times for the same docid. So not only will multiple (identical) requests get
> different random values for the same document, but within a single request, asking for the
> value of a single document multiple times will give you different values – which I believe
> will wreak havoc on any attempts to sort by these functions (and could easily cause problems
> if they are wrapped in other functions that expect determinism).
> Does that make sense?

That makes perfect sense and is stupidly thought out on my part. I will look into caching
the results in the scope of the FunctionValues instance. I will also talk to the person who asked
me for this in case he really does want it non-deterministic; if that is the case, I will
try to get him to rationalise the behaviour and codify a memoization function for introducing
determinism to the mix.

From your viewpoint of breaking sort, yes, that is really bad.
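One way to avoid the order dependence described above is to derive each value from a seed and the document ID alone, rather than from a shared stateful generator. The following is only a minimal sketch under that assumption: the class PerDocRandom and its methods are hypothetical and are not taken from the SOLR-3673 patch, and the mixing step is a SplitMix64-style finalizer chosen for illustration, not the Mersenne Twister.

```java
/** Hypothetical helper for illustration; not part of the SOLR-3673 patch. */
final class PerDocRandom {
    private final long seed;

    PerDocRandom(long seed) {
        this.seed = seed;
    }

    /** SplitMix64-style bit mixer: scrambles a 64-bit input into a 64-bit output. */
    static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    /**
     * Returns a value in [0, 1) that depends only on (seed, docId),
     * so repeated calls and different call orders give the same answer.
     */
    double uniform(int docId) {
        long bits = mix(seed + 0x9E3779B97F4A7C15L * (docId + 1L));
        // Keep the top 53 bits and scale them into [0, 1).
        return (bits >>> 11) * 0x1.0p-53;
    }
}
```

With this shape, asking for the value of document 7 before or after document 3 makes no difference, which is the property a sort would rely on.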

> I think at a minimum we should probably add a "seed" argument to all of these functions (similar
> to how RandomSortField uses the field name as a seed) so that people can get consistent values
> from consistent input – if they want it; if they don't, they just pass in a new seed (assuming
> all other things about the request and the index are equal, of course).

That mostly makes sense. I am not sure what to do if an RNG is used that needs more seed data
than the end user provides. At the moment I am using the Mersenne Twister, which requires 128 bits
of seed data, and I am nervous about exposing the particulars of the underlying RNG or its seeding.
I will, however, update the patch so that seed data can be provided.
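One possible way to handle an RNG that needs more seed data than the user supplies is to stretch the user's seed with a hash, so the RNG's seed size never appears in the function signature. This is a sketch of that idea only: SeedExpander and expand are hypothetical names, and using SHA-256 for the expansion is an assumption made here, not something the patch is known to do.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of the SOLR-3673 patch. */
final class SeedExpander {

    /**
     * Deterministically expands an arbitrary user-supplied seed string into
     * numBytes of seed material (at most 32, the size of a SHA-256 digest).
     */
    static byte[] expand(String userSeed, int numBytes) {
        if (numBytes < 1 || numBytes > 32) {
            throw new IllegalArgumentException("numBytes must be between 1 and 32");
        }
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(userSeed.getBytes(StandardCharsets.UTF_8));
            // Truncate the 32-byte digest to the size the RNG needs.
            return Arrays.copyOf(digest, numBytes);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

Calling SeedExpander.expand("myseed", 16) would yield the 128 bits mentioned above, and the same input string always yields the same bytes.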

> Even if we do that, I'm still worried about intVal(docid) returning different values
> if it's called multiple times in a single request ... it may make sense to (precompute
> and) cache the random values – if not long term, then at least in the lifespan of the FunctionValues.
> What do you think?

As above, stashing the values for each document ID seems to make sense.
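Stashing the values per document ID could look roughly like the sketch below. It is plain Java with no Solr dependencies; MemoizedRandomValues is a hypothetical stand-in for the anonymous FunctionValues instance discussed above, and java.util.Random is used only to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/** Hypothetical stand-in for a FunctionValues instance; not from the patch. */
final class MemoizedRandomValues {
    private final Random rng;
    private final Map<Integer, Double> cache = new HashMap<>();

    MemoizedRandomValues(long seed) {
        this.rng = new Random(seed);
    }

    /** Draws a value the first time a document is seen, then reuses it. */
    double doubleVal(int docId) {
        return cache.computeIfAbsent(docId, id -> rng.nextDouble());
    }

    /** Narrowing view of the same cached value, so all accessors agree. */
    float floatVal(int docId) {
        return (float) doubleVal(docId);
    }
}
```

Note that memoization alone makes repeated lookups consistent within the lifespan of the instance, but the value a document receives still depends on the order in which documents are first requested. The sketch is also not thread-safe.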
> Random variate functions
> ------------------------
>                 Key: SOLR-3673
>                 URL:
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 4.0, 5.0
>            Reporter: Greg Bowyer
>            Assignee: Greg Bowyer
>         Attachments: SOLR-3673.patch
> Hi all
> At my $DAYJOB I have been asked to build a few random variate functions that return random
> numbers bound to a distribution.
> I think these can be added to Solr.
> I have a hesitation in that the code as written uses / needs Uncommons Maths (because
> we want a far better RNG than Java's and because I am lazy and did not want to write distributions).
> Uncommons Maths is Apache-licensed, so we are good on that front.
> Does anyone have any thoughts on this?
> For reference the functions are:
> rgaussian(mean, stddev) -> Random value aligned to Gaussian distribution
> rpoisson(mean) -> Random value aligned to Poisson distribution
> rbinomial(n, prob) -> Random value aligned to binomial distribution
> rcontinous(min, max) -> Random continuous value between min and max
> rdiscrete(min, max) -> Random discrete value between min and max
> rexponential(rate) -> Random value from the exponential distribution
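For readers unfamiliar with random variates, the sketch below shows what functions of this kind compute, using only java.util.Random. It is not the patch's implementation (which relies on Uncommons Maths); VariateSketch and its method names are hypothetical, and treating max as inclusive in the discrete case is an assumption.

```java
import java.util.Random;

/** Hypothetical illustration of the listed distributions; not from the patch. */
final class VariateSketch {
    private final Random rng;

    VariateSketch(long seed) {
        this.rng = new Random(seed);
    }

    /** Gaussian: shift and scale a standard normal draw. */
    double gaussian(double mean, double stddev) {
        return mean + stddev * rng.nextGaussian();
    }

    /** Exponential via inverse transform; 1 - u lies in (0, 1], so log is finite. */
    double exponential(double rate) {
        return -Math.log(1.0 - rng.nextDouble()) / rate;
    }

    /** Continuous uniform value between min and max. */
    double continuous(double min, double max) {
        return min + (max - min) * rng.nextDouble();
    }

    /** Discrete uniform value in [min, max]; inclusive max is an assumption. */
    int discrete(int min, int max) {
        return min + rng.nextInt(max - min + 1);
    }

    /** Binomial: count successes in n independent trials. */
    int binomial(int n, double prob) {
        int successes = 0;
        for (int i = 0; i < n; i++) {
            if (rng.nextDouble() < prob) {
                successes++;
            }
        }
        return successes;
    }

    /** Poisson via Knuth's multiplication method; suitable for small means only. */
    int poisson(double mean) {
        double limit = Math.exp(-mean);
        double product = 1.0;
        int k = 0;
        do {
            k++;
            product *= rng.nextDouble();
        } while (product > limit);
        return k - 1;
    }
}
```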
