mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sean Owen <sro...@gmail.com>
Subject Re: FullRunningAverage possibly inefficient and (very slightly) inaccurate?
Date Thu, 21 Oct 2010 09:26:42 GMT
This class is used mostly in slope-one. The priorities are memory usage,
then speed, then accuracy in that context.

Because of memory concerns, I don't want to store more than an int and
double. So yes you can store a total and count. So we're left trading speed
for accuracy. Your change is a bit more accurate but slower at reads (must
divide to get the average).

For slope one, which is read-heavy, and where accuracy is not a big deal,
this isn't a worthwhile tradeoff I think.

What's the use case you have in mind?

On Thu, Oct 21, 2010 at 10:03 AM, gabeweb <gabriel_webster@htc.com> wrote:

>
> I'm a bit concerned by the implementation of FullRunningAverage.  It's
> probably not a big deal, but because it performs division each time a new
> datum is added, there will be a floating point error that will increase
> over
> time, and if you average together millions of data points, this error may
> become significant.  I know that Java doubles have really huge precision,
> but still.
>
> In addition, the implementation seems rather inefficient in that it
> performs
> 6 math operations every time a datum is added.  Since adding to a running
> average is likely to happen much more often than accessing the average,
> wouldn't it be a lot more efficient to maintain the *total* and count
> internally, rather than the current average and the count, and then when
> someone requests the average, calculate the actual average on the fly by
> dividing the total by the count?  Then it would just be two addition
> operations to add a datum.  This would simultaneously eliminate the
> accumulation of floating point errors.
>
> If people agree, I'd be happy to fix this and submit a patch.  It's a
> simple
> thing, but it's sort of bothering me :)  Plus it will allow me to get
> familiar with the process.
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/FullRunningAverage-possibly-inefficient-and-very-slightly-inaccurate-tp1744425p1744425.html
> Sent from the Mahout User List mailing list archive at Nabble.com.
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message