mahout-dev mailing list archives

From Dmitriy Lyubimov <>
Subject Re: ALS-WR and reg rate discussion
Date Fri, 16 Dec 2011 19:03:26 GMT
I just suspect there must have been some research or study done on how
accurate factorization problems are on a subsample, similar to standard
errors and confidence intervals. E.g., I know how many samples I need for
the observed mean to fall within a certain confidence interval, provided I
know the original distribution. A similar estimate is sought for a
factorization problem, assuming some standard mixture.
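The standard-error analogy above can be made concrete. This is just the textbook sample-size formula for estimating a mean, not a known bound for factorization problems (no such bound is cited in this thread):

```python
import math

# Sample size needed so the sample mean falls within +/- margin of the
# true mean at a given confidence level, assuming a known sigma.
def required_n(sigma, margin, z=1.96):  # z = 1.96 for a 95% interval
    return math.ceil((z * sigma / margin) ** 2)

n = required_n(sigma=1.0, margin=0.1)  # -> 385
```

The open question in the thread is whether an analogous relation exists between subsample size and the error in the fitted lambda.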


On Fri, Dec 16, 2011 at 10:56 AM, Dmitriy Lyubimov <> wrote:
> the problem is convex, but the idea is not to use MapReduce but to
> subsample and solve it in memory on a reduced sample (I was actually
> thinking of a simple bisection rather than trying to fit to anything), but
> that's not the point.
> The point is how accurately the solution for a random subsample would
> reflect the actual optimum on the whole dataset.
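The "simple bisect" idea above can be sketched as a one-dimensional search over lambda in log space, assuming the held-out error is unimodal in lambda. The error curve here is synthetic; in practice each probe would be a full ALS fit on the subsample:

```python
import math

# Synthetic stand-in for the validation error of an ALS fit at a given
# lambda; assumed unimodal, with its minimum placed at lambda = 0.1.
def val_error(log_lam):
    return (log_lam - math.log(0.1)) ** 2 + 0.5

def golden_section(f, lo, hi, tol=1e-6):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) < f(d):       # minimum lies in [a, d]
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:                 # minimum lies in [c, b]
            a, c = c, d
            d = a + inv_phi * (b - a)
    return (a + b) / 2

best_log_lam = golden_section(val_error, math.log(1e-4), math.log(10.0))
best_lambda = math.exp(best_log_lam)  # ~0.1 for the synthetic curve
```

Searching in log space keeps the bracketing robust when plausible lambdas span several orders of magnitude.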
> On Fri, Dec 16, 2011 at 10:50 AM, Raphael Cendrillon
> <> wrote:
>> Hi Dmitry,
>> I have a feeling the objective may be very close to convex. In that case
>> there are faster approaches than random subsampling.
>> A common strategy, for example, is to fit a quadratic to the previously
>> evaluated lambda values, and then solve it for the minimum.
>> This is an iterative approach, so it wouldn't fit well within MapReduce,
>> but if you are thinking of doing this as a preprocessing step it would be OK.
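The quadratic-fit strategy can be sketched as follows. The (lambda, validation error) pairs are made up for illustration; in practice they would come from prior ALS evaluations:

```python
import numpy as np

# Hypothetical (lambda, validation error) pairs from three prior evaluations.
lambdas = np.array([0.01, 0.1, 1.0])
errors = np.array([0.95, 0.80, 0.92])

# Fit a quadratic error ~ a*lambda^2 + b*lambda + c through the points.
a, b, c = np.polyfit(lambdas, errors, deg=2)

# The parabola's vertex is the next lambda to evaluate (valid when a > 0,
# i.e. the fit is convex).
lambda_next = -b / (2.0 * a)
```

Iterating this (evaluate at `lambda_next`, refit on the best recent points) converges quickly when the true curve is close to quadratic near its minimum.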
>> On Dec 16, 2011, at 10:05 AM, Dmitriy Lyubimov <> wrote:
>>> Hi,
>>> I remember vaguely the discussion of finding the optimum for reg rate
>>> in ALS-WR stuff.
>>> Would it make sense to take a subsample (or, rather, a random
>>> submatrix) of the original input and try to find the optimum for it
>>> somehow, similar to the total order partitioner's distribution sampling?
>>> I have put ALS with regularization and ALS-WR (and will put the
>>> implicit feedback paper as well) into R code, and I was wondering
>>> whether it makes sense to find a better guess for lambda by just doing
>>> an R simulation on randomly subsampled data before putting it into the
>>> pipeline? Or is there a fundamental problem with this approach?
>>> Thanks.
>>> -Dmitriy
