mahout-user mailing list archives

From Sean Owen <>
Subject Re: problems with GenericRecommenderIRStatsEvaluator:
Date Thu, 05 Nov 2009 10:47:31 GMT
On Thu, Nov 5, 2009 at 6:58 AM, michal shmueli <> wrote:
> Is this the aggregated statistics over all users?


> Just to make it more clear, for the 0.7 (last param), does it mean that for
> each user we use 70% of the data to learn and 30% for test?

You sound like you're describing a parameter to RecommenderEvaluator,
which is the thing that splits the data into training and test and
such. This is different.

No, this parameter just controls how much of the overall data to use.
You can turn it way down for a faster evaluation, or use 1.0 to use all of it.

The parameter before it, the relevance threshold, sort of controls
what you're saying. Items are split into "relevant" and "not relevant"
groups based on this threshold. Then all the user's preferences for
relevant items are removed, and recommendations are made. Precision
and recall are calculated based on how many relevant items come back.

It's a slightly funky adaptation of this metric from information
retrieval, and picking the wrong threshold will give meaningless
results. I might pass Double.NaN there to let the framework choose a
suitable value.
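As a toy illustration (plain Java, not Mahout's actual evaluator code) of how the relevance threshold drives precision and recall: items the user rated at or above the threshold form the "relevant" set, those preferences are hidden, and the metrics measure how many relevant items the recommender brings back.

```java
// Toy sketch of the IR-style evaluation described above. The ratings
// and the pretend recommendation list are made up for illustration.
import java.util.*;

public class ThresholdPrecisionRecall {

    // Precision: fraction of recommended items that were relevant.
    static double precision(Set<Long> recommended, Set<Long> relevant) {
        long hits = recommended.stream().filter(relevant::contains).count();
        return recommended.isEmpty() ? 0.0 : (double) hits / recommended.size();
    }

    // Recall: fraction of relevant items that came back as recommendations.
    static double recall(Set<Long> recommended, Set<Long> relevant) {
        long hits = recommended.stream().filter(relevant::contains).count();
        return relevant.isEmpty() ? 0.0 : (double) hits / relevant.size();
    }

    public static void main(String[] args) {
        // One user's preferences: item ID -> rating.
        Map<Long, Double> prefs = new HashMap<>();
        prefs.put(1L, 5.0);
        prefs.put(2L, 4.5);
        prefs.put(3L, 2.0);
        prefs.put(4L, 1.0);

        double relevanceThreshold = 4.0;

        // Items rated at or above the threshold are "relevant".
        Set<Long> relevant = new HashSet<>();
        for (Map.Entry<Long, Double> e : prefs.entrySet()) {
            if (e.getValue() >= relevanceThreshold) {
                relevant.add(e.getKey());
            }
        }

        // Pretend the recommender, with those prefs removed, returned these:
        Set<Long> recommended = new HashSet<>(Arrays.asList(1L, 3L));

        System.out.println("precision = " + precision(recommended, relevant)); // 0.5
        System.out.println("recall = " + recall(recommended, relevant));       // 0.5
    }
}
```

This also shows why a bad threshold gives meaningless numbers: set it too low and every item is "relevant", set it too high and the relevant set is empty.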

> 3. Scalability of the Boolean recommender? I'm using this setting:
> Is this recommender scalable in the number of users?

Looks reasonable to me. What are you asking, how does it scale? Memory
requirements are dominated by the number of preference values in the
input data and scale linearly as input size grows. Running time will
grow linearly with the number of users, and is dominated by this.
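A rough back-of-envelope sketch of that linear memory growth (the bytes-per-preference figure here is an assumption for illustration, not a measured Mahout number):

```java
// Back-of-envelope estimate: in-memory footprint grows linearly with
// the number of preference values. BYTES_PER_PREF is an assumed
// illustrative figure (IDs + value + object overhead), not a measured
// Mahout constant.
public class MemoryEstimate {

    static final long BYTES_PER_PREF = 28; // assumption, for illustration only

    static long estimateBytes(long numPreferences) {
        return numPreferences * BYTES_PER_PREF;
    }

    public static void main(String[] args) {
        for (long prefs : new long[] {1_000_000L, 10_000_000L, 100_000_000L}) {
            System.out.printf("%,d prefs -> ~%d MB%n",
                              prefs, estimateBytes(prefs) >> 20);
        }
    }
}
```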

> 4. Some issue that I still didn't understand... I know that the Taste demo
> doesn't require Hadoop.
> I wonder when Hadoop is required? That is, assume I implemented these 4
> steps:

Hadoop isn't required unless you want to use Hadoop! You probably only
need it if you want to do an off-line computation so large that you
can't do it on one computer. This would probably happen once you have
hundreds of millions of preference values. For smaller input, or when
you need real-time recommendations, you want to use the framework as
you are, which does not involve Hadoop.
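To make the non-Hadoop path concrete, here is a toy sketch (plain Java, not Mahout code) of the kind of in-memory, real-time computation a boolean user-based recommender does: Tanimoto (Jaccard) similarity between users' item sets, and scoring of unseen items by the summed similarity of the other users who have them.

```java
// Toy sketch of an in-memory boolean user-based recommender. All data
// is held in memory and each recommendation is computed on the fly;
// no Hadoop or off-line step is involved.
import java.util.*;
import java.util.stream.Collectors;

public class BooleanUserBasedSketch {

    // Tanimoto coefficient: |A ∩ B| / |A ∪ B|.
    static double tanimoto(Set<Long> a, Set<Long> b) {
        long inter = a.stream().filter(b::contains).count();
        long union = a.size() + b.size() - inter;
        return union == 0 ? 0.0 : (double) inter / union;
    }

    // Score each item the user lacks by the summed similarity of the
    // other users who have it; return the highest-scoring items.
    static List<Long> recommend(Map<Long, Set<Long>> data, long user, int howMany) {
        Set<Long> mine = data.get(user);
        Map<Long, Double> scores = new HashMap<>();
        for (Map.Entry<Long, Set<Long>> other : data.entrySet()) {
            if (other.getKey() == user) {
                continue;
            }
            double sim = tanimoto(mine, other.getValue());
            for (long item : other.getValue()) {
                if (!mine.contains(item)) {
                    scores.merge(item, sim, Double::sum);
                }
            }
        }
        return scores.entrySet().stream()
                .sorted(Map.Entry.<Long, Double>comparingByValue().reversed())
                .limit(howMany)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // user ID -> set of item IDs (boolean prefs: present or absent).
        Map<Long, Set<Long>> data = new HashMap<>();
        data.put(1L, new HashSet<>(Arrays.asList(10L, 11L)));
        data.put(2L, new HashSet<>(Arrays.asList(10L, 11L, 12L)));
        data.put(3L, new HashSet<>(Arrays.asList(20L)));

        // Item 12 ranks first: it comes from the most similar user.
        System.out.println(recommend(data, 1L, 2));
    }
}
```

The loop over all other users is also why running time grows linearly with the number of users, as noted above.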

See for the Hadoop integration.
There is information in the javadoc. You won't find a lot of info
since this still isn't the primary way to use the recommender engine.
