mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Fwd: [jira] Created: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing
Date Tue, 23 Jun 2009 21:13:11 GMT
Transposing my answer to mahout-user based on Grant's suggestion:

---------- Forwarded message ----------
From: Ted Dunning <ted.dunning@gmail.com>
Date: Tue, Jun 23, 2009 at 2:12 PM
Subject: Re: [jira] Created: (MAHOUT-138) Convert main() methods to use
Commons CLI for argument processing
To: mahout-dev@lucene.apache.org



This is what is traditionally done, but it is distinctly sub-optimal in many
ways.  The most serious problem is that a heuristic decision determines
what is important and what is not.

A preferable (and as far as I know never used or implemented) approach would
be to build a real model that includes factors that actually help predict
the desired outcome.  Methods to do this might include:

a) LLR (log-likelihood ratio) feature selection from several behavior types,
followed by IDF-weighted scoring.  I have used this, with additional
follow-on steps, in attrition and loss models for insurance with very good
results, but I have never used it in recommendations.  The basic idea in the
attrition and loss models was to develop positive and negative indicator
sets for each outcome and then cluster in the space of indicator scores.
Finally, we built ANN models over the variables formed by distances to
cluster centroids.  For recommendations, this would mean building positive
and negative feature sets for all items for each kind of behavior.  I would
expect little gain from negative scores but would still use them.  With
positive-only sets, this reduces (almost) to the sum of cooccurrence scores
computed in isolation on each kind of input.
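
As a rough illustration of the first two steps in (a) only (indicator
selection by LLR, then IDF-weighted scoring), here is a minimal Python
sketch.  It is not the insurance model described above: the clustering and
ANN stages are omitted, and the function names, the "behavior:item" feature
tokens, and the threshold of 3.0 are all assumptions made for the example.

```python
from math import log


def _x_log_x(x):
    return 0.0 if x == 0 else x * log(x)


def _entropy(*counts):
    # Unnormalized Shannon entropy of a list of counts.
    return _x_log_x(sum(counts)) - sum(_x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) statistic for a 2x2 contingency table."""
    rows = _entropy(k11 + k12, k21 + k22)
    cols = _entropy(k11 + k21, k12 + k22)
    cells = _entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (rows + cols - cells))


def select_indicators(histories, positives, threshold=3.0):
    """Split features into positive and negative indicators of an outcome.

    histories: dict mapping user -> set of feature tokens such as "view:x"
               (hypothetical naming that mixes several behavior types).
    positives: set of users who showed the outcome; assumed to be a subset
               of the keys of histories.
    """
    n = len(histories)
    n_pos = len(positives)
    features = set().union(*histories.values())
    pos, neg = set(), set()
    for f in features:
        with_f = {u for u, h in histories.items() if f in h}
        k11 = len(with_f & positives)   # feature and outcome
        k12 = len(with_f) - k11         # feature, no outcome
        k21 = n_pos - k11               # outcome, no feature
        k22 = n - k11 - k12 - k21       # neither
        if llr(k11, k12, k21, k22) < threshold:
            continue
        # Cross-multiplied rate comparison avoids division by zero.
        if k11 * (k21 + k22) > k21 * (k11 + k12):
            pos.add(f)
        else:
            neg.add(f)
    return pos, neg


def idf_weights(histories):
    """Inverse document frequency of each feature, treating users as documents."""
    n = len(histories)
    df = {}
    for h in histories.values():
        for f in h:
            df[f] = df.get(f, 0) + 1
    return {f: log(n / c) for f, c in df.items()}


def score(history, pos, neg, idf):
    """IDF-weighted sum over positive indicators minus negative indicators."""
    return (sum(idf[f] for f in history & pos)
            - sum(idf[f] for f in history & neg))
```

For recommendations, one would run select_indicators once per item and per
kind of behavior, with positives set to the users who interacted with that
item.  Dropping the neg term gives the positive-only variant.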

b) shared latent variable reductions across multiple behavior types.  For
SVD or similar decomposition-based techniques, this is equivalent to
reducing column-adjoined matrices for the independent behaviors.  Then, if
you have only one kind of information for a user, you can use the SVD to
fill in the other, missing, information.
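
To make the column-adjoining idea in (b) concrete, here is a small
pure-Python sketch.  It is a deliberately crude stand-in: it keeps only the
top singular vector (a rank-1 reduction found by power iteration), where a
real system would keep many components from a proper SVD routine.  The
function names and the two behavior matrices (views and purchases) are
assumptions for the example.

```python
from math import sqrt


def adjoin(a, b):
    """Column-adjoin two user-by-item matrices that share the same users."""
    return [row_a + row_b for row_a, row_b in zip(a, b)]


def top_right_singular_vector(rows, iters=200):
    """Power iteration on A^T A; returns the top right singular vector of A."""
    n_cols = len(rows[0])
    v = [1.0 / sqrt(n_cols)] * n_cols
    for _ in range(iters):
        # av = A v, then w = A^T av
        av = [sum(a * x for a, x in zip(row, v)) for row in rows]
        w = [sum(rows[i][j] * av[i] for i in range(len(rows)))
             for j in range(n_cols)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # all-zero matrix, nothing to reduce
        v = [x / norm for x in w]
    return v


def fill_in(observed, v, n_observed):
    """Estimate the missing behavior columns for one user.

    observed:   the user's row for the first behavior only.
    v:          right singular vector of the adjoined matrix.
    n_observed: number of columns belonging to the first behavior.
    """
    v1, v2 = v[:n_observed], v[n_observed:]
    denom = sum(x * x for x in v1)
    if denom == 0:
        return [0.0] * len(v2)
    # Least-squares latent coordinate from the observed columns alone.
    coord = sum(o * x for o, x in zip(observed, v1)) / denom
    # Map the latent coordinate back onto the unobserved columns.
    return [coord * x for x in v2]
```

With a single component, every user gets the same ranking over the missing
columns and only the scale changes, so this shows the mechanics and nothing
more.  Personalized fill-in needs several components.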

c) probabilistic latent variable approaches.  For LDA and such, you can put
all of the behavioral information together and use the model to predict
missing observations in the standard Bayesian kind of way.  This is similar
to (b), but much better founded.

On Tue, Jun 23, 2009 at 12:23 PM, Sean Owen <srowen@gmail.com> wrote:

> For example, you could write a script that combines rating,
> purchase history, demographics, in some way that you think is useful,
> to produce 'preference' values.
>
