mahout-user mailing list archives

From Pat Ferrel <>
Subject Re: Setting up a recommender
Date Mon, 22 Jul 2013 16:20:08 GMT

Love the academics but I agree with this. Recently saw a VP from Netflix plead with the audience
(mostly academics) to move past RMSE--focus on maximizing correct ranking, not rating prediction.

Anyway I have a pipeline that does the following:
1. Ingests logs, either TSV or CSV, of arbitrary column ordering--will pick out the actions by
   position and string.
2. Replaces PreparePreferenceMatrixJob to create n matrices, depending on the number of actions
   you are splitting out. This job also creates external <-> internal item and user ID
   BiHashMaps for going back and forth between the log's IDs and Mahout internal IDs. It guarantees
   a uniform item and user ID space and sparse matrix ranks by creating one from all actions.
   Not completely scalable since it is not done in m/r though it uses HDFS--I have a plan to
   m/r the process and get rid of the hashmap.
3. Performs the RowSimilarityJob on the primary matrix "B" and does B'A to create a cooccurrence
   matrix for primary and secondary actions.
4. Goes on to use the rest of the Mahout pipeline on B to get recs and does a [B'A]H_v
   to calculate all cross-recommendations.
5. Stores all recs from all models in a NoSQL DB.
6. At rec request time, does a linear combination of rec and cross-rec to return the highest
   scored ones. The stored IDs were external so all ready for display.
Does 1-3 fit the first part of 'offline to Solr'? The IDs can be written to Solr as the original
external IDs from the log files, which were strings. This allows them to be treated as terms
by Solr.
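
A minimal sketch of the pipeline steps above, in plain Python rather than Mahout, may make the matrix notation concrete. Everything in it is an assumption made for illustration: the event tuples, the function names, and the blend weights are hypothetical, and raw cooccurrence counts stand in for the LLR-based similarity that RowSimilarityJob would compute.

```python
from collections import defaultdict

# Hypothetical log rows: (external user ID, action, external item ID).
events = [
    ("u1", "purchase", "iphone"), ("u1", "view", "ipad"),
    ("u2", "purchase", "iphone"), ("u2", "purchase", "ipad"), ("u2", "view", "nexus"),
    ("u3", "purchase", "nexus"), ("u3", "view", "ipad"),
]

def build_id_maps(events):
    """Assign dense internal indices from ALL actions, so every matrix shares one ID space."""
    user_ids, item_ids = {}, {}
    for user, _action, item in events:
        user_ids.setdefault(user, len(user_ids))
        item_ids.setdefault(item, len(item_ids))
    return user_ids, item_ids

def action_matrix(events, action, user_ids, item_ids):
    """Sparse user x item matrix for one action: {user_index: {item_index, ...}}."""
    rows = defaultdict(set)
    for user, act, item in events:
        if act == action:
            rows[user_ids[user]].add(item_ids[item])
    return rows

def cooccurrence(b, a):
    """Compute B'A as {item_i: {item_j: count}} (raw counts, not LLR scores)."""
    result = defaultdict(lambda: defaultdict(int))
    for user, b_items in b.items():
        for i in b_items:
            for j in a.get(user, ()):
                result[i][j] += 1
    return result

def multiply(matrix, history):
    """Multiply an item x item matrix by a binary history vector (a set of item indices)."""
    scores = {}
    for i, row in matrix.items():
        score = sum(count for j, count in row.items() if j in history)
        if score:
            scores[i] = score
    return scores

def blend(recs, cross_recs, w_rec, w_cross):
    """Linear combination of recommendation and cross-recommendation scores."""
    items = set(recs) | set(cross_recs)
    return {i: w_rec * recs.get(i, 0) + w_cross * cross_recs.get(i, 0) for i in items}

user_ids, item_ids = build_id_maps(events)
external_item = {index: name for name, index in item_ids.items()}

B = action_matrix(events, "purchase", user_ids, item_ids)  # primary action
A = action_matrix(events, "view", user_ids, item_ids)      # secondary action
BtB = cooccurrence(B, B)  # the diagonal (self-cooccurrence) is kept for simplicity
BtA = cooccurrence(B, A)

# A user who has purchased "iphone" and viewed "ipad".
h_p = {item_ids["iphone"]}
h_v = {item_ids["ipad"]}
blended = blend(multiply(BtB, h_p), multiply(BtA, h_v), w_rec=1.0, w_cross=0.5)

# Map back to external IDs so the results are ready for display.
ranked = sorted(((external_item[i], s) for i, s in blended.items()), key=lambda p: -p[1])
print(ranked)
```

Because the sketch keeps the diagonal of B'B, an item already in the user's purchase history can rank first; a real recommender would normally filter such items out.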

My understanding of the Solr proposal puts B's row similarity matrix in a vector per item.
That means each row is turned into "terms" = external IDs--not sure how the weights of each
term are encoded. So the cross-recommender would just put the cross-action similarity matrix
in other field(s) on the same itemID/docID, right?

Then the straight out recommender queries on the B'B field(s) and the cross-recommender queries
on the B'A field(s). I suppose to keep it simple the cross-action similarity matrix could
be put in a separate index.  Is this about right?
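
To make the question concrete, here is a rough sketch of the layout as I understand it: one document per item, with the item's B'B row in one field and its B'A row in another, each stored as a list of external item IDs. The field names `b_b_links` and `b_a_links`, the `top_k` cutoff, and the helper functions are all hypothetical, and the sketch simply drops the weights since their encoding is the open question. It builds plain dictionaries and a query string only; it does not call any Solr client API, and it assumes the IDs are simple tokens that need no escaping.

```python
def top_terms(row, k):
    """Keep the k strongest entries of a similarity row as a list of external IDs."""
    ordered = sorted(row.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _weight in ordered[:k]]

def item_documents(bb, ba, top_k=2):
    """One document per item; similarity rows become multi-valued term fields."""
    docs = {}
    for item in set(bb) | set(ba):
        docs[item] = {
            "id": item,
            "b_b_links": top_terms(bb.get(item, {}), top_k),  # same-action similarity
            "b_a_links": top_terms(ba.get(item, {}), top_k),  # cross-action similarity
        }
    return docs

def history_query(field, history):
    """Turn a user's history (external item IDs) into a fielded term query string."""
    return "%s:(%s)" % (field, " ".join(sorted(history)))

# Hypothetical similarity rows keyed by external item ID.
bb = {"iphone": {"ipad": 3, "nexus": 1}, "ipad": {"iphone": 3}}
ba = {"iphone": {"ipad": 2}, "nexus": {"ipad": 1}}

docs = item_documents(bb, ba)
rec_query = history_query("b_b_links", {"iphone", "ipad"})  # straight recommender
cross_query = history_query("b_a_links", {"ipad"})          # cross-recommender
print(docs["iphone"], rec_query, cross_query)
```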

On Jul 21, 2013, at 5:30 PM, Sebastian Schelter <> wrote:

At the moment, the down sampling is done by PreparePreferenceMatrixJob
for the collaborative filtering functionality. We just want to move it
down to RowSimilarityJob to enable standalone usage.
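
Ted's quoted message below describes the down-sampling as dropping elements at random from rows that have more data than wanted. A minimal sketch of that idea follows; it is not the Mahout implementation, and the function name, the `max_per_row` parameter, and the fixed seed are assumptions made for illustration.

```python
import random

def downsample_rows(rows, max_per_row, seed=0):
    """Randomly drop elements from any row holding more than max_per_row entries."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    sampled = {}
    for row_id, elements in rows.items():
        elements = list(elements)
        if len(elements) > max_per_row:
            # Keep a uniform random subset; shorter rows pass through unchanged.
            elements = rng.sample(elements, max_per_row)
        sampled[row_id] = elements
    return sampled

# Hypothetical rows: one heavy row with 1000 entries, one light row with 3.
rows = {"heavy": list(range(1000)), "light": [1, 2, 3]}
sampled = downsample_rows(rows, max_per_row=100)
print(len(sampled["heavy"]), len(sampled["light"]))
```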

I think that the CrossRecommender should be the next thing on our
agenda, after we have the deployment infrastructure.  I especially like
that it's capable of including different kinds of interactions, as opposed
to most other (academically motivated) recommenders that focus on a
single interaction type like a rating.


On 22.07.2013 02:14, Ted Dunning wrote:
> The row similarity downsampling is just a matter of dropping elements at
> random from rows that have more data than we want.
> If the join that puts the row together can handle two kinds of input, then
> RowSimilarity can be easily modified to be CrossRowSimilarity.  Likewise,
> if we have two DRM's with the same row id's in the same order, we can do a
> map-side merge.  Such a merge can be very efficient on a system like MapR
> where you can control files to live on the same nodes.
> On Sun, Jul 21, 2013 at 4:43 PM, Pat Ferrel <> wrote:
>> RowSimilarity downsampling? Are you referring to a mod of the matrix
>> multiply to do cross similarity with LLR for the cross recommendations? So
>> similarity of rows of B with rows of A?
>> Sounds like you are proposing not only putting a recommender in Solr but
>> also a cross-recommender? This is why getting a real data set is
>> problematic?
>> On Jul 21, 2013, at 3:40 PM, Ted Dunning <> wrote:
>> Pat,
>> Yes.  The first part probably just is the RowSimilarity job, especially
>> after Sebastian puts in the down-sampling.
>> The new part is exactly as you say, storing the DRM into Solr indexes.
>> There is no reason to not use a real data set.  There is a strong reason to
>> use a synthetic dataset, however, in that it can be trivially scaled up and
>> down both in items and users.  Also, the synthetic dataset doesn't require
>> that the real data be found and downloaded.
>> On Sun, Jul 21, 2013 at 2:17 PM, Pat Ferrel <> wrote:
>>> Read the paper, and the preso.
>>> As to the 'offline to Solr' part. It sounds like you are suggesting an
>>> item-item similarity matrix be stored and indexed in Solr. One would have
>>> to create the action matrix from user profile data (preference history), do
>>> a RowSimilarity job on it (using LLR similarity) and move the result to
>>> Solr. The first part of this is nearly identical to the current recommender
>>> job workflow and could pretty easily be created from it I think. The new
>>> part is taking the DistributedRowMatrix and storing it in a particular way
>>> in Solr, right?
>>> BTW Is there some reason not to use an existing real data set?
>>> On Jul 19, 2013, at 3:45 PM, Ted Dunning <> wrote:
>>> OK.  I think the crux here is the off-line to Solr part so let's see who
>>> else pops up.
>>> Having a solr maven could be very helpful.
