mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Mahout for item-item tables
Date Sat, 22 Dec 2012 03:47:15 GMT
The basic reason that it is common to binarize the relationships is that
putting weights on these relationships makes it really easy to over-fit,
thus giving you very goofy results.

One method for putting weights on these elements is to simply use

weight(i,j) = log((N_rows + 1) / (rowSum_i + 1)) * log((N_cols + 1) / (colSum_j + 1))

where weight(i,j) is set to zero if cell (i,j) of the item-item matrix does
not contain a 1.
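
As a rough illustration of the formula above, here is a minimal Python sketch.
It assumes the binarized item-item matrix is a plain list of 0/1 rows; the
function name idf_style_weights is made up for this example and is not part
of Mahout.

```python
import math


def idf_style_weights(matrix):
    """Weight each non-zero cell of a 0/1 matrix by the product of two logs.

    weight(i,j) = log((N_rows + 1) / (rowSum_i + 1))
                  * log((N_cols + 1) / (colSum_j + 1))
    Cells that hold 0 in the input keep a weight of 0.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[i][j] for i in range(n_rows)) for j in range(n_cols)]

    weights = []
    for i in range(n_rows):
        row_factor = math.log((n_rows + 1) / (row_sums[i] + 1))
        weights.append([
            row_factor * math.log((n_cols + 1) / (col_sums[j] + 1))
            if matrix[i][j] else 0.0
            for j in range(n_cols)
        ])
    return weights


# Toy 3x3 binarized matrix, invented for illustration only.
toy = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
toy_weights = idf_style_weights(toy)
```

Rows and columns with many 1s get smaller weights, so very popular items
contribute less than rare ones.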

Another reasonable weighting is to simply use row or column counts
(depending on the application).  You get something very similar to this
weighting when you use a text retrieval engine to produce recommendations
where documents are columns of the item-item matrix and you multiply by a
user history expressed in items.
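
The retrieval-style scoring just described might look roughly like the sketch
below. The assumptions here are mine: each column of the item-item matrix is
treated as a document, the items in the user history act as query terms, and
each term gets an IDF-like weight computed from its row count. A real text
retrieval engine applies its own scoring formula, so this only approximates
the idea. The name recommend and the toy data are hypothetical.

```python
import math


def recommend(matrix, history, top_n=3):
    """Score columns ("documents") of a 0/1 item-item matrix against a history.

    Assumption: a history item i contributes log((N_cols + 1) / (rowSum_i + 1))
    to every column j where matrix[i][j] is 1. Items already in the history
    are not recommended back.
    """
    n_cols = len(matrix[0]) if matrix else 0
    row_sums = [sum(row) for row in matrix]
    seen = set(history)

    scores = [0.0] * n_cols
    for i in seen:
        term_weight = math.log((n_cols + 1) / (row_sums[i] + 1))
        for j in range(n_cols):
            if matrix[i][j]:
                scores[j] += term_weight

    # Keep unseen items with a positive score, best first; ties go to the lower index.
    candidates = [j for j in range(n_cols) if j not in seen and scores[j] > 0]
    candidates.sort(key=lambda j: (-scores[j], j))
    return [(j, scores[j]) for j in candidates[:top_n]]


# Toy symmetric item-item matrix over four items, invented for illustration.
toy_items = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
ranked = recommend(toy_items, history=[0, 3])
```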

On Fri, Dec 21, 2012 at 3:45 PM, Kai R. Larsen <kai.larsen@colorado.edu> wrote:

> Hi,
>
> My sincere apologies if this is a naïve question (I'm sure it is).
>
> I've engaged a programmer to take a weblog and focus on 250 pages
> containing items that may be similar (or not).  The goal is to create
> item-item relationship tables where every cell contains a score for how
> similar two items are.  He now tells me that only two of the (many) Mahout
> algorithms can be used to generate such tables, and those that do generate
> a distance of 1 or some other constant value between all pairs.
>
> This can't be true, can it?  There must be a way to tease out such
> information from the algorithms.  Any advice?  Any ideas why all
> relationships would be one?  While it is common for the website users to
> have visited only one page at a time, it is not pervasive.
>
> Best,
>
> Kai Larsen
>
