The basic reason it is common to binarize these relationships is that
putting weights on them makes it really easy to overfit,
thus giving you very goofy results.
One method for putting weights on these elements is to simply use

    weight(i,j) = log((N_rows + 1)/(rowSum_i + 1)) * log((N_cols + 1)/(colSum_j + 1))

where the weight is set to zero in any cell of the item-item matrix that
does not contain a 1.
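A minimal sketch of this weighting in NumPy, using a small made-up binary item-item matrix (the matrix values and shape are illustrative assumptions, not from the thread):

```python
import numpy as np

# Hypothetical binary item-item matrix: 1 means the two items co-occur.
M = np.array([
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
])

n_rows, n_cols = M.shape
row_sums = M.sum(axis=1)  # rowSum_i
col_sums = M.sum(axis=0)  # colSum_j

# weight(i,j) = log((N_rows+1)/(rowSum_i+1)) * log((N_cols+1)/(colSum_j+1))
row_w = np.log((n_rows + 1) / (row_sums + 1))
col_w = np.log((n_cols + 1) / (col_sums + 1))

# Multiplying by the binary matrix zeroes out every cell that has no 1.
weights = M * np.outer(row_w, col_w)
```

Cells with rare rows and columns (small sums) get the largest weights, which is the same IDF-like damping familiar from text retrieval.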
Another reasonable weighting is to simply use row or column counts
(depending on the application). You get something very similar to this
weighting when you use a text retrieval engine to produce recommendations,
where documents are the columns of the item-item matrix and you multiply by
a user history expressed in items.
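The retrieval-style recommendation step above can be sketched as a single matrix-vector product; the matrix and the user-history vector here are made-up illustrations:

```python
import numpy as np

# Hypothetical binary item-item matrix; columns play the role of documents
# in the retrieval analogy.
M = np.array([
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
])

# A user's history as a binary indicator vector over items (rows).
history = np.array([1, 0, 1])

# Each column's score is the number of the user's items that point at it --
# exactly the count-based weighting described above.
scores = history @ M
```

Ranking the columns by `scores` then gives the recommendation order; swapping the binary matrix for the log-weighted one yields the dampened variant.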
On Fri, Dec 21, 2012 at 3:45 PM, Kai R. Larsen <kai.larsen@colorado.edu> wrote:
> Hi,
>
> My sincere apologies if this is a naïve question (I'm sure it is).
>
> I've engaged a programmer to take a weblog and focus on 250 pages
> containing items that may be similar (or not). The goal is to create
> item-item relationship tables where every cell contains a score for how
> similar two items are. He now tells me that only two of the (many) Mahout
> algorithms can be used to generate such tables, and those that do generate
> a distance of 1 or some other constant value between all pairs.
>
> This can't be true, can it? There must be a way to tease out such
> information from the algorithms. Any advice? Any ideas why all
> relationships would be one? While it is common for the website users to
> have visited only one page at a time, it is not pervasive.
>
> Best,
>
> Kai Larsen
>
