From: Ted Dunning
Date: Tue, 21 May 2013 23:30:11 -0700
Subject: Re: Which database should I use with Mahout
To: "user@mahout.apache.org"

On Tue, May 21, 2013 at 10:34 PM, Johannes Schulte <johannes.schulte@gmail.com> wrote:

> Thanks for the list...as a non native speaker I got problems understanding
> the meaning of dithering here.

Sorry about that. Your English is good enough that I hadn't noticed any deficit.

Dithering is constructive mixing of the recommendation results. The idea is to reorder the top results only slightly and the deeper results more so. There are several good effects and one (slightly) bad one.

The good effects are:

a) there is much less of a sharp boundary at the end of the first page of results. This makes the statistics of the recommender work better and also helps the recommender not get stuck recommending just the things which already appear on the first page.

b) results that are very deep in the list can still be shown occasionally. This means that if the rec engine has even a hint that something is good, it has a chance of increasing the ranking by gathering more data. This is a bit different from (a).

c) (bonus benefit) users like seeing novel things.
Even if they have done nothing to change their recommendations, they like seeing that you have changed something, so they keep coming back to the recommendation page.

The major bad effect is that you are purposely decreasing relevance in the short term in order to get more information that will improve relevance in the long term. The improvements dramatically outweigh this small problem.

> I got the feeling that somewhere between a) and d) there is also
> diversification of items in the recommendation list, so increasing the
> distance between the list items according to some metric like tf/idf on
> item information. Never tried that, but with lucene / solr it should be
> possible to use this information during scoring..

Yes. But no. This can be done at the presentation tier entirely.

I often do it by defining a score based solely on rank, typically something like log(r). I add small amounts of noise to this synthetic score, often distributed exponentially with a small mean. Then I sort the results according to this sum. Here are some simulated results computed using R (here r is the rank vector 1:500):

> (order((log(r) - runif(500, max=2)))[1:20])
[1] 1 2 3 6 5 4 14 9 8 10 7 17 11 15 13 22 28 12 20 39
[1] 1 2 5 3 4 8 6 10 9 16 24 31 20 30 13 18 7 14 36 38
[1] 3 1 5 2 10 4 8 7 14 21 19 26 29 13 27 15 6 12 33 9
[1] 1 2 5 3 6 17 4 20 18 7 19 9 25 8 29 21 15 27 28 12
[1] 1 2 5 3 7 4 8 11 9 15 10 6 33 37 17 27 36 16 34 38
[1] 1 4 2 5 9 3 14 13 12 17 22 25 7 15 18 36 16 6 20 29
[1] 1 3 4 7 2 6 5 12 18 17 13 24 27 10 8 20 14 34 9 46
[1] 3 1 2 6 12 8 7 5 4 19 11 26 10 15 28 35 9 20 42 25

As you can see, the first four results are commonly single digits. This comes about because the uniform noise that I have subtracted from the log can only make a difference of 2 to the log, which is equivalent to changing the rank by a factor of about 7. If we were to use different noise distributions we would get somewhat different kinds of perturbation.
For instance, using exponentially distributed noise gives mostly tame results with some real surprises:

> (order((log(r) - 0.3*rexp(500)))[1:20])
[1] 1 2 3 8 4 5 9 6 7 25 14 11 13 24 10 31 34 12 22 21
[1] 1 2 5 4 3 6 7 12 8 10 9 17 13 11 14 25 64 15 47 19
[1] 1 2 3 4 5 6 7 10 8 9 11 21 13 12 15 16 14 25 18 33
[1] 1 2 3 10 4 5 7 14 6 8 13 9 15 25 16 11 20 12 17 54
[1] 1 3 2 4 7 5 6 8 11 23 9 32 18 10 13 15 12 48 14 19
[1] 1 3 2 4 5 10 12 6 9 7 8 18 16 17 11 13 25 14 15 19
[1] 6 1 2 4 3 5 9 11 7 15 8 10 14 12 19 16 13 25 39 18
[1] 1 2 3 4 30 5 7 6 9 8 16 11 10 15 12 13 37 14 31 23
[1] 1 2 3 4 9 16 5 6 8 7 10 13 11 17 15 19 12 20 14 26
[1] 1 2 3 13 5 4 7 6 8 15 12 11 9 10 36 14 24 70 19 16
[1] 1 2 6 3 5 4 11 22 7 9 250 8 10 15 12 17 13 40 16 14

> Have a nice day
>
> On Wed, May 22, 2013 at 2:30 AM, Ted Dunning wrote:
>
> > I have so far just used the weights that Solr applies natively.
> >
> > In my experience, what makes a recommendation engine work better is, in
> > order of importance,
> >
> > a) dithering so that you gather wider data
> >
> > b) using multiple sources of input
> >
> > c) returning results quickly and reliably
> >
> > d) the actual algorithm or weighting scheme
> >
> > If you can cover items a-c in a real business, you are very lucky. The
> > search engine approach handles (b) and (c) by nature, which massively
> > improves the likelihood of ever getting to examine (d).
> >
> > On Tue, May 21, 2013 at 1:13 AM, Johannes Schulte <
> > johannes.schulte@gmail.com> wrote:
> >
> > > Thanks! Could you also add how to learn the weights you talked about,
> > > or at least a hint? Learning weights for search engine query terms
> > > always sounds like "learning to rank" to me, but this always seemed
> > > pretty complicated and I never managed to try it out..
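[Editor's note: for readers who want to try the scheme above outside R, here is a minimal Python sketch of the same log-rank dithering. The function name `dither` and its parameters are illustrative, not from Mahout; it mirrors the R expression `order(log(r) - runif(500, max=2))`.]

```python
import math
import random

def dither(ranked_items, scale=2.0, seed=None):
    """Reorder a best-first ranked list by subtracting uniform noise
    from a log(rank) score and re-sorting ascending by that score.
    Top results move only slightly; deeper results move much more."""
    rng = random.Random(seed)
    # rank starts at 1, so log(rank) grows slowly at the top of the list
    scored = [(math.log(rank) - rng.uniform(0, scale), item)
              for rank, item in enumerate(ranked_items, start=1)]
    scored.sort(key=lambda pair: pair[0])
    return [item for _, item in scored]

# Same shape as the R simulation: dither ranks 1..500, look at the top 20.
print(dither(list(range(1, 501)), seed=1)[:20])
```

With `scale=2.0` the noise can shift a log score by at most 2, so, as noted above, an item can move by roughly a factor of e^2 (about 7) in rank; in particular the top item can never fall out of the first handful of positions.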