From: Ted Dunning
Date: Tue, 21 May 2013 23:30:11 -0700
Subject: Re: Which database should I use with Mahout
To: "user@mahout.apache.org"

On Tue, May 21, 2013 at 10:34 PM, Johannes Schulte <johannes.schulte@gmail.com> wrote:

> Thanks for the list...as a non native speaker I got problems understanding
> the meaning of dithering here.

Sorry about that. Your English is good enough that I hadn't noticed any deficit.

Dithering is constructive mixing of the recommendation results. The idea is to reorder the top results only slightly and the deeper results more so. There are several good effects and one (slightly) bad one.

The good effects are:

a) there is much less of a sharp boundary at the end of the first page of results. This makes the statistics of the recommender work better and also helps the recommender not get stuck recommending just the things which already appear on the first page.

b) results that are very deep in the list can still be shown occasionally. This means that if the rec engine has even a hint that something is good, it has a chance of increasing the ranking by gathering more data. This is a bit different from (a).

c) (bonus benefit) users like seeing novel things.
Even if they have done nothing to change their recommendations, they like seeing that you have changed something, so they keep coming back to the recommendation page.

The major bad effect is that you are purposely decreasing relevance in the short term in order to get more information that will improve relevance in the long term. The improvements dramatically outweigh this small problem.

> I got the feeling that somewhere between a) and d) there is also
> diversification of items in the recommendation list, so increasing the
> distance between the list items according to some metric like tf/idf on
> item information. Never tried that, but with lucene / solr it should be
> possible to use this information during scoring..

Yes. But no. This can be done at the presentation tier entirely.

I often do it by defining a score based solely on rank, typically something like log(r). I add small amounts of noise to this synthetic score, often distributed exponentially with a small mean. Then I sort the results according to this sum. Here are some simulated results computed using R (here r is the rank vector 1:500):

> (order((log(r) - runif(500, max=2)))[1:20])
[1] 1 2 3 6 5 4 14 9 8 10 7 17 11 15 13 22 28 12 20 39
[1] 1 2 5 3 4 8 6 10 9 16 24 31 20 30 13 18 7 14 36 38
[1] 3 1 5 2 10 4 8 7 14 21 19 26 29 13 27 15 6 12 33 9
[1] 1 2 5 3 6 17 4 20 18 7 19 9 25 8 29 21 15 27 28 12
[1] 1 2 5 3 7 4 8 11 9 15 10 6 33 37 17 27 36 16 34 38
[1] 1 4 2 5 9 3 14 13 12 17 22 25 7 15 18 36 16 6 20 29
[1] 1 3 4 7 2 6 5 12 18 17 13 24 27 10 8 20 14 34 9 46
[1] 3 1 2 6 12 8 7 5 4 19 11 26 10 15 28 35 9 20 42 25

As you can see, the first four results are commonly single digits. This comes about because the uniform noise that I have subtracted from the log can only make a difference of 2 to the log, which is equivalent to changing the rank by a factor of about 7. If we were to use different noise distributions we would get somewhat different kinds of perturbation.
For instance, using exponentially distributed noise gives mostly tame results with some real surprises:

> (order((log(r) - 0.3*rexp(500)))[1:20])
[1] 1 2 3 8 4 5 9 6 7 25 14 11 13 24 10 31 34 12 22 21
[1] 1 2 5 4 3 6 7 12 8 10 9 17 13 11 14 25 64 15 47 19
[1] 1 2 3 4 5 6 7 10 8 9 11 21 13 12 15 16 14 25 18 33
[1] 1 2 3 10 4 5 7 14 6 8 13 9 15 25 16 11 20 12 17 54
[1] 1 3 2 4 7 5 6 8 11 23 9 32 18 10 13 15 12 48 14 19
[1] 1 3 2 4 5 10 12 6 9 7 8 18 16 17 11 13 25 14 15 19
[1] 6 1 2 4 3 5 9 11 7 15 8 10 14 12 19 16 13 25 39 18
[1] 1 2 3 4 30 5 7 6 9 8 16 11 10 15 12 13 37 14 31 23
[1] 1 2 3 4 9 16 5 6 8 7 10 13 11 17 15 19 12 20 14 26
[1] 1 2 3 13 5 4 7 6 8 15 12 11 9 10 36 14 24 70 19 16
[1] 1 2 6 3 5 4 11 22 7 9 250 8 10 15 12 17 13 40 16 14

> Have a nice day
>
> On Wed, May 22, 2013 at 2:30 AM, Ted Dunning wrote:
>
> > I have so far just used the weights that Solr applies natively.
> >
> > In my experience, what makes a recommendation engine work better is, in
> > order of importance,
> >
> > a) dithering so that you gather wider data
> >
> > b) using multiple sources of input
> >
> > c) returning results quickly and reliably
> >
> > d) the actual algorithm or weighting scheme
> >
> > If you can cover items a-c in a real business, you are very lucky. The
> > search engine approach handles (b) and (c) by nature, which massively
> > improves the likelihood of ever getting to examine (d).
> >
> > On Tue, May 21, 2013 at 1:13 AM, Johannes Schulte <
> > johannes.schulte@gmail.com> wrote:
> >
> > > Thanks! Could you also add how to learn the weights you talked about,
> > > or at least a hint? Learning weights for search engine query terms
> > > always sounds like "learning to rank" to me, but this always seemed
> > > pretty complicated and I never managed to try it out..
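[Editor's note: for readers who want to try the scheme above outside R, here is a minimal Python sketch of the same log-rank dithering. The function name `dither` and its parameters are illustrative, not from Mahout; it mirrors the R expression `order(log(r) - runif(500, max=2))`.]

```python
import math
import random

def dither(ranked_items, scale=2.0, seed=None):
    """Reorder a best-first ranked list by subtracting uniform noise
    from a log(rank) score and re-sorting ascending by that score.
    Top results move only slightly; deeper results move much more."""
    rng = random.Random(seed)
    # rank starts at 1, so log(rank) grows slowly at the top of the list
    scored = [(math.log(rank) - rng.uniform(0, scale), item)
              for rank, item in enumerate(ranked_items, start=1)]
    scored.sort(key=lambda pair: pair[0])
    return [item for _, item in scored]

# Same shape as the R simulation: dither ranks 1..500, look at the top 20.
print(dither(list(range(1, 501)), seed=1)[:20])
```

With `scale=2.0` the noise can shift a log score by at most 2, so, as noted above, an item can move by roughly a factor of e^2 (about 7) in rank; in particular the top item can never fall out of the first handful of positions.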