From: Pat Ferrel
Subject: LLR thresholds
Date: Wed, 8 Mar 2017 08:18:48 -0800
To: Mahout Dev List
Cc: Ted Dunning, ssc@apache.org

The CCO algorithm now supports a couple of ways to limit indicators by “quality”. The new way is by the value of LLR. We built a t-digest mechanism to look at the overall density produced with different thresholds. The higher the threshold, the lower the number of indicators and the lower the density of the resulting indicator matrix, but also the higher the MAP score (of the full recommender). So MAP seems to increase monotonically until it breaks down.

This didn’t match my understanding of LLR, which is actually a test for non-correlation. I was expecting high scores to mean a high likelihood of non-correlation. So the actual formulation in the code must be reversing that, so that the higher the score, the higher the likelihood that non-correlation is *false* (this is treated as evidence of correlation).

The next observation is that with high thresholds we get higher MAP scores from the recommender (expected), but this increases monotonically until it breaks down because there are so few indicators left. This leads us to the conclusion that MAP is not a good way to set the threshold.
We tried looking at precision (MAP) vs recall (number of people who get recs), and this gave ambiguous results with the data we had.

Given my questions about how LLR is actually formulated in Mahout, I’m unsure how to convert it into something like a confidence score, or how else to judge the threshold in a way that would lead to a good choice. Any ideas or illumination about how it’s being calculated or how to judge the threshold?

Long description of motivation:

LLR thresholds are needed when comparing conversion events to things that have very small dimensionality, where maxIndicatorsPerItem does not work well. For example, with location by state there are only 50 possible values, and maxIndicatorsPerItem defaults to 50, so you may end up with 50 very weak indicators. If there are strong indicators in the data, thresholds should be the way to find them. This might lead to a few per item if the data supports it, and this should then be useful. The question above is how to choose a threshold.
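
For reference on the question of turning an LLR score into something like a confidence score, here is a minimal sketch of the textbook log-likelihood ratio (G²) statistic for a 2x2 cooccurrence table, written in the entropy form. It is not taken from the Mahout source. The function names (`x_log_x`, `entropy`, `llr`, `p_value`) are illustrative, and the p-value step assumes the usual asymptotic chi-square distribution with one degree of freedom, which may be a poor approximation for very small counts.

```python
import math


def x_log_x(x):
    """x * ln(x), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy: N*ln(N) - sum(k*ln(k)), N = sum(counts)."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table.

    k11: A and B together, k12: A without B,
    k21: B without A,      k22: neither A nor B.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Rounding error can push a near-zero score slightly negative.
    return max(0.0, 2.0 * (row + col - mat))


def p_value(score):
    """Tail probability of a chi-square variable with 1 degree of freedom.

    Assumption: the score is asymptotically chi-square(1) when A and B
    are independent. For 1 d.o.f. the tail is erfc(sqrt(score / 2)).
    """
    return math.erfc(math.sqrt(score / 2.0))


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    for table in [(10, 10, 10, 10), (30, 10, 10, 30), (100, 0, 0, 100)]:
        score = llr(*table)
        print(table, round(score, 3), p_value(score))
```

Under this formulation a score of zero means the counts look exactly independent, and larger scores mean stronger evidence against independence, which agrees with the behavior described above. Two caveats apply when using it to pick a threshold: the score is equally large for positive and negative association, so it does not by itself say which direction the relationship goes, and a per-pair p-value ignores the fact that many pairs are being tested at once.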