Subject: Re: Solr-recommender for Mahout 0.9
From: Pat Ferrel
Date: Fri, 8 Nov 2013 13:27:08 -0800
To: "user@mahout.apache.org"
Reply-To: user@mahout.apache.org
In-Reply-To: <89632C3D-B66A-44CC-8A45-327D554F22C2@transpac.com>
Message-Id: <597CC172-EE15-421F-840F-A4471A55B6E8@occamsmachete.com>

Not planning to do anything with weights at present. An ORed query and Solr weights should suffice for the time being. There is a good list of ways to do this later if it warrants an experiment. Thanks.

I have similar items as input, recommendations from user "likes", and just got recs from recently viewed working. Once you have online recs from the pre-calculated model, experimenting is super easy.
The next step will be to get more metadata ingested so we can try boosting by context genre, or recent genre viewed, which is sort of in line with "more specific scoring to find the N best from N*4 candidates". Also want to do what Ted calls dithering to vary the choices you see.

On Nov 8, 2013, at 10:10 AM, Ken Krugler wrote:

One other thing I should have mentioned is that if you care about setting weights on incoming terms, you can boost them using the ^ syntax. E.g. "the_kings_speech^1.5 OR skyfall^0.5 OR looper^3.0…"

If you want to account for weights of terms in the index, it's a bit harder. You can do simple boosting by replicating terms, or you can use payload-based boosting, or you could code up your own Similarity class that takes advantage of side-channel data.

But in my experience the gain from applying weights to terms in the index isn't very significant.

And usually I just use Solr to generate a candidate list, then I do more specific scoring to find the N best from N*4 candidates.

-- Ken

On Nov 8, 2013, at 9:54am, Ted Dunning wrote:

> For recommendation work, I suggest that it would be better to simply code
> out an explicit OR query.
>
> On Thu, Nov 7, 2013 at 8:11 PM, Ken Krugler wrote:
>
>> Hi Pat,
>>
>> On Nov 7, 2013, at 7:30pm, Pat Ferrel wrote:
>>
>>> Another approach would be to weight the terms in the docs by their
>>> Mahout similarity strength. But that will be for another day.
>>>
>>> My current question is whether Lucene looks at word proximity. I see the
>>> query syntax supports proximity but I don't see that it is default so
>>> that's good.
>>
>> Based on your description of what you do (generate an OR query of N terms)
>> then no, you shouldn't be getting a boost from proximity.
>>
>> Note that with edismax you can specify a phrase boost, but it will be on
>> the entire set of terms being searched, so unlikely to come into play even
>> if you were using that.
>>
>> -- Ken
>>
>>>
>>> On Nov 7, 2013, at 12:41 PM, Dyer, James wrote:
>>>
>>> To the best of my knowledge, Lucene does not care about the position of a
>>> keyword within a document.
>>>
>>> You could bucket the ids into several fields. Then use a dismax query
>>> to boost the top-tier ids more than the second, etc.
>>>
>>> A more fine-grained approach would probably involve a custom Similarity
>>> class that scales the score based on its position in the document. If we
>>> did this, it might be simpler to index as 1 single-valued field so each id
>>> was position+1 rather than position+100, etc.
>>>
>>> James Dyer
>>> Ingram Content Group
>>> (615) 213-4311
>>>
>>> -----Original Message-----
>>> From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
>>> Sent: Thursday, November 07, 2013 1:46 PM
>>> To: user@mahout.apache.org
>>> Subject: Re: Solr-recommender for Mahout 0.9
>>>
>>> Interesting to think about ordering and adjacentness. The index ids are
>>> sorted by Mahout strength so the first id is the most similar to the row
>>> key and so forth. But the query is ordered by recency. In both cases the
>>> first id is in some sense the most important. Does Solr/Lucene care about
>>> closeness to the top of doc for queries or indexed docs? I don't recall any
>>> mention of this.
>>>
>>> However adjacentness has no meaning in recommendations though I think
>>> it's used in default queries so I may have to account for that.
>>>
>>> The object returned is an ordered list of ids. I use only the IDs now
>>> but there are cases when the contents are also of interest; shopping
>>> cart/watchlist queries for example.
>>>
>>> On Nov 7, 2013, at 10:00 AM, Dyer, James wrote:
>>>
>>> The multivalued field will obey the "positionIncrementGap" value you
>>> specify (default=100). So for querying purposes, those ids will be 100
>>> (or whatever you specified) positions apart.
>>> So a phrase search for
>>> adjacent ids would not match, unless you set the slop >=
>>> positionIncrementGap. Other than this, both scenarios index the same.
>>>
>>> For stored fields, Solr returns an array of values for multivalued
>>> fields, which is convenient when writing a UI.
>>>
>>> James Dyer
>>> Ingram Content Group
>>> (615) 213-4311
>>>
>>> -----Original Message-----
>>> From: Dominik Hübner [mailto:contact@dhuebner.com]
>>> Sent: Thursday, November 07, 2013 11:23 AM
>>> To: user@mahout.apache.org
>>> Subject: Re: Solr-recommender for Mahout 0.9
>>>
>>> Does anyone know what the difference is between keeping the ids in a
>>> space delimited string and indexing a multivalued field of ids? I recently
>>> tried the latter since ... it felt right, however I am not sure which of
>>> the two has which advantages.
>>>
>>> On 07 Nov 2013, at 18:18, Pat Ferrel wrote:
>>>
>>>> I have dismax (no edismax) but am not using it yet, using the default
>>>> query, which does use 'AND'. I had much the same thought as I slept on it.
>>>> Changing to OR is now working much much better. So obvious it almost bit
>>>> me, not good in this case...
>>>>
>>>> With only a trivially small amount of testing I'd say we have a new
>>>> recommender on the block.
>>>>
>>>> If anyone would like to help eyeball test the thing let me know
>>>> off-list. There are a few instructions I'll need to give. And it can't
>>>> handle much load right now due to intentional design limits.
>>>>
>>>> On Nov 7, 2013, at 6:11 AM, Dyer, James wrote:
>>>>
>>>> Pat,
>>>>
>>>> Can you give us the query it generates when you enter "vampire werewolf
>>>> zombie", q/qt/defType ?
>>>>
>>>> My guess is you're using the default query parser with "q.op=AND", or
>>>> you're using dismax/edismax with a high "mm" (min-must-match) value.
>>>>
>>>> James Dyer
>>>> Ingram Content Group
>>>> (615) 213-4311
>>>>
>>>> -----Original Message-----
>>>> From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
>>>> Sent: Wednesday, November 06, 2013 5:53 PM
>>>> To: ssc@apache.org Schelter; user@mahout.apache.org
>>>> Subject: Re: Solr-recommender for Mahout 0.9
>>>>
>>>> Done,
>>>>
>>>> BTW I have the thing running on a demo site but am getting very poor
>>>> results that I think are related to the Solr setup. I'd appreciate any
>>>> ideas.
>>>>
>>>> The sample data has 27,000 items and something like 4000 users. The
>>>> preference data is fairly dense since the users are professional reviewers
>>>> and the items videos.
>>>>
>>>> 1) The number of item-item similarities that are kept is 100. Is this a
>>>> good starting point? Ted, do you recall how many you used before?
>>>> 2) The query is a simple text query made of space delimited video id
>>>> strings. These are the same ids as are stored in the item-item similarity
>>>> docs that Solr indexes.
>>>>
>>>> Hit thumbs up on one video and you get several recommendations. Hit
>>>> thumbs up on several videos and you get no recs. I'm either using the wrong
>>>> query type or have it set up to be too restrictive. As I read through the
>>>> docs if someone has a suggestion or pointer I'd appreciate it.
>>>>
>>>> BTW the same sort of thing happens with Title search. Search for
>>>> "vampire werewolf zombie" you get no results, search for "zombie" you get
>>>> several.
>>>>
>>>> On Nov 6, 2013, at 2:18 PM, Sebastian Schelter wrote:
>>>>
>>>> Hi Pat,
>>>>
>>>> can you create issues for 1) and 2) ? Then I will try to get this into
>>>> trunk asap.
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>> On 06.11.2013 19:13, Pat Ferrel wrote:
>>>>> Trying to integrate the Solr-recommender with the latest Mahout
>>>>> snapshot.
>>>>> The project uses a modified RecommenderJob because it needs
>>>>> SequenceFile output and to get the location of the preparePreferenceMatrix
>>>>> directory. If #1 and #2 are addressed I can remove the modified Mahout code
>>>>> from the project and rely on the default implementations in Mahout 0.9. #3
>>>>> is a longer term issue related to the creation of a CrossRowSimilarityJob.
>>>>>
>>>>> I have dropped the modified code from the Solr-recommender project and
>>>>> have a modified build of the current Mahout 0.9 snapshot. If the following
>>>>> changes are made to Mahout I can test and release a Mahout 0.9 version of
>>>>> the Solr-recommender.
>>>>>
>>>>> 1. Option to change RecommenderJob output format
>>>>>
>>>>> Can someone add an option to output a SequenceFile? I modified the
>>>>> code to do the following. Note the SequenceFileOutputFormat.class as the
>>>>> last parameter, but this should really be determined with an option I think.
>>>>>
>>>>> Job aggregateAndRecommend = prepareJob(
>>>>>     new Path(aggregateAndRecommendInput), outputPath, SequenceFileInputFormat.class,
>>>>>     PartialMultiplyMapper.class, VarLongWritable.class, PrefAndSimilarityColumnWritable.class,
>>>>>     AggregateAndRecommendReducer.class, VarLongWritable.class, RecommendedItemsWritable.class,
>>>>>     SequenceFileOutputFormat.class);
>>>>>
>>>>> 2. Visibility of preparePreferenceMatrix directory location
>>>>>
>>>>> The Solr-recommender needs to find where the RecommenderJob is putting
>>>>> its output.
>>>>>
>>>>> Mahout 0.8 RecommenderJob code was:
>>>>> public static final String DEFAULT_PREPARE_DIR = "preparePreferenceMatrix";
>>>>>
>>>>> Mahout 0.9 RecommenderJob code just puts "preparePreferenceMatrix"
>>>>> inline in the code:
>>>>> Path prepPath = getTempPath("preparePreferenceMatrix");
>>>>>
>>>>> This change to Mahout 0.9 works:
>>>>> public static final String DEFAULT_PREPARE_DIR = "preparePreferenceMatrix";
>>>>> and
>>>>> Path prepPath = getTempPath(DEFAULT_PREPARE_DIR);
>>>>>
>>>>> You could also make this a getter method on the RecommenderJob class
>>>>> instead of using a public constant.
>>>>>
>>>>> 3. Downsampling
>>>>>
>>>>> The downsampling for maximum prefs per user has been moved from
>>>>> PreparePreferenceMatrixJob to RowSimilarityJob. The XRecommenderJob uses
>>>>> matrix math instead of RSJ so it will no longer support downsampling until
>>>>> there is a hypothetical CrossRowSimilarityJob with downsampling in it.
>>
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Cassandra & Solr

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
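
For reference, the explicit OR query with optional ^ boosts suggested in this thread could be assembled roughly as follows. This is only a sketch under stated assumptions: the class name OrQueryBuilder, the method build, and the sample ids and weights are hypothetical and are not part of Solr or Mahout, and the ids are assumed to contain no characters that would need escaping in the Lucene query syntax.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper (not a Solr or Mahout class): turns item ids and
// weights into an explicit OR query string using the ^ boost syntax.
public class OrQueryBuilder {

    // Joins the ids with " OR ". A weight of exactly 1.0 is treated as
    // "no boost" and the ^ suffix is omitted for that term.
    // Assumption: ids need no query-syntax escaping.
    public static String build(Map<String, Double> weightedIds) {
        StringJoiner query = new StringJoiner(" OR ");
        for (Map.Entry<String, Double> entry : weightedIds.entrySet()) {
            double boost = entry.getValue();
            String term = entry.getKey();
            query.add(boost == 1.0 ? term : term + "^" + boost);
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, e.g. most recently viewed first.
        Map<String, Double> recentlyViewed = new LinkedHashMap<>();
        recentlyViewed.put("the_kings_speech", 1.5);
        recentlyViewed.put("skyfall", 0.5);
        recentlyViewed.put("looper", 3.0);
        System.out.println(build(recentlyViewed));
    }
}
```

The resulting string would be passed as the q parameter to whichever query parser is configured. Spelling out OR between terms avoids depending on the default operator (q.op), which is what caused the empty result sets described earlier in the thread.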