Subject: Re: Solr-recommender for Mahout 0.9
From: Pat Ferrel <pat@occamsmachete.com>
Date: Fri, 8 Nov 2013 13:27:08 -0800
To: "user@mahout.apache.org" <user@mahout.apache.org>

I'm not planning to do anything with weights at present. An ORed query
and Solr's own weights should suffice for the time being. There is a good
list of ways to do this later if it warrants an experiment. Thanks.

So far I have similar items as input, recommendations from user "likes",
and just got recs from recently viewed items working. Once you have online
recs from the pre-calculated model, experimenting is super easy. The next
step will be to get more metadata ingested so we can try boosting by context
genre, or recently viewed genre, which is sort of in line with "more
specific scoring to find the N best from N*4 candidates". I also want to
do what Ted calls dithering to vary the choices you see.
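
To make "dithering" concrete, here is a minimal sketch of one way to get that effect. The formulation is an assumption, not something spelled out in this thread: re-sort the ranked list by log(rank) plus Gaussian noise, so the top items mostly stay near the top while the tail gets shuffled. The class name, method name, and the noiseStdDev parameter are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

final class Dither {
    /**
     * Returns a reordered copy of an already ranked list (best first).
     * Sort key is log(rank) + noise; this formulation is an assumption.
     * noiseStdDev = 0 leaves the order unchanged, larger values mix more.
     */
    static <T> List<T> dither(List<T> ranked, double noiseStdDev, Random rng) {
        int n = ranked.size();
        double[] key = new double[n];
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            // rank is 1-based, so the best item starts with key log(1) = 0
            key[i] = Math.log(i + 1) + noiseStdDev * rng.nextGaussian();
            order[i] = i;
        }
        // sort original positions by their noisy key, smallest key first
        Arrays.sort(order, Comparator.comparingDouble(i -> key[i]));
        List<T> out = new ArrayList<>(n);
        for (int idx : order) {
            out.add(ranked.get(idx));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> recs = Arrays.asList("item1", "item2", "item3", "item4", "item5");
        // A fixed seed gives a repeatable ordering; vary the seed per page view or session.
        System.out.println(dither(recs, 0.5, new Random(7)));
    }
}
```

Because the key grows with log(rank), swaps near the head of the list need much more noise than swaps deep in the tail, which is the point of the exercise.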

On Nov 8, 2013, at 10:10 AM, Ken Krugler <kkrugler_lists@transpac.com> wrote:

One other thing I should have mentioned is that if you care about
setting weights on incoming terms, you can boost them using the ^<value>
syntax.

E.g. "the_kings_speech^1.5 OR skyfall^0.5 OR looper^3.0…"
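
A query string like the one above can be assembled from a map of item id to boost. This is only a sketch: the class and method names are hypothetical, and it assumes the ids contain no characters that need escaping in the Lucene query syntax.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

final class BoostedOrQuery {
    /** Joins ids into "id^boost OR id^boost ..." in the map's iteration order. */
    static String build(Map<String, Double> boosts) {
        StringJoiner q = new StringJoiner(" OR ");
        for (Map.Entry<String, Double> e : boosts.entrySet()) {
            // Locale.ROOT keeps '.' as the decimal separator on every platform
            q.add(String.format(Locale.ROOT, "%s^%.1f", e.getKey(), e.getValue()));
        }
        return q.toString();
    }

    public static void main(String[] args) {
        Map<String, Double> boosts = new LinkedHashMap<>();
        boosts.put("the_kings_speech", 1.5);
        boosts.put("skyfall", 0.5);
        boosts.put("looper", 3.0);
        // prints: the_kings_speech^1.5 OR skyfall^0.5 OR looper^3.0
        System.out.println(build(boosts));
    }
}
```

The resulting string would be passed as the q parameter; writing the OR explicitly means the result does not depend on the default operator configured for the query parser.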

If you want to account for weights of terms in the index, it's a bit
harder. You can do simple boosting by replicating terms, or you can use
payload-based boosting, or you could code up your own Similarity class
that takes advantage of side-channel data.

But in my experience the gain from applying weights to terms in the
index isn't very significant.

And usually I just use Solr to generate a candidate list, then I do more
specific scoring to find the N best from N*4 candidates.
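
The two-stage idea (fetch N*4 candidates, rescore, keep the N best) might look roughly like the sketch below. The scorer is a stand-in for whatever "more specific scoring" is used; the class name, method name, and the example scores are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

final class CandidateReranker {
    /** Rescores every candidate and returns the n highest-scoring ones, best first. */
    static <T> List<T> topN(List<T> candidates, ToDoubleFunction<T> scorer, int n) {
        List<T> sorted = new ArrayList<>(candidates);
        // The scorer is called repeatedly during the sort; cache scores if it is expensive.
        sorted.sort(Comparator.<T>comparingDouble(scorer).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    public static void main(String[] args) {
        // Pretend Solr returned these 8 ids for N = 2 (that is, N*4 candidates).
        List<String> candidates = Arrays.asList("a", "b", "c", "d", "e", "f", "g", "h");
        // Hypothetical second-stage scores, e.g. from genre or recency metadata.
        Map<String, Double> score = new HashMap<>();
        for (int i = 0; i < candidates.size(); i++) {
            score.put(candidates.get(i), (i * 3) % 8 / 8.0);
        }
        System.out.println(topN(candidates, score::get, 2));
    }
}
```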

-- Ken

On Nov 8, 2013, at 9:54am, Ted Dunning <ted.dunning@gmail.com> wrote:

> For recommendation work, I suggest that it would be better to simply code
> out an explicit OR query.
>
> On Thu, Nov 7, 2013 at 8:11 PM, Ken Krugler <kkrugler_lists@transpac.com> wrote:
>
>> Hi Pat,
>>
>> On Nov 7, 2013, at 7:30pm, Pat Ferrel <pat.ferrel@gmail.com> wrote:
>>
>>> Another approach would be to weight the terms in the docs by their
>>> Mahout similarity strength. But that will be for another day.
>>>
>>> My current question is whether Lucene looks at word proximity. I see the
>>> query syntax supports proximity but I don't see that it is the default, so
>>> that's good.
>>
>> Based on your description of what you do (generate an OR query of N terms)
>> then no, you shouldn't be getting a boost from proximity.
>>
>> Note that with edismax you can specify a phrase boost, but it will be on
>> the entire set of terms being searched, so unlikely to come into play even
>> if you were using that.
>>
>> -- Ken
>>
>>>
>>> On Nov 7, 2013, at 12:41 PM, Dyer, James <James.Dyer@ingramcontent.com> wrote:
>>>
>>> To the best of my knowledge, Lucene does not care about the position of a
>>> keyword within a document.
>>>
>>> You could bucket the ids into several fields. Then use a dismax query
>>> to boost the top-tier ids more than the second, etc.
>>>
>>> A more fine-grained approach would probably involve a custom Similarity
>>> class that scales the score based on its position in the document. If we
>>> did this, it might be simpler to index as 1 single-valued field so each id
>>> was position+1 rather than position+100, etc.
>>>
>>> James Dyer
>>> Ingram Content Group
>>> (615) 213-4311
>>>
>>> -----Original Message-----
>>> From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
>>> Sent: Thursday, November 07, 2013 1:46 PM
>>> To: user@mahout.apache.org
>>> Subject: Re: Solr-recommender for Mahout 0.9
>>>
>>> Interesting to think about ordering and adjacentness. The index ids are
>>> sorted by Mahout strength so the first id is the most similar to the row
>>> key and so forth. But the query is ordered by recency. In both cases the
>>> first id is in some sense the most important. Does Solr/Lucene care about
>>> closeness to the top of doc for queries or indexed docs? I don't recall any
>>> mention of this.
>>>
>>> However, adjacentness has no meaning in recommendations, though I think
>>> it's used in default queries so I may have to account for that.
>>>
>>> The object returned is an ordered list of ids. I use only the IDs now
>>> but there are cases when the contents are also of interest; shopping
>>> cart/watchlist queries for example.
>>>
>>> On Nov 7, 2013, at 10:00 AM, Dyer, James <James.Dyer@ingramcontent.com> wrote:
>>>
>>> The multivalued field will obey the "positionIncrementGap" value you
>>> specify (default=100). So for querying purposes, those ids will be 100
>>> (or whatever you specified) positions apart. So a phrase search for
>>> adjacent ids would not match, unless you set the slop to >=
>>> positionIncrementGap. Other than this, both scenarios index the same.
>>>
>>> For stored fields, Solr returns an array of values for multivalued
>>> fields, which is convenient when writing a UI.
>>>
>>> James Dyer
>>> Ingram Content Group
>>> (615) 213-4311
>>>
>>> -----Original Message-----
>>> From: Dominik Hübner [mailto:contact@dhuebner.com]
>>> Sent: Thursday, November 07, 2013 11:23 AM
>>> To: user@mahout.apache.org
>>> Subject: Re: Solr-recommender for Mahout 0.9
>>>
>>> Does anyone know what the difference is between keeping the ids in a
>>> space-delimited string and indexing a multivalued field of ids? I recently
>>> tried the latter since ... it felt right; however, I am not sure which of
>>> the two has which advantages.
>>>
>>> On 07 Nov 2013, at 18:18, Pat Ferrel <pat.ferrel@gmail.com> wrote:
>>>=20
>>>> I have dismax (no edismax) but am not using it yet, using the default
>>>> query, which does use 'AND'. I had much the same thought as I slept on it.
>>>> Changing to OR is now working much much better. So obvious it almost bit
>>>> me, not good in this case...
>>>>
>>>> With only a trivially small amount of testing I'd say we have a new
>>>> recommender on the block.
>>>>
>>>> If anyone would like to help eyeball test the thing let me know
>>>> off-list. There are a few instructions I'll need to give. And it can't
>>>> handle much load right now due to intentional design limits.
>>>>
>>>> On Nov 7, 2013, at 6:11 AM, Dyer, James <James.Dyer@ingramcontent.com> wrote:
>>>>
>>>> Pat,
>>>>
>>>> Can you give us the query it generates when you enter "vampire werewolf
>>>> zombie", q/qt/defType ?
>>>>
>>>> My guess is you're using the default query parser with "q.op=AND", or
>>>> you're using dismax/edismax with a high "mm" (minimum-should-match) value.
>>>>
>>>> James Dyer
>>>> Ingram Content Group
>>>> (615) 213-4311
>>>>
>>>> -----Original Message-----
>>>> From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
>>>> Sent: Wednesday, November 06, 2013 5:53 PM
>>>> To: ssc@apache.org Schelter; user@mahout.apache.org
>>>> Subject: Re: Solr-recommender for Mahout 0.9
>>>>
>>>> Done,
>>>>
>>>> BTW I have the thing running on a demo site but am getting very poor
>>>> results that I think are related to the Solr setup. I'd appreciate any
>>>> ideas.
>>>>
>>>> The sample data has 27,000 items and something like 4000 users. The
>>>> preference data is fairly dense since the users are professional reviewers
>>>> and the items are videos.
>>>>
>>>> 1) The number of item-item similarities that are kept is 100. Is this a
>>>> good starting point? Ted, do you recall how many you used before?
>>>> 2) The query is a simple text query made of space-delimited video id
>>>> strings. These are the same ids as are stored in the item-item similarity
>>>> docs that Solr indexes.
>>>>
>>>> Hit thumbs up on one video and you get several recommendations. Hit
>>>> thumbs up on several videos and you get no recs. I'm either using the wrong
>>>> query type or have it set up to be too restrictive. As I read through the
>>>> docs, if someone has a suggestion or pointer I'd appreciate it.
>>>>
>>>> BTW the same sort of thing happens with Title search. Search for
>>>> "vampire werewolf zombie" and you get no results; search for "zombie" and
>>>> you get several.
>>>>
>>>> On Nov 6, 2013, at 2:18 PM, Sebastian Schelter <ssc@apache.org> wrote:
>>>>
>>>> Hi Pat,
>>>>
>>>> can you create issues for 1) and 2)? Then I will try to get this into
>>>> trunk asap.
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>> On 06.11.2013 19:13, Pat Ferrel wrote:
>>>>> Trying to integrate the Solr-recommender with the latest Mahout
>>>>> snapshot. The project uses a modified RecommenderJob because it needs
>>>>> SequenceFile output and to get the location of the preparePreferenceMatrix
>>>>> directory. If #1 and #2 are addressed I can remove the modified Mahout code
>>>>> from the project and rely on the default implementations in Mahout 0.9. #3
>>>>> is a longer-term issue related to the creation of a CrossRowSimilarityJob.
>>>>>
>>>>> I have dropped the modified code from the Solr-recommender project and
>>>>> have a modified build of the current Mahout 0.9 snapshot. If the following
>>>>> changes are made to Mahout I can test and release a Mahout 0.9 version of
>>>>> the Solr-recommender.
>>>>>
>>>>> 1. Option to change RecommenderJob output format
>>>>>
>>>>> Can someone add an option to output a SequenceFile? I modified the
>>>>> code to do the following; note the SequenceFileOutputFormat.class as the
>>>>> last parameter, but this should really be determined with an option I think.
>>>>>
>>>>> Job aggregateAndRecommend = prepareJob(
>>>>>       new Path(aggregateAndRecommendInput), outputPath,
>>>>>       SequenceFileInputFormat.class,
>>>>>       PartialMultiplyMapper.class, VarLongWritable.class,
>>>>>       PrefAndSimilarityColumnWritable.class,
>>>>>       AggregateAndRecommendReducer.class, VarLongWritable.class,
>>>>>       RecommendedItemsWritable.class,
>>>>>       SequenceFileOutputFormat.class);
>>>>>
>>>>> 2. Visibility of preparePreferenceMatrix directory location
>>>>>
>>>>> The Solr-recommender needs to find where the RecommenderJob is putting
>>>>> its output.
>>>>>
>>>>> Mahout 0.8 RecommenderJob code was:
>>>>> public static final String DEFAULT_PREPARE_DIR = "preparePreferenceMatrix";
>>>>>
>>>>> Mahout 0.9 RecommenderJob code just puts "preparePreferenceMatrix"
>>>>> inline in the code:
>>>>> Path prepPath = getTempPath("preparePreferenceMatrix");
>>>>>
>>>>> This change to Mahout 0.9 works:
>>>>> public static final String DEFAULT_PREPARE_DIR = "preparePreferenceMatrix";
>>>>> and
>>>>> Path prepPath = getTempPath(DEFAULT_PREPARE_DIR);
>>>>>
>>>>> You could also make this a getter method on the RecommenderJob class
>>>>> instead of using a public constant.
>>>>>
>>>>> 3. Downsampling
>>>>>
>>>>> The downsampling for maximum prefs per user has been moved from
>>>>> PreparePreferenceMatrixJob to RowSimilarityJob. The XRecommenderJob uses
>>>>> matrix math instead of RSJ so it will no longer support downsampling until
>>>>> there is a hypothetical CrossRowSimilarityJob with downsampling in it.
>>>>>
>>
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Cassandra & Solr

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr

