mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: Solr-recommender for Mahout 0.9
Date Fri, 08 Nov 2013 21:27:08 GMT
Not planning to do anything with weights at present. An ORed query and Solr's own weights should
suffice for the time being. There is a good list of ways to do this later if it warrants an experiment.
Thanks.

I have similar items as input and recommendations from user “likes” working, and just got recs from
recently viewed items working. Once you have online recs from the pre-calculated model, experimenting
is super easy. The next step will be to get more metadata ingested so we can try boosting
by context genre, or recent genre viewed, which is sort of in line with “more specific scoring
to find the N best from N*4 candidates”. Also want to do what Ted calls dithering to vary
the choices you see.
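
[Editor's sketch] The thread does not spell out how dithering is done. A minimal sketch, assuming the rank-based form where each item gets a synthetic score of log(rank) plus Gaussian noise and the list is re-sorted by that score. The class name Dithering, the method dither, and the sigma parameter are hypothetical names chosen for illustration, not part of Mahout or the Solr-recommender.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

class Dithering {
    // Re-orders an already ranked list by log(rank) + sigma * N(0, 1).
    // Assumption: sigma = 0 leaves the order unchanged; larger sigma mixes more.
    static <T> List<T> dither(List<T> ranked, double sigma, Random rng) {
        int n = ranked.size();
        double[] noisy = new double[n];
        List<Integer> order = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            // Rank is 1-based, so the top item starts at log(1) = 0.
            noisy[i] = Math.log(i + 1) + sigma * rng.nextGaussian();
            order.add(i);
        }
        // Sort original positions by their noisy score, lowest first.
        order.sort(Comparator.comparingDouble(i -> noisy[i]));
        List<T> out = new ArrayList<>(n);
        for (int i : order) {
            out.add(ranked.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> recs = Arrays.asList("a", "b", "c", "d", "e");
        // A fresh seed per session would give each visit a slightly different page.
        System.out.println(dither(recs, 0.5, new Random()));
    }
}
```

Because the noise is added to log(rank), items near the top tend to swap with near neighbours while deep items can occasionally surface, which is the "vary the choices you see" effect.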

On Nov 8, 2013, at 10:10 AM, Ken Krugler <kkrugler_lists@transpac.com> wrote:

One other thing I should have mentioned is that if you care about setting weights on incoming
terms, you can boost them using the ^<value> syntax.

E.g. "the_kings_speech^1.5 OR skyfall^0.5 OR looper^3.0…"
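
[Editor's sketch] A query string like the one above can be assembled from a map of item id to boost. This is only an illustration: BoostedOrQuery and build are hypothetical names, and the code assumes the ids contain only letters, digits and underscores, so no query-syntax escaping is done.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class BoostedOrQuery {
    // Joins entries into "id^boost OR id^boost ..." using the ^<value> boost syntax.
    // A LinkedHashMap keeps the terms in insertion order.
    static String build(Map<String, Double> boosts) {
        StringJoiner query = new StringJoiner(" OR ");
        for (Map.Entry<String, Double> e : boosts.entrySet()) {
            query.add(e.getKey() + "^" + e.getValue());
        }
        return query.toString();
    }

    public static void main(String[] args) {
        Map<String, Double> boosts = new LinkedHashMap<>();
        boosts.put("the_kings_speech", 1.5);
        boosts.put("looper", 3.0);
        // prints: the_kings_speech^1.5 OR looper^3.0
        System.out.println(build(boosts));
    }
}
```

Ids that can contain spaces, colons or other query-syntax characters would need escaping before being placed in the query.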

If you want to account for weights of terms in the index, it's a bit harder. You can do simple
boosting by replicating terms, or you can use payload-based boosting, or you could code up
your own Similarity class that takes advantage of side-channel data.

But in my experience the gain from applying weights to terms in the index isn't very significant.

And usually I just use Solr to generate a candidate list, then I do more specific scoring to find
the N best from N*4 candidates.
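
[Editor's sketch] The two-stage approach described here (ask Solr for about N*4 candidates, then rescore and keep N) might look like the following. The Solr call itself is left out; CandidateRescorer, topN and the scorer are hypothetical, and the "more specific scoring" is whatever function the caller supplies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

class CandidateRescorer {
    // Keeps the n best candidates by a caller-supplied score (higher is better).
    // The sort is stable, so ties stay in the order the candidates arrived in.
    static List<String> topN(List<String> candidates, ToDoubleFunction<String> scorer, int n) {
        List<String> sorted = new ArrayList<>(candidates);
        Comparator<String> byScore = Comparator.comparingDouble(scorer);
        sorted.sort(byScore.reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    public static void main(String[] args) {
        // Pretend these 8 ids came back from Solr for n = 2 (n * 4 candidates).
        List<String> candidates = Arrays.asList("a", "b", "c", "d", "e", "f", "g", "h");
        Map<String, Double> specificScore = new HashMap<>();
        specificScore.put("c", 0.9);
        specificScore.put("f", 0.7);
        // Unscored ids default to 0.0 in this toy scorer.
        System.out.println(topN(candidates, id -> specificScore.getOrDefault(id, 0.0), 2));
    }
}
```

The factor of 4 is Ken's rule of thumb from the message above; the sketch takes whatever candidate list it is given.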

-- Ken

On Nov 8, 2013, at 9:54am, Ted Dunning <ted.dunning@gmail.com> wrote:

> For recommendation work, I suggest that it would be better to simply code
> out an explicit OR query.
> 
> 
> 
> 
> On Thu, Nov 7, 2013 at 8:11 PM, Ken Krugler <kkrugler_lists@transpac.com> wrote:
> 
>> Hi Pat,
>> 
>> On Nov 7, 2013, at 7:30pm, Pat Ferrel <pat.ferrel@gmail.com> wrote:
>> 
>>> Another approach would be to weight the terms in the docs by their
>> Mahout similarity strength. But that will be for another day.
>>> 
>>> My current question is whether Lucene looks at word proximity. I see the
>> query syntax supports proximity but I don’t see that it is default so
>> that’s good.
>> 
>> Based on your description of what you do (generate an OR query of N terms)
>> then no, you shouldn't be getting a boost from proximity.
>> 
>> Note that with edismax you can specify a phrase boost, but it will be on
>> the entire set of terms being searched, so unlikely to come into play even
>> if you were using that.
>> 
>> -- Ken
>> 
>> 
>>> 
>>> 
>>> On Nov 7, 2013, at 12:41 PM, Dyer, James <James.Dyer@ingramcontent.com>
>> wrote:
>>> 
>>> Best to my knowledge, Lucene does not care about the position of a
>> keyword within a document.
>>> 
>>> You could bucket the ids into several fields.  Then use a dismax query
>> to boost the top-tier ids more than the second, etc.
>>> 
>>> A more fine-grained approach would probably involve a custom Similarity
>> class that scales the score based on its position in the document.  If we
>> did this, it might be simpler to index as 1 single-valued field so each id
>> was position+1 rather than position+100, etc.
>>> 
>>> James Dyer
>>> Ingram Content Group
>>> (615) 213-4311
>>> 
>>> 
>>> -----Original Message-----
>>> From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
>>> Sent: Thursday, November 07, 2013 1:46 PM
>>> To: user@mahout.apache.org
>>> Subject: Re: Solr-recommender for Mahout 0.9
>>> 
>>> Interesting to think about ordering and adjacentness. The index ids are
>> sorted by Mahout strength so the first id is the most similar to the row
>> key and so forth. But the query is ordered by recency. In both cases the
>> first id is in some sense the most important. Does Solr/Lucene care about
>> closeness to the top of doc for queries or indexed docs? I don't recall any
>> mention of this.
>>> 
>>> However adjacentness has no meaning in recommendations though I think
>> it's used in default queries so I may have to account for that.
>>> 
>>> The object returned is an ordered list of ids. I use only the IDs now
>> but there are cases when the contents are also of interest; shopping
>> cart/watchlist queries for example.
>>> 
>>> On Nov 7, 2013, at 10:00 AM, Dyer, James <James.Dyer@ingramcontent.com>
>> wrote:
>>> 
>>> The multivalued field will obey the "positionIncrementGap" value you
>> specify (default=100).  So for querying purposes, those id's will be 100
>> (or whatever you specified) positions apart.  So a phrase search for
>> adjacent ids would not match, unless you set the slop for >=
>> positionIncrementGap.  Other than this, both scenarios index the same.
>>> 
>>> For stored fields, solr returns an array of values for multivalued
>> fields, which is convenient when writing a UI.
>>> 
>>> James Dyer
>>> Ingram Content Group
>>> (615) 213-4311
>>> 
>>> 
>>> -----Original Message-----
>>> From: Dominik Hübner [mailto:contact@dhuebner.com]
>>> Sent: Thursday, November 07, 2013 11:23 AM
>>> To: user@mahout.apache.org
>>> Subject: Re: Solr-recommender for Mahout 0.9
>>> 
>>> Does anyone know what the difference is between keeping the ids in a
>> space delimited string and indexing a multivalued field of ids? I recently
>> tried the latter since ... it felt right, however I am not sure what the
>> advantages of each are.
>>> 
>>> On 07 Nov 2013, at 18:18, Pat Ferrel <pat.ferrel@gmail.com> wrote:
>>> 
>>>> I have dismax (no edismax) but am not using it yet, using the default
>> query, which does use 'AND'. I had much the same thought as I slept on it.
>> Changing to OR is now working much much better. So obvious it almost bit
>> me, not good in this case...
>>>> 
>>>> With only a trivially small amount of testing I'd say we have a new
>> recommender on the block.
>>>> 
>>>> If anyone would like to help eyeball test the thing let me know
>> off-list. There are a few instructions I'll need to give. And it can't
>> handle much load right now due to intentional design limits.
>>>> 
>>>> 
>>>> On Nov 7, 2013, at 6:11 AM, Dyer, James <James.Dyer@ingramcontent.com>
>> wrote:
>>>> 
>>>> Pat,
>>>> 
>>>> Can you give us the query it generates when you enter "vampire werewolf
>> zombie", q/qt/defType ?
>>>> 
>>>> My guess is you're using the default query parser with "q.op=AND" , or,
>> you're using dismax/edismax with a high "mm" (min-must-match) value.
>>>> 
>>>> James Dyer
>>>> Ingram Content Group
>>>> (615) 213-4311
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
>>>> Sent: Wednesday, November 06, 2013 5:53 PM
>>>> To: ssc@apache.org Schelter; user@mahout.apache.org
>>>> Subject: Re: Solr-recommender for Mahout 0.9
>>>> 
>>>> Done,
>>>> 
>>>> BTW I have the thing running on a demo site but am getting very poor
>> results that I think are related to the Solr setup. I'd appreciate any
>> ideas.
>>>> 
>>>> The sample data has 27,000 items and something like 4000 users. The
>> preference data is fairly dense since the users are professional reviewers
>> and the items videos.
>>>> 
>>>> 1) The number of item-item similarities that are kept is 100. Is this a
>> good starting point? Ted, do you recall how many you used before?
>>>> 2) The query is a simple text query made of space delimited video id
>> strings. These are the same ids as are stored in the item-item similarity
>> docs that Solr indexes.
>>>> 
>>>> Hit thumbs up on one video and you get several recommendations. Hit
>> thumbs up on several videos and you get no recs. I'm either using the wrong
>> query type or have it set up to be too restrictive. As I read through the
>> docs if someone has a suggestion or pointer I'd appreciate it.
>>>> 
>>>> BTW the same sort of thing happens with Title search. Search for
>> "vampire werewolf zombie" you get no results, search for "zombie" you get
>> several.
>>>> 
>>>> On Nov 6, 2013, at 2:18 PM, Sebastian Schelter <ssc@apache.org> wrote:
>>>> 
>>>> Hi Pat,
>>>> 
>>>> can you create issues for 1) and 2) ? Then I will try to get this into
>>>> trunk asap.
>>>> 
>>>> Best,
>>>> Sebastian
>>>> 
>>>> On 06.11.2013 19:13, Pat Ferrel wrote:
>>>>> Trying to integrate the Solr-recommender with the latest Mahout
>> snapshot. The project uses a modified RecommenderJob because it needs
>> SequenceFile output and to get the location of the preparePreferenceMatrix
>> directory. If #1 and #2 are addressed I can remove the modified Mahout code
>> from the project and rely on the default implementations in Mahout 0.9. #3
>> is a longer term issue related to the creation of a CrossRowSimilarityJob.
>>>>> 
>>>>> I have dropped the modified code from the Solr-recommender project and
>> have a modified build of the current Mahout 0.9 snapshot. If the following
>> changes are made to Mahout I can test and release a Mahout 0.9 version of
>> the Solr-recommender.
>>>>> 
>>>>> 1. Option to change RecommenderJob output format
>>>>> 
>>>>> Can someone add an option to output a SequenceFile? I modified the
>> code to do the following; note the SequenceFileOutputFormat.class as the
>> last parameter, but this should really be determined with an option, I think.
>>>>> 
>>>>> Job aggregateAndRecommend = prepareJob(
>>>>>       new Path(aggregateAndRecommendInput), outputPath, SequenceFileInputFormat.class,
>>>>>       PartialMultiplyMapper.class, VarLongWritable.class, PrefAndSimilarityColumnWritable.class,
>>>>>       AggregateAndRecommendReducer.class, VarLongWritable.class, RecommendedItemsWritable.class,
>>>>>       SequenceFileOutputFormat.class);
>>>>> 
>>>>> 2. Visibility of preparePreferenceMatrix directory location
>>>>> 
>>>>> The Solr-recommender needs to find where the RecommenderJob is putting
>> its output.
>>>>> 
>>>>> Mahout 0.8 RecommenderJob code was:
>>>>> public static final String DEFAULT_PREPARE_DIR = "preparePreferenceMatrix";
>>>>> 
>>>>> Mahout 0.9 RecommenderJob code just puts "preparePreferenceMatrix"
>> inline in the code:
>>>>> Path prepPath = getTempPath("preparePreferenceMatrix");
>>>>> 
>>>>> This change to Mahout 0.9 works:
>>>>> public static final String DEFAULT_PREPARE_DIR = "preparePreferenceMatrix";
>>>>> and
>>>>> Path prepPath = getTempPath(DEFAULT_PREPARE_DIR);
>>>>> 
>>>>> You could also make this a getter method on the RecommenderJob Class
>> instead of using a public constant.
>>>>> 
>>>>> 3. Downsampling
>>>>> 
>>>>> The downsampling for maximum prefs per user has been moved from
>> PreparePreferenceMatrixJob to RowSimilarityJob. The XRecommenderJob uses
>> matrix math instead of RSJ so it will no longer support downsampling until
>> there is a hypothetical CrossRowSimilarityJob with downsampling in it.
>>>>> 
>>>>> 
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Cassandra & Solr