mahout-user mailing list archives

From Serega Sheypak <serega.shey...@gmail.com>
Subject Re: recommenditembased returns 0 records from last map-reduce job
Date Mon, 21 Jul 2014 19:57:45 GMT
temp/preparePreferenceMatrix/ratingMatrix
does contain data; it looks like similarity between items...
I'm confused. How can I get item similarity?
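
For a quick look at what the part files under a matrix directory contain, without writing a reader by hand, a sketch like the one below might help. It is only a dry run that prints commands. It assumes the `seqdumper` driver and its `-i` flag are available in your Mahout build (check `mahout seqdumper --help`); the function name is hypothetical, and the directory and the count of 6 part files simply mirror the paths and the loop quoted below.

```shell
#!/bin/sh
# Dry run: print one `mahout seqdumper` command per part file.
# Assumption: the seqdumper driver accepts -i <path>; verify with --help.
print_dump_cmds() {
    dir=$1      # directory holding part-r-0000N files
    count=$2    # number of part files to inspect
    i=0
    while [ "$i" -lt "$count" ]; do
        printf 'mahout seqdumper -i %s/part-r-%05d\n' "$dir" "$i"
        i=$((i + 1))
    done
}

# Prints six commands; review them, then pipe the output to sh to run them.
print_dump_cmds /recommenditembased/temp/similarityMatrix 6
```

Printing the commands first keeps the sketch safe to try on a machine without Mahout installed.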


2014-07-21 23:48 GMT+04:00 Serega Sheypak <serega.sheypak@gmail.com>:

> The code snippet:
>
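> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.Path
> import org.apache.hadoop.io.IntWritable
> import org.apache.hadoop.io.SequenceFile
> import org.apache.mahout.math.VectorWritable
>
>     @Test // (enabled = false)
>     void testReadAll() {
>         (0..5).each {
>             // Double quotes are required for $it to be interpolated in Groovy
>             def pathToFile = new Path("matrixSim/part-r-0000$it")
>             println pathToFile
>             def reader = new SequenceFile.Reader(new Configuration(),
>                     SequenceFile.Reader.file(pathToFile));
>             IntWritable key = new IntWritable();
>             VectorWritable value = new VectorWritable();
>             // Print every non-zero element of every row vector
>             while (reader.next(key, value)) {
>                 def itr = value.get().iterateNonZero()
>                 while (itr.hasNext()) {
>                     println itr.next()
>                 }
>             }
>             reader.close();
>         }
>     }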
>
>
>  2014-07-21 23:46 GMT+04:00 Serega Sheypak <serega.sheypak@gmail.com>:
>
> I've parsed it with Java; the matrix is empty. Why?
>>
>>
>> 2014-07-21 22:41 GMT+04:00 Serega Sheypak <serega.sheypak@gmail.com>:
>>
>> 0.7-cdh4.7.0
>>> Anyway, recommenditembased does produce these catalogs:
>>>
>>> /recommenditembased/temp/maxValues.bin
>>> /recommenditembased/temp/norms.bin
>>> /recommenditembased/temp/numNonZeroEntries.bin
>>> /recommenditembased/temp/pairwiseSimilarity
>>> /recommenditembased/temp/partialMultiply
>>> /recommenditembased/temp/prePartialMultiply1
>>> /recommenditembased/temp/prePartialMultiply2
>>> /recommenditembased/temp/preparePreferenceMatrix
>>> /recommenditembased/temp/similarityMatrix
>>> /recommenditembased/temp/weights
>>>
>>> I suppose that "/recommenditembased/temp/similarityMatrix" is the thing
>>> I need. Right now I'm trying to read it using
>>>
>>> matrix = LOAD '/recommenditembased/temp/similarityMatrix' USING
>>>  com.twitter.elephantbird.pig.load.SequenceFileLoader(
>>>     '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
>>>     '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
>>> )  as (intId: int, vector:tuple(cardinality:int,
>>> entries:bag{t:tuple(some_id:long, some_value:double)}));
>>>
>>>
>>> Looks like the vector is empty... Or I'm doing something wrong.
>>>
>>>
>>>
>>> 2014-07-21 22:09 GMT+04:00 Ted Dunning <ted.dunning@gmail.com>:
>>>
>>> Which version of Mahout?
>>>>
>>>>
>>>> On Mon, Jul 21, 2014 at 11:05 AM, Serega Sheypak <
>>>> serega.sheypak@gmail.com>
>>>> wrote:
>>>>
>>>> > Hi, I've tried it and got: Unexpected --outputPathForSimilarityMatrix
>>>> > while processing Job-Specific
>>>> >
>>>> > sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/output
>>>> > sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/temp
>>>> > sudo -u oozie mahout recommenditembased \
>>>> >                     --input \
>>>> >                     hdfs://nameservice1/user/hive/warehouse/staging_weighted_visits_and_rec_clicks \
>>>> >                     --output \
>>>> >                     hdfs://nameservice1/recommenditembased/output \
>>>> >                     --similarityClassname \
>>>> >                     SIMILARITY_LOGLIKELIHOOD \
>>>> >                    --numRecommendations \
>>>> >                     500 \
>>>> >                     --booleanData \
>>>> >                     false \
>>>> >                     --maxPrefsPerUser \
>>>> >                     1000 \
>>>> >                     --maxSimilaritiesPerItem \
>>>> >                     1000 \
>>>> >                     --minPrefsPerUser \
>>>> >                     5 \
>>>> >                     --maxPrefsPerUserInItemSimilarity \
>>>> >                     30 \
>>>> >                     --threshold \
>>>> >                    1.1 \
>>>> >                     --tempDir \
>>>> >                     hdfs://nameservice1/recommenditembased/temp \
>>>> >                     --outputPathForSimilarityMatrix \
>>>> >                     hdfs://nameservice1/recommenditembased/sim_matrix
>>>> >
>>>> >
>>>> > I'm on Cloudera CDH 4.7; it looks like this feature is not supported.
>>>> >
>>>> >
>>>> > 2014-07-21 11:18 GMT+04:00 Peng Zhang <pzhang.xjtu@gmail.com>:
>>>> >
>>>> > > Serega,
>>>> > >
>>>> > > See the last line on how to pass outputPathForSimilarityMatrix
>>>> options to
>>>> > > the recommenditembased command:
>>>> > >
>>>> > > sudo -u oozie mahout recommenditembased \
>>>> > >                    --input visited_items_with_inverted_items \
>>>> > >                    --output result \
>>>> > >                    --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>>>> > >                    --usersFile inverted_items \
>>>> > >                    --numRecommendations 500 \
>>>> > >                    --booleanData false \
>>>> > >                    --maxPrefsPerUser 100 \
>>>> > >                    --maxSimilaritiesPerItem 500 \
>>>> > >                    --minPrefsPerUser 0 \
>>>> > >                    --maxPrefsPerUserInItemSimilarity 30 \
>>>> > >                    --threshold 0.91 \
>>>> > >                    --tempDir temp \
>>>> > >                    --outputPathForSimilarityMatrix similarityMatri \
>>>> > >
>>>> > >
>>>> > > Peng Zhang
>>>> > > pzhang.xjtu@gmail.com
>>>> > >
>>>> > >
>>>> > >
>>>> > >
>>>> > >
>>>> > > On Jul 21, 2014, at 3:09 PM, Serega Sheypak <
>>>> serega.sheypak@gmail.com>
>>>> > > wrote:
>>>> > >
>>>> > > > I've inspected the code; our approach won't work with
>>>> > > > booleanData=false.
>>>> > > > We calculate item similarity the wrong way...(((
>>>> > > > Thank you.
>>>> > > > 1. We provide a "fake" user_id and pass --usersFile in order to get
>>>> > > > recommendations for the "fake" user_id, where user_id is a negative
>>>> > > > item_id. It worked when we provided user_id->item_id pairs without
>>>> > > > a preference.
>>>> > > > 2. Our target is to get item similarities. We tried
>>>> > > > org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob,
>>>> > > > but it returns bad results compared to RecommenderJob with our "fake"
>>>> > > > user_id (inverted item_id).
>>>> > > >
>>>> > > > 1. I'll try the option you provided.
>>>> > > > 2. I will remove the input with fake user_ids and the usersFile with
>>>> > > > these fake ids.
>>>> > > >
>>>> > > > 3.
>>>> > > > https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
>>>> > > > I don't understand how to pass the --outputPathForSimilarityMatrix
>>>> > > > option to RecommenderJob.
>>>> > > >
>>>> > > >
>>>> > > > 2014-07-21 4:58 GMT+04:00 Peng Zhang <pzhang.xjtu@gmail.com>:
>>>> > > >
>>>> > > >> Serega,
>>>> > > >>
>>>> > > >> I have two comments:
>>>> > > >> 1. Don’t use negative user ids. Since Mahout uses the user id as well
>>>> > > >> as the item id as the row/column index, you’d better use 0, 1, 2, etc.
>>>> > > >> as ids.
>>>> > > >> 2. If you want to get the item similarity information, you can use
>>>> > > >> --outputPathForSimilarityMatrix in the command.
>>>> > > >>
>>>> > > >> Regards,
>>>> > > >> Peng Zhang
>>>> > > >> M: +86 186-1658-7856
>>>> > > >> pzhang.xjtu@gmail.com
>>>> > > >>
>>>> > > >>
>>>> > > >>
>>>> > > >>
>>>> > > >>
>>>> > > >> On Jul 21, 2014, at 4:00 AM, Serega Sheypak <
>>>> serega.sheypak@gmail.com
>>>> > >
>>>> > > >> wrote:
>>>> > > >>
>>>> > > >>> All bad things happen here:
>>>> > > >>>
>>>> > > >>> Name:                 RecommenderJob-PartialMultiplyMapper-Reducer
>>>> > > >>> User:                 oozie
>>>> > > >>> Process User:         oozie
>>>> > > >>> Group:                oozie
>>>> > > >>> Mapper Class:         PartialMultiplyMapper
>>>> > > >>> Reducer Class:        AggregateAndRecommendReducer
>>>> > > >>> Job Input Directory:  hdfs://nameservice1/itemrec/temp/partialMultiply
>>>> > > >>> Job Output Directory: hdfs://nameservice1/itemrec/output/
>>>> > > >>>
>>>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map input records=3312879
>>>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map output records=3313251
>>>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce input records=3313251
>>>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce output records=0
>>>> > > >>>
>>>> > > >>> Why does Mahout return 0 rows? It works when booleanData=true
>>>> > > >>> (preferences are ignored...?)
>>>> > > >>>
>>>> > > >>>
>>>> > > >>>
>>>> > > >>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak <
>>>> serega.sheypak@gmail.com
>>>> > >:
>>>> > > >>>
>>>> > > >>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
>>>> > > >>>> users_file:
>>>> > > >>>> --inverted_item_id
>>>> > > >>>> -1
>>>> > > >>>> -2
>>>> > > >>>> -3
>>>> > > >>>> -4
>>>> > > >>>>
>>>> > > >>>> users_items_prefs
>>>> > > >>>> --inverted item_id
>>>> > > >>>> -1 1 1.0
>>>> > > >>>> -2 2 1.0
>>>> > > >>>> -3 3 1.0
>>>> > > >>>> -4 4 1.0
>>>> > > >>>> --user_id item_id pref_value
>>>> > > >>>> 11   1 1.6
>>>> > > >>>> 11   2 1.6
>>>> > > >>>> 123 3 2.0
>>>> > > >>>> 123 4 2.0
>>>> > > >>>> 333 1 2.0
>>>> > > >>>> 333 2 1.6
>>>> > > >>>> --e.t.c.
>>>> > > >>>>
>>>> > > >>>> if I set --booleanData true
>>>> > > >>>> then mahout returns the result.
>>>> > > >>>>
>>>> > > >>>>
>>>> > > >>>>
>>>> > > >>>>
>>>> > > >>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman <
>>>> > > andrew.musselman@gmail.com
>>>> > > >>> :
>>>> > > >>>>
>>>> > > >>>>> I'm confused about how you're constructing the user file, and why
>>>> > > >>>>> there are negated item ids here.
>>>> > > >>>>>
>>>> > > >>>>> Can you post some more details please, including Mahout version and
>>>> > > >>>>> some sample data sets?
>>>> > > >>>>>
>>>> > > >>>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak
<
>>>> > > >> serega.sheypak@gmail.com>
>>>> > > >>>>> wrote:
>>>> > > >>>>>>
>>>> > > >>>>>> Hi, I'm trying to compute item similarity.
>>>> > > >>>>>> I gather the items which users visit during shopping and then create
>>>> > > >>>>>> a file:
>>>> > > >>>>>> user_id, item_id, weight (where weight can be: [1.0, 1.6, 1.9],
>>>> > > >>>>>> depending on user action type and data source)
>>>> > > >>>>>> UNION
>>>> > > >>>>>> -item_id, item_id, 1.0 (from items dictionary)
>>>> > > >>>>>>
>>>> > > >>>>>> and I provide a usersFile, where user_id = -item_id
>>>> > > >>>>>>
>>>> > > >>>>>> The idea is to get item similarity. If any user visits an item named
>>>> > > >>>>>> "A", I want to show him items "B", "c", "xxx" using the preferences
>>>> > > >>>>>> of other users.
>>>> > > >>>>>>
>>>> > > >>>>>> The problem is that the last (???) mapreduce job returns 0 rows:
>>>> > > >>>>>>
>>>> > > >>>>>> Here are my settings:
>>>> > > >>>>>>
>>>> > > >>>>>>
>>>> > > >>>>>> sudo -u oozie mahout recommenditembased \
>>>> > > >>>>>>                  --input visited_items_with_inverted_items \
>>>> > > >>>>>>                  --output result \
>>>> > > >>>>>>                  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>>>> > > >>>>>>                  --usersFile inverted_items \
>>>> > > >>>>>>                  --numRecommendations 500 \
>>>> > > >>>>>>                  --booleanData false \
>>>> > > >>>>>>                  --maxPrefsPerUser 100 \
>>>> > > >>>>>>                  --maxSimilaritiesPerItem 500 \
>>>> > > >>>>>>                  --minPrefsPerUser 0 \
>>>> > > >>>>>>                  --maxPrefsPerUserInItemSimilarity 30 \
>>>> > > >>>>>>                  --threshold 0.91 \
>>>> > > >>>>>>                  --tempDir temp
>>>> > > >>>>>>
>>>> > > >>>>>> Some counters... I don't understand what they mean....
>>>> > > >>>>>>
>>>> > > >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:   org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
>>>> > > >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:     USERS=7528530
>>>> > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:   org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
>>>> > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_NEGLECTED=1,798,738
>>>> > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_USED=12,429,693
>>>> > > >>>>>>
>>>> > > >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:   org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>>> > > >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:     ROWS=3312879
>>>> > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:   org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>>> > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:     COOCCURRENCES=35882374
>>>> > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:     PRUNED_COOCCURRENCES=0
>>>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map input records=3312879
>>>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map output records=17570268
>>>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce input records=5221907
>>>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce output records=3312879
>>>> > > >>>>>>
>>>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
>>>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
>>>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
>>>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
>>>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map input records=7528530
>>>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map output records=3313251
>>>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce input records=3313251
>>>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce output records=3313251
>>>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map input records=6626130
>>>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map output records=6626130
>>>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce input records=6626130
>>>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce output records=3312879
>>>> > > >>>>>>
>>>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map input records=3312879
>>>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map output records=3313251
>>>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce input records=3313251
>>>> > > >>>>>> --------
>>>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce output records=0
>>>> > > >>>>>> --------
>>>> > > >>>>>>
>>>> > > >>>>>> why 0???
>>>> > > >>>>>
>>>> > > >>>>
>>>> > > >>>>
>>>> > > >>
>>>> > > >>
>>>> > >
>>>> > >
>>>> >
>>>>
>>>
>>>
>>
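
With this many JobClient counter lines in the thread, the step that emits nothing is easy to miss. A small filter along the following lines could flag it. This is only a sketch: the function name is hypothetical, and it assumes the log keeps the `Map/Reduce input/output records=N` wording shown in the quoted output.

```shell
#!/bin/sh
# Print JobClient log lines whose record counter is exactly zero.
# Assumption: counters are logged as "<Map|Reduce> <input|output> records=N".
find_zero_counters() {
    grep -E '(Map|Reduce) (input|output) records=0[[:space:]]*$'
}

# Example with two lines taken from the log quoted in this thread.
find_zero_counters <<'EOF'
14/07/20 22:48:26 INFO mapred.JobClient:     Reduce input records=3313251
14/07/20 22:48:26 INFO mapred.JobClient:     Reduce output records=0
EOF
```

Run against a saved job log (`find_zero_counters < job.log`), it prints only the lines for steps that read or wrote no records.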
>
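
Peng's first suggestion in the thread (use 0, 1, 2, etc. as ids instead of negative ones) can be sketched as a preprocessing step. The function and file names below are hypothetical, and the whitespace-separated `user_id item_id pref_value` layout is taken from the sample data quoted above; adjust the separator to whatever your real input uses.

```shell
#!/bin/sh
# Map arbitrary (possibly negative) user ids to dense indexes 0, 1, 2, ...
# Input:  lines of "user_id item_id pref" separated by whitespace (stdin).
# Output: the same lines with user_id replaced (stdout); the mapping
#         "original_id new_id" is written to the file named by $1.
remap_user_ids() {
    map_file=$1
    awk -v map="$map_file" '
        !($1 in idx) {              # first time this user id is seen
            idx[$1] = n++
            print $1, idx[$1] > map
        }
        { print idx[$1], $2, $3 }
    '
}

# Example with rows shaped like the sample data in this thread.
tmp_map=$(mktemp)
remap_user_ids "$tmp_map" <<'EOF'
-1 1 1.0
-2 2 1.0
11 1 1.6
11 2 1.6
EOF
cat "$tmp_map"      # original id -> new id, one pair per line
rm -f "$tmp_map"
```

Keeping the mapping file makes it possible to translate the ids in the job output back to the original ones afterwards.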
