From: Serega Sheypak
Date: Mon, 21 Jul 2014 23:57:45 +0400
Subject: Re: recommenditembased returns 0 records from last map-reduce job
To: user@mahout.apache.org

temp/preparePreferenceMatrix/ratingMatrix has data, and it looks like
similarity between items... I'm confused. How can I get item similarity?

2014-07-21 23:48 GMT+04:00 Serega Sheypak:

> The code snippet:
>
> @Test//(enabled = false)
> void testReadAll(){
>     (0..5).each {
>         // double quotes, so that $it is interpolated into the file name
>         def pathToFile = new Path("matrixSim/part-r-0000$it")
>         println pathToFile
>         def reader = new SequenceFile.Reader(new Configuration(),
>                 SequenceFile.Reader.file(pathToFile));
>         IntWritable key = new IntWritable();
>         VectorWritable value = new VectorWritable();
>         while(reader.next(key, value)){
>             def itr = value.get().iterateNonZero()
>             while(itr.hasNext()){
>                 println itr.next()
>             }
>         }
>         reader.close();
>     }
> }
>
>
> 2014-07-21 23:46 GMT+04:00 Serega Sheypak:
>
> > I've parsed it via Java; the matrix is empty. Why?
>>
>> 2014-07-21 22:41 GMT+04:00 Serega Sheypak:
>>
>> 0.7-cdh4.7.0
>>> Anyway, recommenditembased does produce these directories:
>>>
>>> /recommenditembased/temp/maxValues.bin
>>> /recommenditembased/temp/norms.bin
>>> /recommenditembased/temp/numNonZeroEntries.bin
>>> /recommenditembased/temp/pairwiseSimilarity
>>> /recommenditembased/temp/partialMultiply
>>> /recommenditembased/temp/prePartialMultiply1
>>> /recommenditembased/temp/prePartialMultiply2
>>> /recommenditembased/temp/preparePreferenceMatrix
>>> /recommenditembased/temp/similarityMatrix
>>> /recommenditembased/temp/weights
>>>
>>> I suppose that "/recommenditembased/temp/similarityMatrix" is the thing
>>> I need. Right now I'm trying to read it using
>>>
>>> matrix = LOAD '/recommenditembased/temp/similarityMatrix' USING
>>> com.twitter.elephantbird.pig.load.SequenceFileLoader(
>>>     '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
>>>     '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
>>> ) as (intId: int, vector:tuple(cardinality:int,
>>>     entries:bag{t:tuple(some_id:long, some_value:double)}));
>>>
>>>
>>> Looks like the vector is empty... Or I'm doing something wrong.
>>>
>>>
>>> 2014-07-21 22:09 GMT+04:00 Ted Dunning:
>>>
>>> Which version of Mahout?
>>>>
>>>> On Mon, Jul 21, 2014 at 11:05 AM, Serega Sheypak <serega.sheypak@gmail.com>
>>>> wrote:
>>>>
>>>> > Hi, I've tried: Unexpected --outputPathForSimilarityMatrix while processing
>>>> > Job-Specific
>>>> >
>>>> > sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/output
>>>> > sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/temp
>>>> > sudo -u oozie mahout recommenditembased \
>>>> >   --input hdfs://nameservice1/user/hive/warehouse/staging_weighted_visits_and_rec_clicks \
>>>> >   --output hdfs://nameservice1/recommenditembased/output \
>>>> >   --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>>>> >   --numRecommendations 500 \
>>>> >   --booleanData false \
>>>> >   --maxPrefsPerUser 1000 \
>>>> >   --maxSimilaritiesPerItem 1000 \
>>>> >   --minPrefsPerUser 5 \
>>>> >   --maxPrefsPerUserInItemSimilarity 30 \
>>>> >   --threshold 1.1 \
>>>> >   --tempDir hdfs://nameservice1/recommenditembased/temp \
>>>> >   --outputPathForSimilarityMatrix hdfs://nameservice1/recommenditembased/sim_matrix
>>>> >
>>>> >
>>>> > I'm on Cloudera CDH 4.7; it looks like this feature is not supported.
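
One thing that may be worth checking in the command above is --threshold 1.1. The sketch below assumes that SIMILARITY_LOGLIKELIHOOD maps a log-likelihood ratio (LLR) to 1 - 1/(1 + LLR), which is how I recall Mahout's LoglikelihoodSimilarity computing it; please verify this against the 0.7-cdh4.7.0 source. The function names are made up for illustration, and the "keep if similarity >= threshold" rule is likewise an assumption. Under those assumptions the similarity stays below 1.0, so a threshold of 1.1 would discard every item pair, while 0.91 would keep only strongly associated pairs.

```shell
#!/bin/sh
# Assumption (not confirmed in this thread): similarity = 1 - 1/(1 + LLR).

# Prints the assumed similarity for a non-negative log-likelihood ratio.
llr_similarity() {
    # $1 = log-likelihood ratio
    awk -v llr="$1" 'BEGIN { printf "%.4f\n", 1 - 1 / (1 + llr) }'
}

# Prints "kept" if the assumed similarity reaches the threshold, else "pruned".
passes_threshold() {
    # $1 = log-likelihood ratio, $2 = value passed as --threshold
    awk -v llr="$1" -v t="$2" 'BEGIN {
        s = 1 - 1 / (1 + llr)
        if (s >= t) print "kept"; else print "pruned"
    }'
}

# Compare the two thresholds that appear in this thread.
for llr in 0 1 10 1000; do
    echo "LLR=$llr similarity=$(llr_similarity "$llr")" \
         "threshold 0.91: $(passes_threshold "$llr" 0.91)" \
         "threshold 1.1: $(passes_threshold "$llr" 1.1)"
done
```

If the assumption holds, rerunning with a threshold below 1.0 (or without --threshold) would be a cheap way to see whether similarityMatrix is still empty.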
>>>> >
>>>> > 2014-07-21 11:18 GMT+04:00 Peng Zhang:
>>>> >
>>>> > > Serega,
>>>> > >
>>>> > > See the last line on how to pass the outputPathForSimilarityMatrix option to
>>>> > > the recommenditembased command:
>>>> > >
>>>> > > sudo -u oozie mahout recommenditembased \
>>>> > >   --input visited_items_with_inverted_items \
>>>> > >   --output result \
>>>> > >   --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>>>> > >   --usersFile inverted_items \
>>>> > >   --numRecommendations 500 \
>>>> > >   --booleanData false \
>>>> > >   --maxPrefsPerUser 100 \
>>>> > >   --maxSimilaritiesPerItem 500 \
>>>> > >   --minPrefsPerUser 0 \
>>>> > >   --maxPrefsPerUserInItemSimilarity 30 \
>>>> > >   --threshold 0.91 \
>>>> > >   --tempDir temp \
>>>> > >   --outputPathForSimilarityMatrix similarityMatri \
>>>> > >
>>>> > >
>>>> > > Peng Zhang
>>>> > > pzhang.xjtu@gmail.com
>>>> > >
>>>> > >
>>>> > > On Jul 21, 2014, at 3:09 PM, Serega Sheypak <serega.sheypak@gmail.com>
>>>> > > wrote:
>>>> > >
>>>> > > > I've inspected the code; our approach wouldn't work with
>>>> > > > booleanData=false.
>>>> > > > We calculate item similarity in the wrong way...(((
>>>> > > > Thank you
>>>> > > > 1. We provide a "fake" user_id and provide --usersFile in order to get
>>>> > > > recommendations for the "fake" user_id, where user_id is a negative item_id.
>>>> > > > It worked when we provided user_id->item_id pairs without preference.
>>>> > > > 2. Our target is to get item similarities. We tried
>>>> > > > org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob but it
>>>> > > > returns bad results compared to RecommenderJob with our "fake" user_id
>>>> > > > (inverted item_id)
>>>> > > >
>>>> > > > 1. I'll try the option you provided.
>>>> > > > 2. I will remove the input with fake user_id and the usersFile with these
>>>> > > > fake ids
>>>> > > >
>>>> > > > 3.
>>>> > > > https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
>>>> > > > I don't understand how to pass the --outputPathForSimilarityMatrix option to
>>>> > > > RecommenderJob
>>>> > > >
>>>> > > >
>>>> > > > 2014-07-21 4:58 GMT+04:00 Peng Zhang:
>>>> > > >
>>>> > > >> Serega,
>>>> > > >>
>>>> > > >> I have two comments:
>>>> > > >> 1. Don’t use negative user ids. Since Mahout uses the user id as well as the
>>>> > > >> item id as the row/column index, you’d better use 0, 1, 2, etc. as ids
>>>> > > >> 2. If you want to get the item similarity information, you can use
>>>> > > >> --outputPathForSimilarityMatrix in the command
>>>> > > >>
>>>> > > >> Regards,
>>>> > > >> Peng Zhang
>>>> > > >> M: +86 186-1658-7856
>>>> > > >> pzhang.xjtu@gmail.com
>>>> > > >>
>>>> > > >>
>>>> > > >> On Jul 21, 2014, at 4:00 AM, Serega Sheypak <serega.sheypak@gmail.com>
>>>> > > >> wrote:
>>>> > > >>
>>>> > > >>> All bad things happen here:
>>>> > > >>>
>>>> > > >>> Name:                 RecommenderJob-PartialMultiplyMapper-Reducer
>>>> > > >>> User:                 oozie
>>>> > > >>> Process User:         oozie
>>>> > > >>> Group:                oozie
>>>> > > >>> Mapper Class:         PartialMultiplyMapper
>>>> > > >>> Reducer Class:        AggregateAndRecommendReducer
>>>> > > >>> Job Input Directory:  hdfs://nameservice1/itemrec/temp/partialMultiply
>>>> > > >>> Job Output Directory: hdfs://nameservice1/itemrec/output/
>>>> > > >>>
>>>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient: Map input records=3312879
>>>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient: Map output records=3313251
>>>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient: Reduce input records=3313251
>>>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient: Reduce output records=0
>>>> > > >>>
>>>> > > >>> Why does Mahout return 0 rows? It works when booleanData=true
>>>> > > >>> (preferences are ignored...?)
>>>> > > >>>
>>>> > > >>>
>>>> > > >>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak <serega.sheypak@gmail.com>:
>>>> > > >>>
>>>> > > >>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
>>>> > > >>>> users_file:
>>>> > > >>>> --inverted_item_id
>>>> > > >>>> -1
>>>> > > >>>> -2
>>>> > > >>>> -3
>>>> > > >>>> -4
>>>> > > >>>>
>>>> > > >>>> users_items_prefs
>>>> > > >>>> --inverted item_id
>>>> > > >>>> -1 1 1.0
>>>> > > >>>> -2 2 1.0
>>>> > > >>>> -3 3 1.0
>>>> > > >>>> -4 4 1.0
>>>> > > >>>> --user_id item_id pref_value
>>>> > > >>>> 11 1 1.6
>>>> > > >>>> 11 2 1.6
>>>> > > >>>> 123 3 2.0
>>>> > > >>>> 123 4 2.0
>>>> > > >>>> 333 1 2.0
>>>> > > >>>> 333 2 1.6
>>>> > > >>>> --etc.
>>>> > > >>>>
>>>> > > >>>> If I set --booleanData true
>>>> > > >>>> then Mahout returns the result.
>>>> > > >>>>
>>>> > > >>>>
>>>> > > >>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman <andrew.musselman@gmail.com>:
>>>> > > >>>>
>>>> > > >>>> I'm confused about how you're constructing the user file, and why there
>>>> > > >>>>> are negated item ids here.
>>>> > > >>>>>
>>>> > > >>>>> Can you post some more details please, including Mahout version and
>>>> > > >>>>> some sample data sets?
>>>> > > >>>>>
>>>> > > >>>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak <serega.sheypak@gmail.com>
>>>> > > >>>>> wrote:
>>>> > > >>>>>>
>>>> > > >>>>>> Hi, I'm trying to create item similarity.
>>>> > > >>>>>> I gather items which users visit during shopping and then create a file:
>>>> > > >>>>>> user_id, item_id, weight (where weight can be one of [1.0, 1.6, 1.9],
>>>> > > >>>>>> depending on user action type and data source)
>>>> > > >>>>>> UNION
>>>> > > >>>>>> -item_id, item_id, 1.0 (from items dictionary)
>>>> > > >>>>>>
>>>> > > >>>>>> and I provide a usersFile, where user_id = -item_id
>>>> > > >>>>>>
>>>> > > >>>>>> The idea is to get item similarity. If any user visits an item named "A",
>>>> > > >>>>>> I want to show him items "B", "c", "xxx" using the preferences of other
>>>> > > >>>>>> users.
>>>> > > >>>>>>
>>>> > > >>>>>> The problem is that the last (???) mapreduce job returns 0 rows:
>>>> > > >>>>>>
>>>> > > >>>>>> Here are my settings:
>>>> > > >>>>>>
>>>> > > >>>>>>
>>>> > > >>>>>> sudo -u oozie mahout recommenditembased \
>>>> > > >>>>>>   --input visited_items_with_inverted_items \
>>>> > > >>>>>>   --output result \
>>>> > > >>>>>>   --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>>>> > > >>>>>>   --usersFile inverted_items \
>>>> > > >>>>>>   --numRecommendations 500 \
>>>> > > >>>>>>   --booleanData false \
>>>> > > >>>>>>   --maxPrefsPerUser 100 \
>>>> > > >>>>>>   --maxSimilaritiesPerItem 500 \
>>>> > > >>>>>>   --minPrefsPerUser 0 \
>>>> > > >>>>>>   --maxPrefsPerUserInItemSimilarity 30 \
>>>> > > >>>>>>   --threshold 0.91 \
>>>> > > >>>>>>   --tempDir temp \
>>>> > > >>>>>>
>>>> > > >>>>>> Some counters... I don't understand what they mean....
>>>> > > >>>>>>
>>>> > > >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient: org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
>>>> > > >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient: USERS=7528530
>>>> > > >>>>>>
>>>> > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient: org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
>>>> > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient: USER_RATINGS_NEGLECTED=1,798,738
>>>> > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient: USER_RATINGS_USED=12,429,693
>>>> > > >>>>>>
>>>> > > >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient: org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>>> > > >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient: ROWS=3312879
>>>> > > >>>>>>
>>>> > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient: org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>>> > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient: COOCCURRENCES=35882374
>>>> > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient: PRUNED_COOCCURRENCES=0
>>>> > > >>>>>>
>>>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Map input records=3312879
>>>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Map output records=17570268
>>>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Reduce input records=5221907
>>>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Reduce output records=3312879
>>>> > > >>>>>>
>>>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce input records=3312879
>>>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce output records=3312879
>>>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce input records=3312879
>>>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce output records=3312879
>>>> > > >>>>>>
>>>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Map input records=7528530
>>>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Map output records=3313251
>>>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Reduce input records=3313251
>>>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Reduce output records=3313251
>>>> > > >>>>>>
>>>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Map input records=6626130
>>>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Map output records=6626130
>>>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Reduce input records=6626130
>>>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Reduce output records=3312879
>>>> > > >>>>>>
>>>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Map input records=3312879
>>>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Map output records=3313251
>>>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Reduce input records=3313251
>>>> > > >>>>>>
>>>> > > >>>>>> --------
>>>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Reduce output records=0
>>>> > > >>>>>> --------
>>>> > > >>>>>>
>>>> > > >>>>>> why 0???
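
For readers following the thread, the input layout described in the oldest message (real preferences UNION one "-item_id, item_id, 1.0" row per item, plus a users file of negated item ids) can be sketched roughly as below. This is only an illustration of the layout being discussed: the function name, file names, tab delimiter and sample rows are all hypothetical, and Peng Zhang's reply advises against negative ids, so it is not a recommended setup.

```shell
#!/bin/sh
# Sketch only: names, delimiter and sample rows are hypothetical.

# Builds the union preference file and the users file from
# $1 = real preferences (user_id<TAB>item_id<TAB>weight),
# $2 = item dictionary (one item_id per line), $3 = output directory.
build_inputs() {
    real_prefs=$1; item_dict=$2; out_dir=$3
    mkdir -p "$out_dir"
    # One synthetic row per item: -item_id, item_id, 1.0
    awk '{ printf "-%s\t%s\t1.0\n", $1, $1 }' "$item_dict" > "$out_dir/fake_prefs.tsv"
    # Union of real and synthetic rows (what would be passed as --input)
    cat "$real_prefs" "$out_dir/fake_prefs.tsv" > "$out_dir/prefs.tsv"
    # Negated item ids (what would be passed as --usersFile)
    awk '{ print "-" $1 }' "$item_dict" > "$out_dir/users.txt"
}

# Tiny demonstration with made-up rows.
demo_dir=$(mktemp -d)
printf '11\t1\t1.6\n11\t2\t1.6\n333\t1\t2.0\n' > "$demo_dir/real.tsv"
printf '1\n2\n' > "$demo_dir/items.txt"
build_inputs "$demo_dir/real.tsv" "$demo_dir/items.txt" "$demo_dir/out"
cat "$demo_dir/out/prefs.tsv"
```

Each synthetic "user" has exactly one preference in this layout, which is worth keeping in mind when reading the --minPrefsPerUser values used in the commands above.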