mahout-user mailing list archives

From Pat Ferrel <pat.fer...@gmail.com>
Subject Re: mapreduce ItemSimilarity input optimization
Date Sun, 17 Aug 2014 15:02:36 GMT
1) How many cores are in the cluster? The whole idea behind mapreduce is that as you buy
more CPUs you get a nearly linear decrease in runtime.
2) What is your Mahout command line, with options? Or how are you invoking Mahout? I have
seen the Mahout mapreduce recommender take this long, so we should check what you are doing
about downsampling (see the example command just after this list).
3) Do you really need to RANK your IDs? That's a full sort. When using Pig I usually take
the DISTINCT IDs and assign each an incrementing integer as the corresponding Mahout ID.
4) Your #2, assigning different weights to different actions, usually does not work. I've
done this before, compared offline metrics, and seen precision go down. I'd get this working
using only your primary action first. What are you trying to get the user to do? View something,
buy something? Use that action as the primary preference and start out with a weight of 1
using LLR. With LLR the weights are not used anyway, so your data may not produce good results
with mixed actions.
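
To make that concrete, the command for point 2 should look roughly like this (I think the
downsampling option is --maxPrefsPerUser on the hadoop ItemSimilarityJob, but run it with
--help to check on your version; the paths and numbers here are just placeholders):

    mahout itemsimilarity \
      --input /path/to/triples \
      --output /path/to/similarities \
      --similarityClassname SIMILARITY_LOGLIKELIHOOD \
      --booleanData true \
      --maxPrefsPerUser 500 \
      --maxSimilaritiesPerItem 100

--maxPrefsPerUser and --maxSimilaritiesPerItem are the knobs that downsample; running without
them, or with very large values, could by itself explain a 3 hour run.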

A plug for the (admittedly pre-alpha) spark-itemsimilarity:
1) the output from your step 2 can be directly ingested and will create the output.
2) multiple actions can be used via cross-cooccurrence, not by guessing at weights.
3) the output has your application-specific IDs preserved.
4) it's about 10x faster than mapreduce and will do away with your ID translation steps.

One caveat is that your cluster machines will need lots of memory. I have 8-16 GB on mine.
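
To make the plug concrete: spark-itemsimilarity reads tuples of (user-ID, action, item-ID)
in your own application IDs, so you can point it straight at a log of both actions. A sketch
of an invocation (the flags come from the current snapshot docs and may shift while it's
pre-alpha; the paths, master URL, and action names are placeholders, and the column numbers
assume user,action,item per line):

    mahout spark-itemsimilarity \
      --input /path/to/action-log.csv \
      --output /path/to/indicators \
      --master spark://your-master:7077 \
      --filter1 buy --filter2 view \
      --rowIDColumn 0 --filterColumn 1 --itemIDColumn 2

--filter1 names the primary action, --filter2 the secondary one used for cross-cooccurrence,
and the output comes back keyed by your original IDs, so your ID translation steps go away.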

On Aug 17, 2014, at 1:26 AM, Serega Sheypak <serega.sheypak@gmail.com> wrote:

1. I collect preferences for items using a 60-day sliding window (today -
60 days).
2. I prepare triples of user_id, item_id, discrete_pref_value (3 for an item
view, 5 for clicking the recommendation block; the idea is to give more value
to recommendations which attract visitor attention). I get ~20,000,000
lines with ~1,000,000 distinct items and ~2,000,000 distinct users.
3. I use the Apache Pig RANK function to rank all distinct user_ids.
4. I do the same for item_id.
5. I join the input dataset with the ranked datasets and provide input to
Mahout with dense integer user_id, item_id.
6. I take the Mahout output and join the integer item_id back to get the
natural key value.

Steps #1-2 take ~40 min.
Steps #3-5 take ~1 hour.
The Mahout calculation takes ~3 hours.



2014-08-17 10:45 GMT+04:00 Ted Dunning <ted.dunning@gmail.com>:

> This really doesn't sound right. It should be possible to process almost a
> thousand times that much data every night without much trouble.
> 
> How are you preparing the input data?
> 
> How are you converting to Mahout IDs?
> 
> Even using Python, you should be able to do the conversion in just a few
> minutes without any parallelism whatsoever.
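> 
> To be concrete, here is a minimal single-process sketch of that conversion
> (assuming comma-delimited user,item,pref triples; the file names are made up):
> 
>     import csv
> 
>     user_ids, item_ids = {}, {}  # external ID -> dense integer ID
> 
>     with open("triples.csv") as src, open("mahout_input.csv", "w", newline="") as dst:
>         out = csv.writer(dst)
>         for user, item, pref in csv.reader(src):
>             # hand out the next dense integer the first time an ID appears
>             u = user_ids.setdefault(user, len(user_ids))
>             i = item_ids.setdefault(item, len(item_ids))
>             out.writerow([u, i, pref])
> 
> Two in-memory dicts hold a few million IDs easily, and a single pass over
> 2*10^7 lines finishes in minutes; write the dicts out to files so you can
> map the Mahout results back to your natural keys afterward.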
> 
> 
> 
> 
> On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak <serega.sheypak@gmail.com>
> wrote:
> 
>> Hi, we are trying to calculate ItemSimilarity.
>> Right now we have 2*10^7 input lines. I provide the input data as raw text
>> each day to recalculate item similarities. We get +100..1000 new items
>> each day.
>> 1. It takes too much time to prepare the input data.
>> 2. It takes too much time to convert user_id, item_id to Mahout IDs.
>> 
>> Is there any possibility to provide data to Mahout mapreduce
>> ItemSimilarity using some binary format with compression?
>> 
> 

