mahout-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: Crude distributed recommender performance / cost stats
Date Thu, 27 May 2010 15:00:55 GMT
Could do... converting the input or modifying the mapper is trivial enough
that it's almost not worth committing and maintaining. It's all just the
stock algorithm otherwise.

I will dump this into the wiki with some other thoughts.

On Thu, May 27, 2010 at 3:57 PM, Grant Ingersoll <gsingers@apache.org> wrote:
> Great stuff, Sean!  The code sounds like it could go into examples with some
> of the other Wikipedia stuff?
>
> Also, how about copy-and-pasting it to https://cwiki.apache.org/confluence/display/MAHOUT/MahoutBenchmarks?
>
> -Grant
>
> On May 26, 2010, at 1:32 PM, Sean Owen wrote:
>
>> The only customization needed was in the first mapper/reducer to parse
>> the particular format of the input:
>>
>> http://users.on.net/~henry/home/wikipedia.htm
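>>
>> (Roughly, each line of that dump gives a source article ID followed by the
>> IDs of the articles it links to -- something like "1234: 5678 91011", though
>> those particular numbers are just made-up examples and the exact format is
>> described at the link. The mappers below simply pull the numbers out of each
>> line in order.)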
>>
>> I can post the code somewhere... it's in the book too. Oh why not do
>> it here, it's pasted later.
>>
>> The rest is just the stock code from HEAD in SVN. The command line is
>> something like:
>>
>> hadoop jar mahout-core-0.4-SNAPSHOT.job
>> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
>> -Dmapred.input.dir=input/input.txt -Dmapred.output.dir=output
>> --booleanData true
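>>
>> For what it's worth, here's a sketch of an equivalent programmatic
>> invocation, assuming RecommenderJob can be driven through Hadoop's
>> ToolRunner the same way its own main() drives it (the class name here is
>> made up, and the paths are just the placeholders from the command above):
>>
>> import org.apache.hadoop.util.ToolRunner;
>> import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;
>>
>> public final class RunWikipediaRecommender {
>>
>>  public static void main(String[] args) throws Exception {
>>    // Same arguments as the command line above: input/output paths, plus
>>    // --booleanData since the link graph carries no preference values
>>    ToolRunner.run(new RecommenderJob(), new String[] {
>>        "-Dmapred.input.dir=input/input.txt",
>>        "-Dmapred.output.dir=output",
>>        "--booleanData", "true"
>>    });
>>  }
>>
>> }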
>>
>> Your mileage may vary depending on the machine you run it on of course.
>>
>>
>>
>> import java.io.IOException;
>> import java.util.regex.Matcher;
>> import java.util.regex.Pattern;
>>
>> import org.apache.hadoop.io.IntWritable;
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.io.VLongWritable;
>> import org.apache.hadoop.mapred.MapReduceBase;
>> import org.apache.hadoop.mapred.Mapper;
>> import org.apache.hadoop.mapred.OutputCollector;
>> import org.apache.hadoop.mapred.Reporter;
>>
>> public final class WikipediaItemIDIndexMapper extends MapReduceBase implements
>>    Mapper<LongWritable,Text,IntWritable,VLongWritable> {
>>
>>  private static final Pattern NUMBERS = Pattern.compile("(\\d+)");
>>
>>  @Override
>>  public void map(LongWritable key,
>>                  Text value,
>>                  OutputCollector<IntWritable,VLongWritable> output,
>>                  Reporter reporter) throws IOException {
>>    String line = value.toString();
>>    Matcher m = NUMBERS.matcher(line);
>>    // Skip the first number on the line (the source article, i.e. the "user" ID)
>>    m.find();
>>    IntWritable index = new IntWritable();
>>    VLongWritable itemID = new VLongWritable();
>>    // Every remaining number is a linked article, i.e. an "item" ID
>>    while (m.find()) {
>>      long item = Long.parseLong(m.group());
>>      itemID.set(item);
>>      index.set(idToIndex(item));
>>      output.collect(index, itemID);
>>    }
>>  }
>>
>>  // Hash a long item ID down to a non-negative int index
>>  static int idToIndex(long itemID) {
>>    return 0x7FFFFFFF & ((int) itemID ^ (int) (itemID >>> 32));
>>  }
>>
>> }
>>
>>
>> import java.io.IOException;
>> import java.util.regex.Matcher;
>> import java.util.regex.Pattern;
>>
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.io.VLongWritable;
>> import org.apache.hadoop.mapred.MapReduceBase;
>> import org.apache.hadoop.mapred.Mapper;
>> import org.apache.hadoop.mapred.OutputCollector;
>> import org.apache.hadoop.mapred.Reporter;
>>
>> public final class WikipediaToItemPrefsMapper extends MapReduceBase implements
>>    Mapper<LongWritable,Text,VLongWritable,VLongWritable> {
>>
>>  private static final Pattern NUMBERS = Pattern.compile("(\\d+)");
>>
>>  @Override
>>  public void map(LongWritable key,
>>                  Text value,
>>                  OutputCollector<VLongWritable,VLongWritable> output,
>>                  Reporter reporter) throws IOException {
>>    String line = value.toString();
>>    Matcher m = NUMBERS.matcher(line);
>>    // First number on the line is the source article, i.e. the "user" ID
>>    if (!m.find()) {
>>      return; // defensive: skip a line with no numbers at all
>>    }
>>    VLongWritable userID = new VLongWritable(Long.parseLong(m.group()));
>>    VLongWritable itemID = new VLongWritable();
>>    // Emit one (user, item) pair per linked article on the line
>>    while (m.find()) {
>>      itemID.set(Long.parseLong(m.group()));
>>      output.collect(userID, itemID);
>>    }
>>  }
>>
>> }
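>>
>> As a rough standalone illustration (not part of the job; the class name and
>> sample line are made up for demonstration), this is what the regex parsing in
>> the two mappers above does to a single line of the dump:
>>
>> import java.util.regex.Matcher;
>> import java.util.regex.Pattern;
>>
>> public final class WikipediaLineParseDemo {
>>
>>  private static final Pattern NUMBERS = Pattern.compile("(\\d+)");
>>
>>  public static void main(String[] args) {
>>    // Hypothetical input line: article 1234 links to articles 5678 and 91011
>>    String line = "1234: 5678 91011";
>>    Matcher m = NUMBERS.matcher(line);
>>    if (!m.find()) {
>>      return; // no numbers on the line
>>    }
>>    // First number is the source article, i.e. the "user"
>>    long userID = Long.parseLong(m.group());
>>    System.out.println("user (source article): " + userID);
>>    // Remaining numbers are the linked articles, i.e. the "items"
>>    while (m.find()) {
>>      long itemID = Long.parseLong(m.group());
>>      System.out.println("  item (linked article): " + itemID);
>>    }
>>  }
>>
>> }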
>>
>>
>> On Wed, May 26, 2010 at 4:45 PM, Jake Mannix <jake.mannix@gmail.com> wrote:
>>> Hey Sean,
>>>
>>>  Very cool!  Is there any custom code you used to import the link data /
>>> instructions on how to reproduce this?
>>>
>>>  -jake
>>>
>>> On May 26, 2010 8:09 AM, "Sean Owen" <srowen@gmail.com> wrote:
>>>
>>> Hi all, thought the list might be interested in some recent numbers I
>>> collected on the distributed recommender running for real on Hadoop. I just
>>> finished running a set of recommendations based on the Wikipedia link
>>> graph, for book purposes (yeah, it's unconventional). I ran on my
>>> laptop, but it ought to be crudely representative of how it runs in a
>>> real cluster.
>>>
>>> The input is 1058MB as a text file and contains 130M article-article
>>> associations, from 5.7M articles to 3.8M distinct articles ("users"
>>> and "items", respectively). I estimate cost based on Amazon's North
>>> American small Linux-based instance pricing of $0.085/hour. I ran on a
>>> dual-core laptop with plenty of RAM, allowing 1GB per worker, so the
>>> extrapolation from CPU time to instance cost should be roughly valid.
>>>
>>> In this run, I generated recommendations for all 5.7M "users". You could of
>>> course run for any subset of users instead.
>>>
>>> Phase 1 (Item ID to item index mapping)
>>> 29 minutes CPU time
>>> $0.05
>>> Output: 60MB
>>>
>>> Phase 2 (Create user vectors)
>>> 88 minutes CPU time
>>> $0.13
>>> Output: 1159MB
>>>
>>> Phase 3 (Count co-occurrence)
>>> 77 hours CPU time
>>> $6.54
>>> Output: 23.6GB
>>>
>>> Phase 4 (Partial multiply prep)
>>> 636 minutes CPU time
>>> $0.90
>>> Output: 24.6GB
>>>
>>> Phase 5 (Aggregate and recommend)
>>> about 600 hours CPU time
>>> about $51.00
>>> Output: about 10GB
>>> (I estimated these rather than let it run at home for days!)
>>>
>>>
>>> Note that phases 1 and 3 may be run less frequently, and need not be
>>> run every time.
>>> But the cost is dominated by the last step, which is most of the work.
>>> I've ignored storage costs here.
>>>
>>> This implies a cost of $0.01 (or about 8 instance-minutes) per 1,000
>>> user recommendations. That's not bad if, say, you want to update recs
>>> for your site's 100,000 daily active users for a dollar.
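>>>
>>> (As a rough check on that figure: the per-phase estimates above sum to about
>>> $0.05 + $0.13 + $6.54 + $0.90 + $51 ≈ $59, and $59 / 5,700 thousand users ≈
>>> $0.01 per 1,000 users. Likewise the total of roughly 41,000 instance-minutes
>>> divided by 5,700 comes to about 7-8 instance-minutes per 1,000 users.)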
>>>
>>> There are several levers one could pull internally to sacrifice
>>> accuracy for speed, but it's currently set to pretty normal values. So
>>> this is just one possibility.
>>>
>>> Now that's not terrible, but it is about 8x more computing than would
>>> be needed by a non-distributed implementation *if* you could fit the
>>> whole data set into a very large instance's memory, which is still
>>> possible at this scale but needs a pretty big instance. That's a very
>>> apples-to-oranges comparison of course; different algorithms, entirely
>>> different environments. This is about the amount of overhead I'd
>>> expect from distributing -- interesting to note how non-trivial it is.
>>>
>>>
>>> Still to-do is to actually run this on EMR at some point or a real
>>> cluster to see how well this estimate holds up.
>>> And still to-do is to make this faster.
>>>
>
>
>
