mahout-dev mailing list archives

From "Goel, Ankur" <ankur.g...@corp.aol.com>
Subject RE: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer
Date Fri, 23 Jan 2009 06:17:59 GMT


Thanks Sean, I will try this one out on my dataset and keep the list
posted on how well it worked.

Regards
-Ankur

-----Original Message-----
From: Sean Owen [mailto:srowen@gmail.com] 
Sent: Friday, January 23, 2009 5:55 AM
To: mahout-dev@lucene.apache.org
Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer

Here's a DataModel you could try out for your purposes; the rest
should be as I described earlier.


package org.apache.mahout.cf.taste.example;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericPreference;
import org.apache.mahout.cf.taste.impl.model.BooleanPrefUser;
import org.apache.mahout.cf.taste.impl.common.FastSet;
import org.apache.mahout.cf.taste.model.Preference;
import org.apache.mahout.cf.taste.model.Item;
import org.apache.mahout.cf.taste.model.User;

import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.ArrayList;

public final class AnkursDataModel extends FileDataModel {

  public AnkursDataModel(File ratingsFile) throws IOException {
    super(ratingsFile);
  }

  @Override
  protected void processLine(String line,
                             Map<String, List<Preference>> data,
                             Map<String, Item> itemCache) {
    String[] tokens = line.split("\t");
    String userID = tokens[0];
    List<Preference> prefs = new ArrayList<Preference>(tokens.length - 1);
    for (int tokenNum = 1; tokenNum < tokens.length; tokenNum++) {
      String itemID = tokens[tokenNum];
      Item item = itemCache.get(itemID);
      if (item == null) {
        item = buildItem(itemID);
        itemCache.put(itemID, item);
      }
      // This is a little ugly but makes it easy to reuse FileDataModel:
      // the pref values are tossed in buildUser() below.
      prefs.add(new GenericPreference(null, item, 1.0));
    }
    data.put(userID, prefs);
  }

  @Override
  protected User buildUser(String id, List<Preference> prefs) {
    FastSet<Object> itemIDs = new FastSet<Object>();
    for (Preference pref : prefs) {
      itemIDs.add(pref.getItem().getID());
    }
    return new BooleanPrefUser(id, itemIDs);
  }
}
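
Note that processLine() above uses each tab-separated token whole as the item ID. If the lines carry tokens of the form item-id:other_info, as in the quoted message below, the suffix would presumably need to be stripped first. A minimal standalone sketch of that parsing step, assuming ':' never occurs inside an item ID; the class LineFormatSketch and its methods are hypothetical and not part of Mahout:

```java
import java.util.ArrayList;
import java.util.List;

public final class LineFormatSketch {

  private LineFormatSketch() {
  }

  /** Returns the user ID, i.e. the first tab-separated token of the line. */
  public static String userID(String line) {
    return line.split("\t")[0];
  }

  /**
   * Returns the item IDs on a line, dropping everything from the first ':'
   * onward in each token (assumed here to be the "other_info" part).
   */
  public static List<String> itemIDs(String line) {
    String[] tokens = line.split("\t");
    List<String> ids = new ArrayList<String>(Math.max(0, tokens.length - 1));
    for (int i = 1; i < tokens.length; i++) {
      String token = tokens[i];
      int colon = token.indexOf(':');
      ids.add(colon < 0 ? token : token.substring(0, colon));
    }
    return ids;
  }

  public static void main(String[] args) {
    String line = "u1\ti7:clicked\ti9:viewed";
    // prints: u1 -> [i7, i9]
    System.out.println(userID(line) + " -> " + itemIDs(line));
  }
}
```

The same indexOf/substring step could be applied to tokens[tokenNum] inside processLine() before the itemCache lookup.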



On Wed, Jan 21, 2009 at 7:57 AM, Goel, Ankur <ankur.goel@corp.aol.com> wrote:
> The input data format is typically
> User-id \t item-id \t (other information)
>
> From here it can be transformed into either of the formats, as they are
> just 1 map-red away. After transformation the input data set will contain
> lines only in 1 format and not both. The data format that I use has each
> line of the form
>
> User-id \t (Item-id1:other_info) \t (Item-id2:other_info)...
>
> As for co-occurrence counting the way Ted mentioned, I implemented a
> map-red implementation for the same and I have found it to be pretty
> efficient, simple and effective too.
>
> A couple of tricks, like keeping only the top-X co-occurring items for an
> item by count and emitting only those item pairs that meet certain
> criteria, have worked very well.
>
> I would like to contribute it to Mahout and filed a JIRA for the same
> https://issues.apache.org/jira/browse/MAHOUT-103
>
> I will have a patch coming soon.
>
> What I am looking for is a complementary technique that does not depend
> so much on co-occurrences and tries to do some sort of latent variable
> analysis to answer my query.
>
> Thanks
> -Ankur
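
The co-occurrence counting and the two pruning tricks described in the quoted message can be illustrated with a small in-memory sketch. This is not the map-reduce implementation proposed in MAHOUT-103. The class CooccurrenceSketch, its methods, and the parameters topX and minCount are hypothetical, and a simple minimum count stands in for the unspecified "certain criteria". It also assumes each user's item list contains no duplicates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class CooccurrenceSketch {

  private CooccurrenceSketch() {
  }

  /** For each item, counts how often every other item shares a user's list with it. */
  public static Map<String, Map<String, Integer>> count(List<List<String>> itemsPerUser) {
    Map<String, Map<String, Integer>> counts = new HashMap<String, Map<String, Integer>>();
    for (List<String> items : itemsPerUser) {
      for (String a : items) {
        for (String b : items) {
          if (a.equals(b)) {
            continue; // an item does not co-occur with itself
          }
          Map<String, Integer> row = counts.get(a);
          if (row == null) {
            row = new HashMap<String, Integer>();
            counts.put(a, row);
          }
          Integer old = row.get(b);
          row.put(b, old == null ? 1 : old + 1);
        }
      }
    }
    return counts;
  }

  /** Keeps at most topX co-occurring items whose count is at least minCount, highest count first. */
  public static List<String> topCooccurring(Map<String, Integer> row, int topX, int minCount) {
    List<Map.Entry<String, Integer>> entries = new ArrayList<Map.Entry<String, Integer>>();
    for (Map.Entry<String, Integer> entry : row.entrySet()) {
      if (entry.getValue() >= minCount) {
        entries.add(entry);
      }
    }
    Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
      public int compare(Map.Entry<String, Integer> e1, Map.Entry<String, Integer> e2) {
        int byCount = e2.getValue().compareTo(e1.getValue()); // descending by count
        return byCount != 0 ? byCount : e1.getKey().compareTo(e2.getKey()); // ties by item ID
      }
    });
    List<String> result = new ArrayList<String>();
    for (int i = 0; i < Math.min(topX, entries.size()); i++) {
      result.add(entries.get(i).getKey());
    }
    return result;
  }

  public static void main(String[] args) {
    List<List<String>> itemsPerUser = Arrays.asList(
        Arrays.asList("a", "b", "c"),
        Arrays.asList("a", "b"),
        Arrays.asList("a", "d"));
    Map<String, Map<String, Integer>> counts = count(itemsPerUser);
    // prints: [b, c]
    System.out.println(topCooccurring(counts.get("a"), 2, 1));
  }
}
```

In a map-reduce setting the pair counting would typically happen per user in the mapper and the top-X pruning per item in the reducer, but that split is an assumption here, not a description of the actual patch.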
