hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "NewsPersonalizationSystem" by udanax
Date Fri, 11 Jan 2008 15:51:04 GMT
The following page has been changed by udanax:
http://wiki.apache.org/lucene-hadoop/NewsPersonalizationSystem

New page:
[[TableOfContents(4)]]
----
= News Personalization System =
== Initial Contributors ==

 * [:udanax:Edward Yoon] (R&D center, NHN corp.)
== Algorithm Overview ==

 * Obtain a list of candidate stories
 * For each story:
  * Obtain three scores (y,,1,,, y,,2,,, y,,3,,)
  * Final score = ∑,,i,, w,,i,, y,,i,,
 * User clustering algorithms (!MinHash (y,,1,,), PLSI (y,,2,,))
  * Score ~ number of times this story was clicked by other users in your cluster
 * Story-story covisitation (y,,3,,)
  * Score ~ number of times this story was co-clicked with other stories in your recent click history
 * User clustering done offline as batch process
  * Can be run every day or 2-3 times a week
 * Cluster-story counts maintained in real time
 * New users
  * May not be clustered
  * Rely on co-visitation to generate recommendations
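The blending step above can be sketched in Python. The weight values and the candidate format are hypothetical placeholders, since the page does not specify them:

```python
def final_score(y1, y2, y3, weights=(0.4, 0.4, 0.2)):
    """Final score = sum_i w_i * y_i over the MinHash (y1), PLSI (y2)
    and covisitation (y3) scores. The weights are placeholder values."""
    return sum(w * y for w, y in zip(weights, (y1, y2, y3)))


def rank_stories(candidates):
    """candidates: dict story_id -> (y1, y2, y3). Returns ids, best first."""
    return sorted(candidates, key=lambda s: final_score(*candidates[s]),
                  reverse=True)
```

For a new, not-yet-clustered user, y,,1,, and y,,2,, are simply 0 and the ranking falls back to covisitation, matching the "New users" bullet above.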
----
== NPS Architecture ==

{{{
                                             +----------------------------+
                                             |  News Frontend Web server  |
                                             +-----+---------------+------+
                                                   |       ↑       |
                                                Rank     Ranks   Click
                                                Request  Reply   Notify
                                                   ↓       |       ↓
                          +--------------------------+--------------------+
                          |                    Personalization Servers    |
                          +----------------------------------------+------+
                              ↑                       ↑            |
                     Reads user profile          Read Stats  Update Stats 
                              |                       |            ↓
    +-------------------------+-----+    +------------+-------------------+
    |       HBase:UserTable         |    |        HBase:StoryTable        |
    | (user clusters, click history)|    | (Cluster + covisitation count) |
    +-------------------------------+    +--------------------------------+
                              ↑
+-----------------------------+-------------------------------------------+
|                    Hadoop:User clustering MapReduce                     |
+-------------------------------------------------------------------------+
}}}
----
== Clustering Algorithms ==
=== User clustering - MinHash ===

 * Input: a user and the stories he clicked
  * S,,u,, = {s^u^,,1,,, s^u^,,2,,, ..., s^u^,,m,,}
 * User similarity (Jaccard coefficient) = |S,,u1,, ∩ S,,u2,,| / |S,,u1,, ∪ S,,u2,,|
 * Output: user clusters
  * Similar users belong to the same cluster
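The similarity above is the Jaccard coefficient over the two users' clicked-story sets; a minimal sketch:

```python
def user_similarity(clicks_u1, clicks_u2):
    """Jaccard similarity |S_u1 ∩ S_u2| / |S_u1 ∪ S_u2| of two users'
    clicked-story sets (0.0 when both sets are empty)."""
    s1, s2 = set(clicks_u1), set(clicks_u2)
    if not (s1 or s2):
        return 0.0
    return len(s1 & s2) / len(s1 | s2)
```

For example, `user_similarity({"a", "b", "c"}, {"b", "c", "d"})` is 2/4 = 0.5.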
==== MinHash ====

 * Randomly permute the universe of clicked stories
  * {s^u^,,1,,, s^u^,,2,,, ..., s^u^,,m,,} → {s',,1,,, s',,2,,, ..., s',,m,,}
 * MH(u) = min,,j,,(s^u^,,j,,), where the min is taken in the permutation order
  * P{MH(u,,1,,) = MH(u,,2,,)} = |S,,u1,, ∩ S,,u2,,| / |S,,u1,, ∪ S,,u2,,|
 * Pseudo-random permutation
  * Compute hash for each story and treat hash-value as permutation index 
 * Treat !MinHash value as !ClusterId
 * Probabilistic clustering
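The pseudo-random-permutation trick above can be sketched as: hash each story id under a seed, treat the hash value as the permutation index, and take the story with the minimum index as the cluster id. Using SHA-1 as the hash is an illustrative assumption; the page does not name one.

```python
import hashlib


def story_rank(story_id, seed):
    """Permutation index for a story: its value under a seeded hash
    (SHA-1 here, as an illustrative choice)."""
    key = f"{seed}:{story_id}".encode()
    return int.from_bytes(hashlib.sha1(key).digest()[:8], "big")


def minhash_cluster_id(clicked_stories, seed=0):
    """ClusterId = the clicked story with the minimum permutation index.
    Two users get the same id with probability equal to their Jaccard
    similarity, which is what makes the clustering probabilistic."""
    return min(clicked_stories, key=lambda s: story_rank(s, seed))
```

A common refinement (not stated on this page) is to concatenate MinHash values from several seeds to get tighter clusters.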

=== Clustering - PLSI Algorithm ===

 * Learning (done offline)
  * ML estimation
   * Learn the model parameters that maximize the likelihood of the sample data
   * Output = P[z,,j,,|u]'s and P[s|z,,j,,]'s
  * The P[z,,j,,|u]'s lead to a soft clustering of users
 * Runtime: we only use the P[z,,j,,|u]'s and ignore the P[s|z,,j,,]'s
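How the P[z,,j,,|u]'s are used at runtime is not spelled out here; one reading consistent with the cluster-click scoring above (an assumption, not the page's stated method) is to weight each cluster's click count for the candidate story by the user's soft membership:

```python
def plsi_score(p_z_given_u, story_clicks_by_cluster):
    """Membership-weighted click count for one candidate story.

    p_z_given_u: dict cluster_id -> P[z|u], learned offline.
    story_clicks_by_cluster: dict cluster_id -> clicks on this story
    by members of that cluster (maintained in real time).
    """
    return sum(p * story_clicks_by_cluster.get(z, 0)
               for z, p in p_z_given_u.items())
```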

=== Covisitation count ===

 * For each story s,,i,,, store the covisitation counts c(s,,i,,, s,,j,,) with other stories s,,j,,
 * Candidate story: s,,k,,
 * User history: s,,1,,, ..., s,,n,,
 * score(s,,i,,, s,,j,,) = c(s,,i,,, s,,j,,) / ∑,,m,, c(s,,i,,, s,,m,,)
 * total_score(s,,k,,) = ∑,,n,, score(s,,k,,, s,,n,,)
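The two formulas above translate directly to code; `covisit_counts` is a hypothetical nested-dict stand-in for the per-story rows of the HBase StoryTable:

```python
def covisit_score(candidate, click_history, covisit_counts):
    """total_score(s_k) = sum over stories s_n in the user's recent click
    history of c(s_k, s_n) / sum_m c(s_k, s_m).

    covisit_counts: dict story -> dict story -> co-click count c(s_i, s_j).
    """
    row = covisit_counts.get(candidate, {})
    denom = sum(row.values())          # sum_m c(s_k, s_m)
    if denom == 0:
        return 0.0                     # story never covisited with anything
    return sum(row.get(s, 0) for s in click_history) / denom
```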
----
== References ==
 * Google News Personalization: Scalable Online Collaborative Filtering (Das et al., WWW 2007)
 * Bigtable: A Distributed Storage System for Structured Data (Chang et al., OSDI 2006)
