mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sisir Koppaka (JIRA)" <>
Subject [jira] Updated: (MAHOUT-375) [GSOC] Restricted Boltzmann Machines in Apache Mahout
Date Sun, 11 Jul 2010 19:29:50 GMT


Sisir Koppaka updated MAHOUT-375:

    Attachment: MAHOUT-375.patch

Compiles cleanly. Doesn't do everything it's supposed to do due to misplaced function calls.
Fixing that, and will put that up after testing in a few hours.

> [GSOC] Restricted Boltzmann Machines in Apache Mahout
> -----------------------------------------------------
>                 Key: MAHOUT-375
>                 URL:
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Sisir Koppaka
>            Assignee: Grant Ingersoll
>         Attachments: MAHOUT-375.diff, MAHOUT-375.patch, MAHOUT-375.patch, MAHOUT-375.patch,
> Proposal Title: Restricted Boltzmann Machines in Apache Mahout (addresses issue Mahout-329)
> Student Name: Sisir Koppaka
> Student E-mail:
> Organization/Project:Assigned Mentor:
> Abstract
> This is a proposal to implement Restricted Boltzmann Machines in Apache Mahout as a part
of Google Summer of Code 2010. The demo for the code would be built on the Netflix dataset.
> 1 Introduction
> The Grand Prize solution to the Netflix Prize offered several new lessons in the application
of traditional machine learning techniques to very large scale datasets. The most significant
among these were the impact of temporal models, the remarkable contribution of RBM's to the
solution in the overall model, and the great success in applying ensemble models to achieve
superior predictions. The present proposal seeks to implement a conditional factored RBM[4]
in Apache Mahout as a project under Google Summer of Code 2010.
> 2 Background
> The Netflix dataset takes the form of a sparse matrix of a N X M ratings that N users
assign to M movies. Matrix decompositions such as variants of Singular Value Decompositions(SVDs)
form the first type of methods applied. This has also induced several recent works in applied
mathematics relevant to the Netflix Prize, including [1, 2]. Another genre of techniques have
been k-nearest neighbour approaches - user-user, movie-movie and using different
> distance measures such as Pearson and Cosine. The third set of techniques that offers
arguably the most divergent predictions that aid in the increase in prediction RMSE are RBM
and it's variants.
> [4] demonstrates the algorithm that the author proposes to implement this summer in Apache
Mahout. Parallelization can be done by updating all the hidden units, followed by the visible
units in parallel, due to the conditional independence of the hidden units, given a visible
binary indicator matrix. Rather than implementing a naive RBM, the conditional factored RBM
is chosen due to it's useful combination of effectiveness and speed. Minor variations, in
any case, could be developed later with little difficulty.
> The training data set consists of nearly 100 million ratings from 480,000 users on 17,770
movie titles. As part of the training data, Netflix also pro- vides validation data(called
the probe set), containing nearly 1.4 million rat- ings. In addition to the training and validation
data, Netflix also provides a test set containing 2.8 million user/movie pairs(called the
qualifying set) whose ratings were previously withheld, but have now been released post the
conclusion of the Prize.
> 3 Milestones 
> 3.1	April 26-May 24
> Community Bonding Period Certain boilerplate code for the Netflix dataset exists at
However, this code is non-distributed and is unrelated to Hadoop. Certain parts of this code,
like the file read-in based on Netflix format will be reused to match the processed Netflix
dataset file linked below.
> Test out any of the already-implemented Mahout algorithms like SVD or k-Means on the
whole dataset to make sure that everything works as ad- vertised. Make a note of testing time.
If testing time is very large, then make a 10% training set, and use the 10% probe, which
already exists as a standardized Netflix Prize community contribution. This is only so that
iterations can be faster/a multiple node Hadoop installation need not al- ways be required.
Work on the map-reduce version of RBM and evaluate if parallelization beyond the hidden units
and visible units alternate computa- tion can be implemented. Get the community's approval
for the map-reduce version of RBM, and then proceed.
> 3.2	May 24-July 12 Pre-midterm	
> Implementation time! Write code, test code, rewrite code.
> Should have working code with decent predictions by end of this segment.
> Design details	
> The RBM code would live at org.apache.mahout.classifier.rbm. would need
to be written to support the RBM similar to that in discriminative. An equivalent of
would not be required because of the pre-written Netflix read-in code as mentioned above., and would be reused as-is from
> algorithm would contain the actual conditional factored RBM algorithm. common would contain
the relevant code common to various files in algo- rithm. mapreduce.rbm would contain the
driver, mapper and reducer for the parallelized updating of the hidden units layer, followed
by the visible units, and appropriate refactored code would be placed in mapreduce.common.
> The algorithm would be implemented generically, and the demo would be on the Netflix
> 3.3	July 12-July 31 Post-midterm	
> If testing was on the 10% set, run multiple times on the whole dataset and ensure results
match. Test a two-method ensemble of SVD(already in Mahout) and RBM and confirm that RBM offers
a unique perpendicular dimensionality to the problem. Make sure unit tests are all in.
> Test on Netflix dataset linked above and prepare for demo.
> 3.4	July 31-August 16 Pencils down 
> Refactor, tune and clean code. Final demo done. Write documentation and add a wiki page.
> 4	About Myself
> Im a 19-year old student hailing from the beautiful, sea-hugging, coastal city of Visakhapatnam
in Andhra Pradesh, South India. In 2006, I was one of
> the only 110 students in the country to be awarded a scholarship from the Indian Institute
of Science and the Department of Science and Technology, Government of India, under the KVPY
programme and I attended their summer camp that year.
> I interned in the Prediction Algorithms Lab at GE Research, Bangalore, last summer. I
worked on custom-toolkit in C++ that implemented various Netflix algorithms and operated using
data parallelization for some of the more lenghty algorithms like user-user kNN. Our team
stood at rank 409 at the end of the two-month internship, when the Grand Prize was awarded.
> I have also published independent work in GECCO 2010[3]. GECCO is ACM SIGEVO's annual
conference, and is ranked 11th out of 701 interna- tional conferences in AI, ML, Robotics,
and HCI based on it's impact factor.
> I have also contributed code to FFmpeg, and was part of my Hall of Residence's award-winning
Java-based Open Soft project team that we have now open sourced. I am also an avid quizzer
and have won several prestigious quizzes during my schooling days. I also got the 4th rank
in the regional qualifications for the Indian National Mathematics Olympiad.
> References
> [1] E. J. Candes and T. Tao. The Power of Convex Relaxation: Near-Optimal Matrix Completion.
Arxiv, 2009.
> [2] R. H. Keshavan, A. Montanari, and S. Oh. Matrix Completion from Noisy Entries. Arxiv,
> [3] S. Koppaka and A. R. Hota. Superior Exploration-Exploitation Balance with Quantum-Inspired
Hadamard Walks. Proceedings of the 12th Annual Conference on Genetic and Evolutionary computation
- GECCO '10 to appear, 2010.
> [4] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann Machines for Collaborative
Filtering. Proceedings of the 24th International Conference on Machine Learning, Corvallis,
OR,2007, 6, 2007.
> Open Source Java Project -
> FFmpeg patches 

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message