mahout-dev mailing list archives

From Sisir Koppaka <sisir.kopp...@gmail.com>
Subject Re: Reg. Netflix Prize Apache Mahout GSoC Application
Date Mon, 22 Mar 2010 20:53:12 GMT
Hi,
Thanks a lot for taking the time to reply. I understand that it's important
to get the proposal right - that's why I wanted to bounce all the Netflix
possibilities - the methods I've worked with before - off this list and see
what would be of priority interest to the team. If global effects and
temporal SVD would be of interest, then I'd incorporate those into my final
proposal accordingly. On the other hand, I've read that RBM is something the
team is interested in, so I could also implement a well-performing RBM
(approximately 0.91 RMSE) for Netflix as the GSoC project. I'd like to know
which of the Netflix algorithms the Mahout team would like to see
implemented first.

Depending on the feedback, I'll prepare the final proposal. I'll definitely
work with the code now and post any queries that I get on the list.

Thanks a lot,
Best regards,
Sisir Koppaka

On Tue, Mar 23, 2010 at 1:22 AM, Robin Anil <robin.anil@gmail.com> wrote:

> Hi Sisir,
>          I am currently on vacation, so I won't be able to review your
> proposal fully. But from the looks of it, I would suggest that you target a
> somewhat narrower and more practical proposal. Trust me, converting these
> algorithms to map/reduce is not as easy as it sounds, and most of your time
> would be spent debugging your code. Your work history is quite impressive,
> but what's more important here is getting your proposal right. Sean has
> written most of the recommender code in Mahout and would be the best person
> to give you feedback, as he has tried quite a number of approaches to
> recommenders on map/reduce and knows very well some of the constraints of
> the framework. Feel free to explore the current Mahout recommender code and
> ask on the list if you find anything confusing. But remember that you are
> trying to reproduce some of the cutting-edge work in recommendations from
> the last two years in a span of 10 weeks :) so stop and ponder the
> feasibility. If you are still good to go, then you probably need to
> demonstrate something in terms of code during the proposal period (which is
> optional).
>
> Don't take this the wrong way; it's not meant to demotivate you. If we can
> get this into Mahout, I am sure no one here would object to it. So your
> best next step would be to read, explore, think, and discuss.
>
> Regards
> Robin
>
>
> On Mon, Mar 22, 2010 at 4:36 PM, Sisir Koppaka <sisir.koppaka@gmail.com> wrote:
>
> > Dear Robin & the Apache Mahout team,
> > I'm Sisir Koppaka, a third-year student from IIT Kharagpur, India. I've
> > contributed to open source projects like FFmpeg before (repository diff
> > links are here
> > <http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=16a043535b91595bf34d7e044ef398067e7443e0>
> > and here
> > <http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=9dde37a150ce2e5c53e2295d09efe289cebea9cd>),
> > and I am very interested in working on a project for Apache Mahout this
> > year (the Netflix algorithms project, to be precise, mentored by Robin).
> > Kindly let me explain my background so that I can show how it is relevant
> > in this context.
> >
> > I've done research work in meta-heuristics, including proposing the
> > equivalents of local search and mutation for quantum-inspired algorithms,
> > in my paper titled "*Superior Exploration-Exploitation Balance With
> > Quantum-Inspired Hadamard Walks*", which was accepted as a late-breaking
> > paper at GECCO 2010. We (myself and a friend - it was independent work)
> > hope to send an expanded version of the communication to a journal in the
> > near future. For this project, our language of implementation was
> > Mathematica, as we needed the combination of functional paradigms,
> > mathematically sound building blocks (such as biased random number
> > generation and simple linear programming functions), and rapid
> > prototyping ability.
> >
> > I interned at GE Research in their Computing and Decision Sciences Lab
> > <http://ge.geglobalresearch.com/technologies/computing-decision-sciences/>
> > last year, where I worked on machine learning techniques for large-scale
> > databases - specifically on the Netflix Prize itself. Over a two-month
> > internship we rose from 1800th to 409th position on the leaderboard, and
> > had implemented at least one variant of each of the major algorithms. The
> > contest ended at the same time as our internship concluded, and the
> > winning result was a combination of multiple variants of the algorithms
> > we had implemented.
> >
> > Interestingly, we did try to use Hadoop and the Map-Reduce model for this
> > purpose, based on a talk by someone from Yahoo! who visited us during
> > that time. However, not having access to a cluster proved to be an
> > impediment to fast iterative development. We had a single 16-core
> > machine, so we developed a toolkit in C++ that could run up to 16 threads
> > (data-input parallelization, rather than modifying the algorithms to suit
> > the Map-Reduce model), and implemented all our algorithms using that
> > toolkit. Specifically, SVD, kNN movie-movie, kNN user-user, and NSVD
> > (BellKor and other variants such as the Paterek SVD, and the temporal
> > SVD++ too) were the major algorithms that we implemented. Some algorithms
> > had readily available open source code for the Netflix Prize, like NSVD1,
> > so we used that as well. We also worked on certain regression schemes
> > that could improve prediction accuracy, like kernel-ridge regression and
> > its optimization.
> >
> > Towards the end, we also attempted to verify the results of the infamous
> > paper showing that IMDB-Netflix correlation could destroy privacy and
> > identify users. We imported IMDB datasets into a database, correlated the
> > IMDB entries with Netflix (we matched double the number of movies that
> > the paper mentioned), and then verified the results. We also identified
> > genre-wise trends and recorded them as such. Unfortunately, the paper
> > resulted in a lawsuit, wherein Netflix surrendered its right to hold
> > future Prizes of this kind in return for withdrawal of the charges. The
> > case effectively closed the possibility of Netflix or any other company
> > releasing similar datasets to the public pending further advances in
> > privacy-enforcement techniques, leaving the Netflix dataset as the
> > largest of its kind.
> >
> > Naturally, it would be very interesting to implement the same on Hadoop,
> > since Netflix is the largest dataset of its kind, and this would serve
> > multiple purposes - as a kickstart, as a tutorial, and as a
> > performance-testing tool for grids (using an included segment of the full
> > dataset). In addition, the Netflix and IMDB framework base code could be
> > incredibly useful as a prototyping tool for algorithm designers who want
> > to design better machine learning algorithms/ensembles for large-scale
> > databases. The lack of a decent, scalable solution for the Netflix Prize
> > that people can learn from and add to is also a major disappointment that
> > this proposal hopes to correct.
> >
> > Specifically, the challenge I am personally looking forward to is
> > implementing some of these algorithms using the Map-Reduce model, which
> > is what I missed out on last time. I am looking forward to:
> > 1. Accounting for the 12 global effects. The global effects were 12
> > corrections made to the dataset to account for time, multiple votes on a
> > single day, etc., and these alone give a major boost to the accuracy of
> > the predictions (a rough baseline-removal sketch appears just after this
> > list).
> > 2. Implementing at least one kNN-based approach. Most of these differ in
> > their parameters or in their choice of distance definitions (Pearson,
> > cosine, Euclidean, etc.); the sketch below also includes a Pearson
> > similarity example.
> > 3. Implementing at least one temporal SVD-based approach. While Mahout
> > already possesses SVD implementations - slight variations of which could
> > replicate multiple SVD approaches for the Netflix Prize - the temporal
> > SVD++ has, in my past experience, been a very promising candidate that
> > contributes a lot of orthogonal insight into the predictions.
> > 4. An RBM implementation. RBMs were found to offer distinct insights into
> > the predictions, and Mahout doesn't currently possess an implementation
> > (as Ankur C. Goel mentioned on this list a few days ago), so this is a
> > must-do for the Netflix Prize GSoC project.
> > 5. Implementing at least one regression scheme relevant to the Netflix
> > Prize as part of the boilerplate code required for this project, to
> > facilitate further additions of algorithm variants. I'd like to do
> > kernel-ridge regression, since it has, again, provided decent results on
> > the Netflix dataset compared to other options. I need to discuss this
> > part with my mentor/the Apache Mahout team, because while multiple
> > regression schemes are required for this project, I am not sure it is
> > necessary to implement this as a Map-Reduce-based scheme, *at least
> > within the scope of the GSoC project.* It is not very compute-intensive,
> > at least not compared to the central algorithms for the Netflix Prize,
> > and wouldn't be a critical limiting factor for achieving the 8%
> > improvement project goal.
> >
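> > To make items 1 and 2 a little more concrete, here is a minimal,
> > single-machine sketch (plain Java, written for illustration only; the
> > class and method names are hypothetical and not existing Mahout code) of
> > a simple baseline removal - global mean plus per-user and per-movie
> > offsets - and of a Pearson similarity between two movies' co-rated
> > vectors:
> >
> > import java.util.Map;
> >
> > /** Illustration only - not Mahout code. Simple baseline ("global
> >  *  effect") removal and Pearson similarity for a kNN recommender. */
> > public class NetflixSketch {
> >
> >   /** Baseline prediction: global mean + user offset + movie offset. */
> >   static double baseline(double globalMean,
> >                          Map<Integer, Double> userOffset,
> >                          Map<Integer, Double> movieOffset,
> >                          int user, int movie) {
> >     Double u = userOffset.get(user);
> >     Double m = movieOffset.get(movie);
> >     return globalMean + (u == null ? 0.0 : u) + (m == null ? 0.0 : m);
> >   }
> >
> >   /** Residual left once the baseline is removed; the heavier models
> >    *  (kNN, SVD++, RBM) are then trained on these residuals. */
> >   static double residual(double rating, double baselinePrediction) {
> >     return rating - baselinePrediction;
> >   }
> >
> >   /** Pearson correlation between two movies' rating vectors, aligned so
> >    *  that index i in both arrays is the same user's rating. */
> >   static double pearson(double[] x, double[] y) {
> >     int n = x.length;
> >     double sx = 0, sy = 0, sxy = 0, sxx = 0, syy = 0;
> >     for (int i = 0; i < n; i++) {
> >       sx += x[i]; sy += y[i];
> >       sxy += x[i] * y[i];
> >       sxx += x[i] * x[i]; syy += y[i] * y[i];
> >     }
> >     double num = n * sxy - sx * sy;
> >     double den = Math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
> >     return den == 0.0 ? 0.0 : num / den;
> >   }
> > }
> >
> > The real global effects involve more terms (time effects, votes-per-day
> > effects, etc.), but they are all of this subtract-a-correction form, so
> > the Map-Reduce work is mostly in computing the offsets over the full
> > dataset.
> >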
> > The idea is that items 1, 2, and 5 would not take more than two weeks,
> > with item 3 taking another two weeks, and the rest of the time would be
> > spent on item 4. A demonstrated 8% improvement in RMSE over Cinematch on
> > the Netflix dataset would be the project goal.
> >
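> > Since the 8% figure is measured as RMSE on held-out ratings, the
> > evaluation itself is simple to state. A minimal sketch (again plain Java
> > for illustration, a method that could sit in the same hypothetical sketch
> > class as above, not written as a Map-Reduce job):
> >
> >   /** Illustration only: RMSE of predictions against held-out ratings;
> >    *  the Netflix Prize scored submissions on exactly this metric. */
> >   static double rmse(double[] predicted, double[] actual) {
> >     double sumSquaredError = 0.0;
> >     for (int i = 0; i < predicted.length; i++) {
> >       double err = predicted[i] - actual[i];
> >       sumSquaredError += err * err;
> >     }
> >     return Math.sqrt(sumSquaredError / predicted.length);
> >   }
> >
> > An 8% improvement over Cinematch then simply means an RMSE 8% lower than
> > Cinematch's RMSE on the same qualifying set.
> >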
> > There are plenty of references about the Grand Prize solution - Yehuda
> > Koren's final summary
> > <http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf> is a
> > great starting point. However, when I last worked on the problem this
> > wasn't available, so I referred to papers by Arik Paterek and some of
> > BellKor's earlier papers.
> >
> > Looking forward to a discussion about the proposal,
> > Thanks a lot,
> > Best regards,
> > Sisir Koppaka
> >
> > --
> > SK
> >
>



-- 
SK
