Hi all,
The updated proposal for GSoC 2010 is as follows. Any comments are welcome.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Title/Summary:
Linear SVM Package (LIBLINEAR) for Mahout
Student: ZhenDong Zhao
Student email: zhaozd@comp.nus.edu.sg
Student Major: Multimedia Information Retrieval / Computer Science
Student Degree: Master
Student Graduation: NUS’10
Organization: Hadoop
0 Abstract
Linear Support Vector Machines (SVMs) are useful in applications with
large-scale datasets or datasets with high-dimensional features. This
proposal will port one of the best-known linear SVM solvers, LIBLINEAR [1],
to Mahout, behind the same unified interface as Pegasos [2] on Mahout, which
is another linear SVM solver that I have almost finished implementing. The
two distinct contributions would be: 1) introducing LIBLINEAR to Mahout;
2) unified interfaces for linear SVM classifiers.
1 Motivation
As one of the top 10 algorithms in the data mining community [3], the Support
Vector Machine is a powerful machine learning tool that is widely adopted in
data mining, pattern recognition and information retrieval.
SVM training is slow, however, especially on large-scale datasets. Several
recent papers propose SVM solvers with a linear kernel that can handle
large-scale learning problems, for instance LIBLINEAR [1] and Pegasos [2].
I have implemented a prototype linear SVM classifier based on Pegasos [2]
for Mahout (issue: MAHOUT-232). Nevertheless, as the winner of the linear SVM
track of the ICML 2008 large-scale learning challenge
(http://largescale.first.fraunhofer.de/summary/), LIBLINEAR [1] ought to be
incorporated into Mahout too. Currently, the LIBLINEAR package supports:

L2-regularized classifiers: L2-loss linear SVM, L1-loss linear SVM, and
logistic regression (LR)

L1-regularized classifiers: L2-loss linear SVM and logistic regression (LR)
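
To make the solver list above concrete, the sketch below evaluates the
objective each combination minimizes: a regularizer on w plus C times the
summed per-example loss. It is only an illustration, not LIBLINEAR or Mahout
code. The function names are hypothetical, the bias term is omitted, and
Python is used purely for brevity (the Mahout implementation would be Java).
The sketch accepts any regularizer/loss pairing, whereas the list above names
only L2-loss SVM and LR for L1 regularization.

```python
import math

def dot(w, x):
    """Dense dot product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))

def loss(name, margin):
    """Per-example loss as a function of the margin y * (w . x)."""
    if name == "l1_svm":   # L1-loss SVM: hinge loss
        return max(0.0, 1.0 - margin)
    if name == "l2_svm":   # L2-loss SVM: squared hinge loss
        return max(0.0, 1.0 - margin) ** 2
    if name == "lr":       # logistic regression: logistic loss
        return math.log(1.0 + math.exp(-margin))
    raise ValueError("unknown loss: " + name)

def objective(w, data, C, loss_name, reg="l2"):
    """Regularizer plus C times the summed loss over (x, y), y in {-1, +1}."""
    if reg == "l2":
        penalty = 0.5 * dot(w, w)            # (1/2) * ||w||_2^2
    elif reg == "l1":
        penalty = sum(abs(wi) for wi in w)   # ||w||_1
    else:
        raise ValueError("unknown regularizer: " + reg)
    return penalty + C * sum(loss(loss_name, y * dot(w, x)) for x, y in data)
```

For example, objective(w, data, 1.0, "l2_svm", reg="l1") evaluates the
L1-regularized L2-loss SVM objective for a weight vector w.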
The main features of LIBLINEAR are as follows:

Multi-class classification: 1) one-vs-the-rest, 2) Crammer & Singer

Cross validation for model selection

Probability estimates (logistic regression only)

Weights for unbalanced data
*All of these features will be implemented except probability estimates and
weights for unbalanced data* (if time permits, I would like to implement
those as well).
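
As a rough sketch of the cross-validation feature listed above, the code
below splits a dataset into k folds and averages the held-out score. The
names k_fold_indices, cross_validate, train_fn and accuracy_fn are
hypothetical, and the folds are contiguous for simplicity (an assumption; a
real implementation would normally shuffle the data first). Python is used
only for brevity.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, k, train_fn, accuracy_fn):
    """Average held-out accuracy of train_fn over k folds.

    train_fn(train_examples) returns a model;
    accuracy_fn(model, test_examples) returns a number in [0, 1].
    """
    scores = []
    for held_out in k_fold_indices(len(data), k):
        held = set(held_out)
        train = [d for i, d in enumerate(data) if i not in held]
        test = [data[i] for i in held_out]
        scores.append(accuracy_fn(train_fn(train), test))
    return sum(scores) / k
```

Model selection then amounts to calling cross_validate once per candidate
parameter value (for example, several values of C) and keeping the value
with the best average score.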
2 Unified Interfaces
The Pegasos-based linear SVM classifier package for Mahout already provides
the following functionality *(http://issues.apache.org/jira/browse/MAHOUT-232)*:

Sequential binary classification (two-class classification), including
sequential training and prediction;

Sequential regression;

Parallel and sequential multi-class classification, including One-vs.-One
and One-vs.-Others schemes.
The functionality of the Pegasos package on Mahout and that of LIBLINEAR are
evidently quite similar. In this section I introduce a unified interface for
linear SVM classifiers on Mahout, which will incorporate both Pegasos and
LIBLINEAR. The overall picture of the interface is illustrated in Figure 1.
The unified interface has two main parts: 1) dataset loader; 2) algorithms.
I will introduce them separately.
*2.1 Data Handler*
The dataset can be stored on a personal computer or on a Hadoop cluster. The
framework provides a high-performance Random Loader and Sequential Loader
for accessing large-scale data.
Figure 1: The framework of linear SVM on Mahout
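
The sketch below illustrates the difference between the two loaders of
Section 2.1. It assumes the sparse "label index:value ..." text format that
LIBLINEAR reads; the proposal does not fix a format, so that is an
assumption, and the names parse_line, sequential_loader and RandomLoader are
hypothetical. The sequential loader streams examples one at a time, while
the random loader keeps them in memory so that a stochastic solver can
sample them. Python is used only for brevity.

```python
import random

def parse_line(line):
    """Parse 'label idx:val idx:val ...' into (label, {idx: val})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        idx, val = item.split(":")
        features[int(idx)] = float(val)
    return label, features

def sequential_loader(lines):
    """Yield examples one at a time, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield parse_line(line)

class RandomLoader:
    """Keep parsed examples in memory and hand them out at random."""

    def __init__(self, lines, seed=0):
        self.examples = list(sequential_loader(lines))
        self.rng = random.Random(seed)

    def sample(self):
        """Return one uniformly chosen example."""
        return self.rng.choice(self.examples)
```

Because sequential_loader accepts any iterable of lines, it works equally
well on an open local file or on a stream read from the cluster.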
*2.2 Sequential Algorithms*
Sequential algorithms will include binary classification and regression
based on Pegasos and LIBLINEAR, behind a unified interface.
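
A minimal sketch of what such a unified interface might look like follows.
The class and method names are hypothetical and are not taken from the
MAHOUT-232 patch. The Pegasos trainer follows the basic stochastic
sub-gradient step from [2] (step size 1/(lambda*t), without the optional
projection step); a LIBLINEAR-backed class would implement the same train
method and is not shown. Python is used only for brevity.

```python
import random
from abc import ABC, abstractmethod

class LinearClassifier(ABC):
    """Hypothetical unified interface: every solver exposes train and predict."""

    def __init__(self):
        self.w = []

    @abstractmethod
    def train(self, examples):
        """Fit self.w from a list of (x, y) pairs with y in {-1, +1}."""

    def decision(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def predict(self, x):
        return 1 if self.decision(x) >= 0 else -1

class PegasosClassifier(LinearClassifier):
    """Basic Pegasos: one randomly chosen example per iteration."""

    def __init__(self, lam=0.01, iterations=2000, seed=0):
        super().__init__()
        self.lam = lam
        self.iterations = iterations
        self.rng = random.Random(seed)

    def train(self, examples):
        self.w = [0.0] * len(examples[0][0])
        for t in range(1, self.iterations + 1):
            x, y = self.rng.choice(examples)
            eta = 1.0 / (self.lam * t)
            # Test the margin with the current w, before shrinking it.
            violated = y * self.decision(x) < 1.0
            self.w = [(1.0 - eta * self.lam) * wi for wi in self.w]
            if violated:
                self.w = [wi + eta * y * xi for wi, xi in zip(self.w, x)]
```

Code written against LinearClassifier (prediction, evaluation, model
selection) then does not need to know which solver produced the weights.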
*2.3 Parallel Algorithms*
It is widely accepted that parallelizing a binary SVM classifier is hard.
For multi-class classification, however, a coarse-grained scheme (e.g. each
Mapper or Reducer runs one independent binary SVM classifier) can achieve a
large improvement much more easily. Cross validation for model selection can
also take advantage of such coarse-grained parallelism. I will introduce a
unified interface for all of them.
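
The coarse-grained scheme can be sketched as follows. Each class gets its
own independent one-vs-rest training task, and the tasks share no state, so
they can be handed to separate workers. A local thread pool stands in for
Hadoop Mappers here, and the binary trainer is a plain perceptron used as a
placeholder rather than Pegasos or LIBLINEAR; all names are hypothetical.
Python is used only for brevity.

```python
from concurrent.futures import ThreadPoolExecutor

def train_binary(examples, epochs=20):
    """Placeholder binary trainer (a plain perceptron without bias term)."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, y in examples:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def one_vs_rest_task(args):
    """One independent unit of work: relabel for a single class, then train."""
    label, data = args
    relabeled = [(x, 1 if y == label else -1) for x, y in data]
    return label, train_binary(relabeled)

def train_one_vs_rest(data, workers=4):
    """Train one binary model per class, each in its own worker."""
    labels = sorted({y for _, y in data})
    tasks = [(label, data) for label in labels]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(one_vs_rest_task, tasks))

def predict(models, x):
    """Pick the class whose binary model scores x highest."""
    def score(label):
        return sum(wi * xi for wi, xi in zip(models[label], x))
    return max(models, key=score)
```

Cross validation parallelizes the same way: each fold (or each candidate
parameter value) becomes one independent task.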
3 Biography:
I am a graduating master's student in Multimedia Information Retrieval System
at the National University of Singapore. My research has involved
large-scale SVM classifiers.
I have worked with Hadoop and MapReduce for the past year, and I have
dedicated much of my spare time to a sequential SVM (Pegasos) for Mahout
*(http://issues.apache.org/jira/browse/MAHOUT-232)*. I have also taken part
in setting up and maintaining a Hadoop cluster of around 70 nodes in our
group.
4 Timeline:
Weeks 1-4: Implement binary classifier
Weeks 5-6: Implement parallel multi-class classification and cross
validation for model selection
Weeks 7-8: Interface refactoring and performance tuning
Weeks 9-10: Clean up / prepare for end of GSoC
References
[1] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen
Lin. LIBLINEAR: A library for large linear classification. J. Mach. Learn.
Res., 9:1871–1874, 2008.
[2] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal
estimated sub-gradient solver for SVM. In ICML ’07: Proceedings of the 24th
International Conference on Machine Learning, pages 807–814, New York, NY,
USA, 2007. ACM.
[3] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang,
Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu,
Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. Top 10
algorithms in data mining. Knowl. Inf. Syst., 14(1):1–37, 2007.

ZhenDong Zhao (Maxim)
<><<><><><><><><><>><><><><><>>>>>>
Department of Computer Science
School of Computing
National University of Singapore
>>>>>>><><><><><><><><<><>><><<<<<<
