mahout-dev mailing list archives

From zhao zhendong <>
Subject Updated Proposal (LIBLINEAR on Mahout) for GSoC 2010
Date Fri, 12 Mar 2010 10:48:49 GMT
Hi all,

The updated proposal for GSoC 2010 is as follows; any comments are welcome.

Linear SVM Package (LIBLINEAR) for Mahout

Student: Zhen-Dong Zhao
Student e-mail:
Student Major: Multimedia Information Retrieval / Computer Science
Student Degree: Master
Student Graduation: NUS'10
Organization: Hadoop

0 Abstract

Linear Support Vector Machines (SVMs) are useful in applications with
large-scale datasets or high-dimensional features. This proposal will port
LIBLINEAR [1], one of the best-known linear SVM solvers, to Mahout, giving it
the same unified interface as Pegasos [2] on Mahout, another linear SVM
solver whose implementation I have almost finished. The two distinct
contributions would be: 1) introducing LIBLINEAR to Mahout; 2) unified
interfaces for linear SVM classifiers.

1 Motivation

As one of the top 10 algorithms in the data mining community [3], the Support
Vector Machine is a very powerful machine learning tool, widely adopted in
the data mining, pattern recognition and information retrieval domains.

SVM training is slow, however, especially on large-scale datasets. Several
recent papers propose SVM solvers with a linear kernel that can handle
large-scale learning problems, for instance LIBLINEAR [1] and Pegasos [2]. I
have implemented a prototype linear SVM classifier based on Pegasos [2] for
Mahout (issue: MAHOUT-232). Nevertheless, as the winner of the ICML 2008
large-scale learning challenge (linear SVM track), LIBLINEAR [1] ought to be
incorporated into Mahout too. Currently, the LIBLINEAR package supports:


   L2-regularized classifiers: L2-loss linear SVM, L1-loss linear SVM, and
   logistic regression (LR)

   L1-regularized classifiers: L2-loss linear SVM and logistic regression (LR)
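
For reference, the three loss functions named in the list above, and the
regularized objective they plug into, can be sketched in a few lines. This is
only an illustration in plain Python (chosen for brevity; a Mahout port would
be written in Java). The function names are hypothetical and are not part of
LIBLINEAR or Mahout. Here y is a label in {-1, +1} and score is the decision
value w·x.

```python
import math

def l1_loss(y, score):
    """Hinge loss (the 'L1-loss' SVM): max(0, 1 - y * score)."""
    return max(0.0, 1.0 - y * score)

def l2_loss(y, score):
    """Squared hinge loss (the 'L2-loss' SVM)."""
    return max(0.0, 1.0 - y * score) ** 2

def logistic_loss(y, score):
    """Logistic regression loss: log(1 + exp(-y * score))."""
    return math.log1p(math.exp(-y * score))

def objective(w, data, C, loss, regularizer="l2"):
    """Regularized objective: R(w) + C * (sum of per-example losses).

    data is a list of (x, y) pairs, where x is a dense list of floats.
    regularizer "l2" gives 0.5 * ||w||^2; anything else gives ||w||_1.
    """
    if regularizer == "l2":
        reg = 0.5 * sum(wi * wi for wi in w)
    else:
        reg = sum(abs(wi) for wi in w)
    total = 0.0
    for x, y in data:
        score = sum(wi * xi for wi, xi in zip(w, x))
        total += loss(y, score)
    return reg + C * total
```

Swapping the loss argument or the regularizer flag switches between the
classifier variants listed above without touching the rest of the code.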

The main features of LIBLINEAR are the following:


   Multi-class classification: 1) one-vs-the-rest, 2) Crammer & Singer

   Cross validation for model selection

   Probability estimates (logistic regression only)

   Weights for unbalanced data

*All of these functionalities are to be implemented except probability
estimates and weights for unbalanced data* (time permitting, I would like to
implement those as well).

2 Unified Interfaces

The Pegasos-based linear SVM classifier package on Mahout already provides
the following functionality:


   Sequential binary classification (two-class classification), including
   sequential training and prediction;

   Sequential regression;

   Parallel and sequential multi-class classification, including the
   One-vs.-One and One-vs.-Others schemes.

The functionality of the Pegasos package on Mahout is quite similar to that
of LIBLINEAR. In this section I introduce a unified interface for linear SVM
classifiers on Mahout, which will incorporate both Pegasos and LIBLINEAR. The
overall picture of the interfaces is illustrated in Figure 1:

The unified interface has two main parts: 1) dataset loader; 2) algorithms.
I will introduce them separately.

*2.1 Data Handler*

The dataset can be stored on a personal computer or on a Hadoop cluster. The
framework provides a high-performance Random Loader and a Sequential Loader
for accessing large-scale data.
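
To make the two access patterns concrete, here is a minimal sketch in plain
Python (for brevity; the Mahout code itself would be Java). It assumes the
sparse "label index:value ..." text format used by LIBLINEAR, which this
proposal does not actually fix, and a local file rather than HDFS. The names
parse_line, sequential_loader and random_loader are hypothetical.

```python
import random

def parse_line(line):
    """Parse one 'label index:value index:value ...' line into (label, features)."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features

def sequential_loader(lines):
    """Yield examples one at a time, in order, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield parse_line(line)

def random_loader(path, num_samples, seed=0):
    """Yield num_samples examples drawn uniformly at random, with replacement.

    One pass records the byte offset of every non-blank line; each draw then
    seeks straight to a recorded offset, so only the offsets stay in memory.
    """
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            if line.strip():
                offsets.append(pos)
        rng = random.Random(seed)
        for _ in range(num_samples):
            f.seek(rng.choice(offsets))
            yield parse_line(f.readline().decode("utf-8"))
```

A stochastic solver such as Pegasos would consume random_loader, while a
single scan for prediction would consume sequential_loader over an open file.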

 Figure 1: The framework of linear SVM on Mahout

*2.2 Sequential Algorithms*

Sequential algorithms will include binary classification and regression based
on Pegasos and LIBLINEAR, with a unified interface.
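
One possible shape for such a unified interface is sketched below in plain
Python (for brevity; the real classes would be Java and may look quite
different). The class names LinearClassifier and PegasosClassifier and the
parameters lam, iterations and seed are hypothetical. The solver is the basic
Pegasos update from [2] without the optional projection step; a LIBLINEAR
solver would be a second subclass behind the same train/predict methods.

```python
import random
from abc import ABC, abstractmethod

class LinearClassifier(ABC):
    """Hypothetical common interface shared by all linear solvers."""

    def __init__(self, dim):
        self.w = [0.0] * dim

    @abstractmethod
    def train(self, data):
        """Fit self.w on a list of (x, y) pairs with y in {-1, +1}."""

    def decision_value(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def predict(self, x):
        return 1 if self.decision_value(x) >= 0 else -1

class PegasosClassifier(LinearClassifier):
    """Basic Pegasos: stochastic sub-gradient descent on the hinge loss."""

    def __init__(self, dim, lam=0.01, iterations=1000, seed=0):
        super().__init__(dim)
        self.lam = lam
        self.iterations = iterations
        self.rng = random.Random(seed)

    def train(self, data):
        for t in range(1, self.iterations + 1):
            x, y = self.rng.choice(data)          # one random example per step
            eta = 1.0 / (self.lam * t)            # step size 1 / (lambda * t)
            margin = y * self.decision_value(x)   # computed before the update
            shrink = 1.0 - eta * self.lam
            self.w = [shrink * wi for wi in self.w]
            if margin < 1.0:                      # margin violated: move toward y * x
                self.w = [wi + eta * y * xi for wi, xi in zip(self.w, x)]
        return self
```

Code that only depends on LinearClassifier (evaluation, cross validation,
multi-class wrappers) then works unchanged with either solver.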

*2.3 Parallel Algorithms*

It is widely accepted that parallelizing a binary SVM classifier is hard. For
multi-class classification, however, a coarse-grained scheme (e.g. each
Mapper or Reducer runs one independent binary SVM classifier) makes a
substantial improvement easier to achieve. Cross validation for model
selection can also take advantage of such coarse-grained parallelism. I will
introduce a unified interface for all of them.
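
The coarse-grained scheme can be illustrated with a small one-vs-rest sketch
in plain Python. Several things here are assumptions made for illustration
only: a thread pool stands in for Hadoop Mapper tasks, a plain perceptron
stands in for the binary solver (it is neither Pegasos nor LIBLINEAR), and
all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def train_binary(data, epochs=20):
    """Stand-in binary trainer (a plain perceptron, not Pegasos or LIBLINEAR)."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def train_one_vs_rest_task(task):
    """One independent task: relabel the data for a single class, then train."""
    target, data = task
    relabeled = [(x, 1 if label == target else -1) for x, label in data]
    return target, train_binary(relabeled)

def train_one_vs_rest(data, max_workers=4):
    """Train one binary classifier per class; the tasks share no state."""
    classes = sorted({label for _, label in data})
    tasks = [(c, data) for c in classes]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(train_one_vs_rest_task, tasks))

def predict(models, x):
    """Pick the class whose classifier gives the largest decision value."""
    return max(models, key=lambda c: sum(wi * xi for wi, xi in zip(models[c], x)))
```

Because each task needs only its own class label and a copy of the data, the
same pattern carries over to cross validation, where each fold (or each
candidate parameter value) becomes one task.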

3 Biography:

I am a graduating master's student in Multimedia Information Retrieval
Systems at the National University of Singapore. My research has involved
large-scale SVM classifiers.

I have worked with Hadoop and MapReduce for the past year, and I have
dedicated much of my spare time to the sequential SVM (Pegasos) on Mahout.

I have taken part in
setting up and maintaining a Hadoop cluster with around 70 nodes in our

4 Timeline:

Weeks 1-4: Implement binary classifier

Weeks 5-6: Implement parallel multi-class classification and cross
validation for model selection

Weeks 7-8: Interface refactoring and performance tuning

Weeks 9-10: Clean-up and preparation for the end of GSoC


[1] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen
Lin. LIBLINEAR: A library for large linear classification. J. Mach. Learn.
Res., 9:1871–1874, 2008.

[2] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal
estimated sub-gradient solver for SVM. In ICML ’07: Proceedings of the 24th
International Conference on Machine Learning, pages 807–814, New York, NY,
USA, 2007. ACM.

[3] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang,
Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu,
Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. Top 10
algorithms in data mining. Knowl. Inf. Syst., 14(1):1–37, 2007.


Zhen-Dong Zhao (Maxim)


Department of Computer Science
School of Computing
National University of Singapore

