mahout-dev mailing list archives

From "zhao zhendong (JIRA)" <j...@apache.org>
Subject [jira] Updated: (MAHOUT-334) Proposal for GSoC2010 (Linear SVM for Mahout)
Date Sat, 14 Aug 2010 08:59:17 GMT

     [ https://issues.apache.org/jira/browse/MAHOUT-334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhao zhendong updated MAHOUT-334:
---------------------------------

    Attachment: Mahout-issue334.patch

Documentation and code cleanup.

>>>>>>>>>>>>>
Conclusion 
>>>>>>>>>>>>>
Implementing Liblinear on Mahout taught us a lesson.

1) The current parallel framework is not suitable for Liblinear:
Unlike the implementation of Pegasos (MAHOUT-232), Liblinear requires the whole data set
to optimize the primal or dual problem. The pros and cons are as follows:

Pros:
All the classifiers in Liblinear are quite stable: whenever we train a data set with a given
classifier, we always get exactly the same objective value and accuracy.

Cons:
Requiring the whole data set limits the usability of Liblinear, especially when we need
to train a classifier on an extremely large-scale data set.

Because Liblinear needs all the training samples in each training pass, it is difficult
to leverage the MapReduce framework to improve performance, including for multi-class
classification and parameter selection.

Although I have built a parallel multi-class classifier using the same framework as
Pegasos (MAHOUT-232), I believe a parallel Liblinear would be of little use within such a
framework. A sketch of that coarse-grained scheme follows.
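
As a minimal sketch (assuming a hypothetical LiblinearTrainer/SvmModel pair standing in
for the port's training code; the rest is the standard Hadoop API), the coarse-grained
scheme assigns each class label to one map task, which trains a complete binary
one-vs-rest sub-problem:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // One map task per class label; the data is replicated, not partitioned.
    public class OneVsRestMapper extends Mapper<IntWritable, Text, IntWritable, Text> {

      @Override
      protected void map(IntWritable classLabel, Text trainingDataPath, Context ctx)
          throws IOException, InterruptedException {
        // Each mapper must load the ENTIRE training set, because Liblinear's
        // solvers optimize over all samples at once -- exactly the limitation
        // discussed above.
        LiblinearTrainer trainer = new LiblinearTrainer();        // hypothetical
        trainer.loadDataset(trainingDataPath.toString());

        // Relabel: +1 for this mapper's class, -1 for all the others, then
        // solve the binary sub-problem sequentially.
        SvmModel model = trainer.trainBinary(classLabel.get());   // hypothetical

        ctx.write(classLabel, new Text(model.serialize()));
      }
    }

The speed-up is bounded by the number of classes, and every task pays the full I/O cost of
the data set, which is why such a scheme stays coarse-grained.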

2) The L1-regularized classifiers (L2-loss linear SVM and logistic regression) are not
suitable for large-scale data sets:
Both solvers transpose the whole data matrix (one row per sample, one column per feature)
so that the features can be accessed column-wise. In this sense, they can hardly be applied
to large-scale data; the sketch below illustrates the cost.
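
As a rough illustration (plain Java maps stand in here for whichever sparse vector class
the port uses), transposing a row-major sparse data set materializes a second copy of
every non-zero entry, indexed by feature instead of by sample:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class TransposeSketch {

      // Row-major input: one sparse map (featureId -> value) per sample.
      // Column-major output: one list of (sampleId, value) pairs per feature.
      // Every non-zero is duplicated, so memory roughly doubles before training
      // even starts -- prohibitive when the data barely fits as-is.
      public static Map<Integer, List<double[]>> transpose(List<Map<Integer, Double>> rows) {
        Map<Integer, List<double[]>> columns = new HashMap<>();
        for (int sampleId = 0; sampleId < rows.size(); sampleId++) {
          for (Map.Entry<Integer, Double> nz : rows.get(sampleId).entrySet()) {
            columns.computeIfAbsent(nz.getKey(), k -> new ArrayList<>())
                   .add(new double[] {sampleId, nz.getValue()});
          }
        }
        return columns;
      }
    }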

3) Future work:
The next step for Liblinear on Mahout is to focus on sequential learning with large-scale
data sets. One interesting paper on this direction was just published at KDD 2010:
http://www.csie.ntu.edu.tw/~cjlin/papers/kdd_disk_decomposition.pdf
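
The idea in that paper is block minimization: split the data into blocks that fit in
memory, keep them on disk, and update the model one block at a time. A minimal sketch of
that training loop, with hypothetical DataBlock and BlockSolver helpers standing in for
the real per-block I/O and optimization code:

    import java.io.IOException;
    import java.util.List;

    // Block minimization sketch: the model stays in memory, the data does not.
    public class BlockTrainer {

      public double[] train(List<String> blockFiles, int numFeatures, int outerIterations)
          throws IOException {
        double[] weights = new double[numFeatures];
        BlockSolver solver = new BlockSolver();               // hypothetical solver
        for (int iter = 0; iter < outerIterations; iter++) {
          for (String blockFile : blockFiles) {
            // Load one block from disk, run inner optimization steps on its
            // samples only, then discard it. Memory use is bounded by the
            // largest block, not by the whole data set.
            DataBlock block = DataBlock.readFrom(blockFile);  // hypothetical loader
            solver.update(weights, block);
          }
        }
        return weights;
      }
    }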

> Proposal for GSoC2010 (Linear SVM for Mahout)
> ---------------------------------------------
>
>                 Key: MAHOUT-334
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-334
>             Project: Mahout
>          Issue Type: Task
>    Affects Versions: 0.4
>            Reporter: zhao zhendong
>            Assignee: Robin Anil
>             Fix For: 0.4
>
>         Attachments: Mahout-issue334-0.2.patch, Mahout-issue334-0.3.patch, Mahout-issue334-0.5.patch,
> Mahout-issue334.patch, Mahout-issue334.patch, Utils_LibsvmFormat_Convertor.patch
>
>
> Title/Summary: Linear SVM Package (LIBLINEAR) for Mahout
> Student: Zhen-Dong Zhao
> Student e-mail: zhaozd@comp.nus.edu.sg
> Student Major: Multimedia Information Retrieval /Computer Science
> Student Degree: Master        Student Graduation: NUS'10           Organization: Hadoop
> 0 Abstract
> Linear Support Vector Machines (SVMs) are very useful in applications with large-scale
> datasets or datasets with high-dimensional features. This proposal will port one of the
> most famous linear SVM solvers, LIBLINEAR [1], to Mahout with the same unified interface
> as Pegasos [2] on Mahout, another linear SVM solver that I have almost finished. The two
> distinct contributions would be: 1) introducing LIBLINEAR to Mahout; 2) a unified
> interface for linear SVM classifiers.
> 1 Motivation
> As one of the top 10 algorithms in the data mining community [3], the Support Vector
> Machine is a very powerful machine learning tool, widely adopted in data mining, pattern
> recognition, and information retrieval.
> The SVM training procedure is quite slow, however, especially on large-scale datasets.
> Several recent works propose SVM solvers with a linear kernel that can handle large-scale
> learning problems, for instance LIBLINEAR [1] and Pegasos [2]. I have implemented a
> prototype linear SVM classifier based on Pegasos [2] for Mahout (issue: MAHOUT-232).
> Nevertheless, as the winner of the ICML 2008 large-scale learning challenge, linear SVM
> track (http://largescale.first.fraunhofer.de/summary/), LIBLINEAR [1] should be
> incorporated into Mahout too. Currently, the LIBLINEAR package supports the following
> formulations (their objectives are sketched after the list):
>   (1) L2-regularized classifiers: L2-loss linear SVM, L1-loss linear SVM, and logistic
> regression (LR)
>   (2) L1-regularized classifiers: L2-loss linear SVM and logistic regression (LR)
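> For reference, these formulations differ only in the regularizer and the loss. In LaTeX
> notation, with training pairs (x_i, y_i), y_i in {+1, -1} (the standard LIBLINEAR-style
> objectives):
>
>     % L2-regularized form, where the loss \xi is the hinge (L1) loss,
>     % the squared-hinge (L2) loss, or the logistic loss:
>     \min_{w}\ \tfrac{1}{2} w^{\top} w + C \sum_{i=1}^{l} \xi(w; x_i, y_i),
>     \qquad \xi \in \Bigl\{ \max(0, 1 - y_i w^{\top} x_i),\
>     \max(0, 1 - y_i w^{\top} x_i)^2,\ \log\bigl(1 + e^{-y_i w^{\top} x_i}\bigr) \Bigr\}
>
>     % The L1-regularized variants swap the regularizer:
>     \min_{w}\ \|w\|_1 + C \sum_{i=1}^{l} \xi(w; x_i, y_i)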
> The main features of LIBLINEAR are the following:
>   (1) Multi-class classification: 1) one-vs-the-rest, 2) Crammer & Singer
>   (2) Cross validation for model selection
>   (3) Probability estimates (logistic regression only)
>   (4) Weights for unbalanced data
> All of these functionalities are supposed to be implemented, except probability estimates
> and weights for unbalanced data (if time permits, I would like to do those as well).
> 2 Unified Interfaces
> The linear SVM classifier based on the Pegasos package on Mahout
> (http://issues.apache.org/jira/browse/MAHOUT-232) already provides the following
> functionality:
>   (1) Sequential binary (two-class) classification, including sequential training and
> prediction;
>   (2) Sequential regression;
>   (3) Parallel and sequential multi-class classification, including One-vs.-One and
> One-vs.-Others schemes (the Pegasos update rule itself is recalled below).
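> For context, the core of Pegasos [2] is a single stochastic sub-gradient step per sample,
> which is what makes it so friendly to sequential training. In LaTeX notation (the
> standard form from the paper), one step on example (x_t, y_t) with rate
> \eta_t = 1/(\lambda t) is:
>
>     w_{t+1} = (1 - \eta_t \lambda)\, w_t
>             + \mathbf{1}\bigl[\, y_t \langle w_t, x_t \rangle < 1 \,\bigr]\, \eta_t\, y_t\, x_t
>     % optionally followed by projection onto the ball of radius 1/\sqrt{\lambda}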
> Clearly, the functionality of the Pegasos package on Mahout and that of LIBLINEAR are
> quite similar. As mentioned above, in this section I introduce a unified interface for
> linear SVM classifiers on Mahout, which will incorporate both Pegasos and LIBLINEAR.
> The unified interface has two main parts: 1) a dataset loader; 2) the algorithms. I will
> introduce them separately; a sketch of the two interfaces follows.
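> As a minimal sketch of what such an interface could look like (all type names here are
> hypothetical illustrations, not the committed Mahout API, except Mahout's own Vector):
>
>     import org.apache.mahout.math.Vector;  // Mahout's sparse/dense vector type
>
>     // Part 1: dataset loader -- local file system or HDFS behind one interface.
>     public interface DatasetLoader {
>       boolean hasNext();
>       LabeledSample next();  // hypothetical (label, feature-vector) pair
>     }
>
>     // Part 2: algorithms -- the Pegasos and LIBLINEAR solvers would both
>     // implement this, so callers never care which solver is underneath.
>     public interface LinearClassifier {
>       void train(DatasetLoader data);
>       double predict(Vector features);  // returns the decision value
>     }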
> 2.1 Data Handler
> The dataset can be stored on a personal computer or on a Hadoop cluster. This framework
> provides a high-performance random loader and a sequential loader for accessing
> large-scale data.
> 2.2 Sequential Algorithms
> The sequential algorithms will include binary classification and regression, based on
> Pegasos and LIBLINEAR, behind the unified interface.
> 2.3 Parallel Algorithms
> It is widely accepted that parallelizing a binary SVM classifier is hard. For multi-class
> classification, however, a coarse-grained scheme (e.g. each mapper or reducer trains one
> independent binary SVM classifier) achieves a large improvement far more easily. Cross
> validation for model selection can also exploit the same coarse-grained parallelism, as
> sketched below. I will introduce a unified interface for all of them.
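> A minimal sketch of that coarse-grained cross validation (again with a hypothetical
> LiblinearTrainer; each map input names one (fold, C) pair, so the parallelism is over
> independent training runs, not over the data):
>
>     import java.io.IOException;
>     import org.apache.hadoop.io.DoubleWritable;
>     import org.apache.hadoop.io.Text;
>     import org.apache.hadoop.mapreduce.Mapper;
>
>     // One map task per (fold, regularization parameter C) combination.
>     public class CrossValidationMapper extends Mapper<Text, Text, Text, DoubleWritable> {
>
>       @Override
>       protected void map(Text foldAndC, Text datasetPath, Context ctx)
>           throws IOException, InterruptedException {
>         String[] parts = foldAndC.toString().split(",");   // e.g. "3,0.5"
>         int fold = Integer.parseInt(parts[0]);
>         double c = Double.parseDouble(parts[1]);
>
>         // Train on every fold except 'fold', then evaluate on 'fold'; both
>         // steps run the ordinary sequential solver (hypothetical methods).
>         LiblinearTrainer trainer = new LiblinearTrainer(c);
>         trainer.loadDatasetExcludingFold(datasetPath.toString(), fold);
>         double accuracy = trainer.train().evaluateOnFold(datasetPath.toString(), fold);
>
>         ctx.write(foldAndC, new DoubleWritable(accuracy));
>       }
>     }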
> 3 Biography:
> I am a graduating master's student working on multimedia information retrieval at the
> National University of Singapore. My research has involved large-scale SVM classifiers.
> I have worked with Hadoop and MapReduce for about a year, and I have dedicated much of my
> spare time to the sequential SVM (Pegasos) for Mahout
> (http://issues.apache.org/jira/browse/MAHOUT-232). I have taken part in setting up and
> maintaining a Hadoop cluster with around 70 nodes in our group.
> 4 Timeline:
> Weeks 1-4 (May 24 ~ June 18): Implement binary classifier 
> Weeks 5-7 (June 21 ~ July 12): Implement parallel multi-class classification and cross
> validation for model selection.
> Week 8 (July 12 ~ July 16): Submit mid-term evaluation.
> Weeks 9-11 (July 16 ~ August 9): Interface refactoring and performance tuning.
> Weeks 11-12 (August 9 ~ August 16): Code cleanup, documentation, and testing.
> 5 References
> [1] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin.
> LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res., 9:1871-1874,
> 2008.
> [2] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated
> sub-gradient solver for SVM. In ICML '07: Proceedings of the 24th International Conference
> on Machine Learning, pages 807-814, New York, NY, USA, 2007. ACM.
> [3] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda,
> Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach,
> David J. Hand, and Dan Steinberg. Top 10 algorithms in data mining. Knowl. Inf. Syst.,
> 14(1):1-37, 2008.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

