[ https://issues.apache.org/jira/browse/MAHOUT334?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=12894552#action_12894552
]
zhao zhendong commented on MAHOUT-334:

Could we load the data set into HBase so that it can be accessed randomly later on?
Here is the overall data loading behavior of LIBLINEAR:
for I = 1 : maximum iteration
    shuffle the samples (note: only subject to permutation)
    for J = 1 : all samples in the dataset, in random order
        solve the optimization problem
    end
end
Normally, it takes 3 to 4 iterations.
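As a rough illustration of the loop above, here is a minimal Python sketch. It assumes the per-sample step is a dual coordinate descent update for an L1-loss linear SVM, which is one of the solver types LIBLINEAR offers. The function name `train_dual_cd`, its parameters, and the dense list-of-lists input are hypothetical choices made for this example, not LIBLINEAR's API.

```python
import random


def train_dual_cd(samples, labels, C=1.0, max_iter=4, seed=0):
    """Illustrative dual coordinate descent for an L1-loss linear SVM.

    samples: list of dense feature lists; labels: list of +1 / -1.
    All names here are assumptions made for the sketch.
    """
    rng = random.Random(seed)
    n, dim = len(samples), len(samples[0])
    w = [0.0] * dim          # primal weight vector
    alpha = [0.0] * n        # one dual variable per sample
    order = list(range(n))
    for _ in range(max_iter):
        # Permutation only: every sample is visited exactly once per pass.
        rng.shuffle(order)
        for i in order:
            x, y = samples[i], labels[i]
            q = sum(v * v for v in x)
            if q == 0.0:
                continue  # an all-zero sample cannot change w
            # Gradient of the dual objective with respect to alpha[i].
            grad = y * sum(wj * xj for wj, xj in zip(w, x)) - 1.0
            # One-variable sub-problem, clipped to the box [0, C].
            new_alpha = min(max(alpha[i] - grad / q, 0.0), C)
            delta = new_alpha - alpha[i]
            if delta != 0.0:
                alpha[i] = new_alpha
                w = [wj + delta * y * xj for wj, xj in zip(w, x)]
    return w
```

The inner loop touches samples in a random order, which is why random access to the data set matters here.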
By the way, an interesting paper was just published at KDD 2010:
http://www.csie.ntu.edu.tw/~cjlin/papers/kdd_disk_decomposition.pdf
They split the data set into sequential chunks (blocks) and load one block of data into memory at a time.
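The block-wise access pattern could look roughly like the sketch below. This is only my reading of the idea, not the paper's algorithm: it assumes one sample per line in a text file, and the names `iter_blocks`, `visit_samples`, `block_size`, and `passes` are hypothetical.

```python
import random


def iter_blocks(path, block_size):
    """Yield consecutive blocks of at most block_size lines from a text file."""
    block = []
    with open(path) as handle:
        for line in handle:
            block.append(line.rstrip("\n"))
            if len(block) == block_size:
                yield block
                block = []
    if block:
        yield block  # last, possibly shorter, block


def visit_samples(path, block_size, passes, seed=0):
    """Yield samples pass by pass, holding only one block in memory at a time."""
    rng = random.Random(seed)
    for _ in range(passes):
        for block in iter_blocks(path, block_size):
            # Random order inside the block; the blocks themselves are read sequentially.
            rng.shuffle(block)
            for sample in block:
                yield sample
```

Compared with a full shuffle, this only randomizes the order within each block, so disk reads stay sequential.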
> Proposal for GSoC2010 (Linear SVM for Mahout)
> 
>
> Key: MAHOUT-334
> URL: https://issues.apache.org/jira/browse/MAHOUT334
> Project: Mahout
> Issue Type: Task
> Affects Versions: 0.4
> Reporter: zhao zhendong
> Assignee: Robin Anil
> Fix For: 0.4
>
> Attachments: Mahoutissue3340.2.patch, Mahoutissue3340.3.patch, Mahoutissue3340.5.patch,
Utils_LibsvmFormat_Convertor.patch
>
>
> Title/Summary: Linear SVM Package (LIBLINEAR) for Mahout
> Student: ZhenDong Zhao
> Student email: zhaozd@comp.nus.edu.sg
> Student Major: Multimedia Information Retrieval / Computer Science
> Student Degree: Master
> Student Graduation: NUS'10
> Organization: Hadoop
> 0 Abstract
> Linear Support Vector Machines (SVMs) are useful in applications with large-scale
datasets or datasets with high-dimensional features. This proposal will port one of the
best-known linear SVM solvers, LIBLINEAR [1], to Mahout, with the same unified interface as
Pegasos [2] on Mahout, another linear SVM solver that I have almost finished. The two distinct
contributions would be: 1) introducing LIBLINEAR to Mahout; 2) unified interfaces for linear
SVM classifiers.
> 1 Motivation
> As one of the top 10 algorithms in the data mining community [3], the Support Vector Machine is a very
powerful machine learning tool, widely adopted in the data mining, pattern recognition, and
information retrieval domains.
> The SVM training procedure is slow, however, especially on large-scale
datasets. Several recent papers propose SVM solvers with a linear kernel that can handle
large-scale learning problems, for instance LIBLINEAR [1] and Pegasos [2]. I have implemented
a prototype linear SVM classifier based on Pegasos [2] for Mahout (issue: MAHOUT-232).
Nevertheless, as the winner of the ICML 2008 large-scale learning challenge (linear SVM track,
http://largescale.first.fraunhofer.de/summary/), LIBLINEAR [1] ought to be incorporated
into Mahout too. Currently, the LIBLINEAR package supports:
> (1) L2-regularized classifiers: L2-loss linear SVM, L1-loss linear SVM, and logistic
regression (LR)
> (2) L1-regularized classifiers: L2-loss linear SVM and logistic regression (LR)
> The main features of LIBLINEAR are as follows:
> (1) Multi-class classification: 1) one-vs-the-rest, 2) Crammer & Singer
> (2) Cross validation for model selection
> (3) Probability estimates (logistic regression only)
> (4) Weights for unbalanced data
> All of these functionalities are to be implemented except probability estimates and weights
for unbalanced data (time permitting, I would like to implement those as well).
> 2 Unified Interfaces
> The Pegasos-based linear SVM classifier package for Mahout already provides the following functionality
(http://issues.apache.org/jira/browse/MAHOUT232):
> (1) Sequential binary classification (two-class classification), including sequential
training and prediction;
> (2) Sequential regression;
> (3) Parallel & sequential multi-class classification, including One-vs.-One and One-vs.-Others
schemes.
> The functionalities of the Pegasos package on Mahout and of LIBLINEAR are quite
similar to each other. As mentioned above, in this section I will introduce a unified interface
for linear SVM classifiers on Mahout, which will incorporate both Pegasos and LIBLINEAR.
> The unified interface has two main parts: 1) dataset loader; 2) algorithms. I will introduce
them separately.
> 2.1 Data Handler
> The dataset can be stored on a personal computer or on a Hadoop cluster. This framework provides
a high-performance Random Loader and Sequential Loader for accessing large-scale data.
> 2.2 Sequential Algorithms
> Sequential algorithms will include binary classification and regression based on Pegasos
and LIBLINEAR, with a unified interface.
> 2.3 Parallel Algorithms
> It is widely accepted that parallelizing a binary SVM classifier is hard. For multi-class classification,
however, a coarse-grained scheme (e.g. each Mapper or Reducer runs one independent binary SVM
classifier) makes a large improvement easier to achieve. Cross validation for model selection
can also take advantage of such coarse-grained parallelism. I will introduce a unified interface
for all of them.
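The coarse-grained scheme described above could be sketched as follows, in plain Python rather than as a Hadoop job. Each task trains one independent one-vs-rest binary classifier, the way a single Mapper or Reducer would. The binary trainer is a plain perceptron used purely as a placeholder for Pegasos or LIBLINEAR, and the names `train_binary` and `train_one_vs_rest` are hypothetical. Threads stand in for cluster workers only to show the task decomposition; they are not meant to demonstrate a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def train_binary(samples, targets, epochs=100):
    """Placeholder binary trainer (a perceptron), standing in for a real SVM solver."""
    w = [0.0] * len(samples[0])
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, targets):
            if y * sum(wj * xj for wj, xj in zip(w, x)) <= 0:
                # Misclassified (or on the boundary): move w toward y * x.
                w = [wj + y * xj for wj, xj in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            break  # every training sample is on the correct side
    return w


def train_one_vs_rest(samples, labels, workers=4):
    """Train one independent binary classifier per class, one task per class."""
    classes = sorted(set(labels))

    def task(cls):
        # Relabel: the current class is +1, every other class is -1.
        targets = [1 if label == cls else -1 for label in labels]
        return cls, train_binary(samples, targets)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(task, classes))
```

Because the per-class tasks share no state, the same decomposition applies to the folds of cross validation.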
> 3 Biography:
> I am a graduating master's student in Multimedia Information Retrieval Systems at the National
University of Singapore. My research has involved large-scale SVM classifiers.
> I have worked with Hadoop and MapReduce for about a year, and I have dedicated much
of my spare time to the sequential SVM (Pegasos) for Mahout (http://issues.apache.org/jira/browse/MAHOUT232).
I have also taken part in setting up and maintaining a Hadoop cluster of around 70 nodes in our
group.
> 4 Timeline:
> Weeks 1-4 (May 24 ~ June 18): Implement binary classifier
> Weeks 5-7 (June 21 ~ July 12): Implement parallel multi-class classification and
cross validation for model selection.
> Week 8 (July 12 ~ July 16): Submission of midterm evaluation
> Weeks 9-11 (July 16 ~ August 9): Interface refactoring and performance tuning
> Weeks 11-12 (August 9 ~ August 16): Code cleanup, documentation, and testing.
> 5 References
> [1] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR:
A library for large linear classification. J. Mach. Learn. Res., 9:1871-1874, 2008.
> [2] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient
solver for SVM. In ICML '07: Proceedings of the 24th International Conference on Machine Learning,
pages 807-814, New York, NY, USA, 2007. ACM.
> [3] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda,
Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach,
David J. Hand, and Dan Steinberg. Top 10 algorithms in data mining. Knowl. Inf. Syst., 14(1):1-37,
2007.

