mahout-dev mailing list archives

From zhao zhendong <zhaozhend...@gmail.com>
Subject Re: Need comments on Proposal for linear SVM framework (Google Summer of Code 2010)
Date Sun, 21 Feb 2010 09:49:56 GMT
Hi,

See my responses below:

On Sun, Feb 21, 2010 at 3:53 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> This seems like a good idea for a project, but I see two issues:
>
> a) it seems very ambitious for one summer.  This is good and bad.  Good
> because you are excited and want to accomplish something grand, bad if it is
> too ambitious and would cause you to officially fail while still
> accomplishing parts of good things.  Perhaps my perception is due to item
>

That's true. Do you think porting LIBLINEAR to Mahout would be enough for
this proposal? I really don't know how big is big enough :) If so, I can
move the rest to future work.

> (b) and you have a more limited goal than it seems.
>
> b) there doesn't seem to be a specific goal.  You say "introduce a unifying
> framework", but this is a little bit non-specific.  Do you mean to augment
> your Pegasos implementation by adding a Java-based liblinear
> implementation?  Or do you just mean to build a framework that would ALLOW
> somebody else to call each of these uniformly?
>

Yes, I will make this part more specific. I meant that I will add a
Java-based LIBLINEAR implementation to the current package and allow users to
call the solvers (Pegasos, LIBLINEAR, etc.) through a unified interface (data
pre-processor, loader and command line).
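
A minimal sketch of what such a unified interface could look like. All names
here (LinearSvmTrainer, PerceptronTrainer, SolverRegistry) are hypothetical and
are not part of Mahout or LIBLINEAR, and the toy perceptron only stands in for
a real solver such as Pegasos:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical common contract that every linear solver would implement. */
interface LinearSvmTrainer {
  /** Trains on dense rows x with labels y in {-1, +1}; returns the weight vector. */
  double[] train(double[][] x, int[] y);
}

/** Toy stand-in solver: plain perceptron passes, NOT Pegasos or LIBLINEAR. */
class PerceptronTrainer implements LinearSvmTrainer {
  private final int epochs;

  PerceptronTrainer(int epochs) {
    this.epochs = epochs;
  }

  @Override
  public double[] train(double[][] x, int[] y) {
    double[] w = new double[x[0].length];
    for (int e = 0; e < epochs; e++) {
      for (int i = 0; i < x.length; i++) {
        // Update only on examples that are misclassified or on the boundary.
        if (y[i] * dot(w, x[i]) <= 0) {
          for (int j = 0; j < w.length; j++) {
            w[j] += y[i] * x[i][j];
          }
        }
      }
    }
    return w;
  }

  static double dot(double[] a, double[] b) {
    double sum = 0.0;
    for (int j = 0; j < a.length; j++) {
      sum += a[j] * b[j];
    }
    return sum;
  }
}

/** Lets a command-line front end pick a solver by name. */
class SolverRegistry {
  private final Map<String, LinearSvmTrainer> solvers = new HashMap<>();

  void register(String name, LinearSvmTrainer trainer) {
    solvers.put(name, trainer);
  }

  LinearSvmTrainer lookup(String name) {
    LinearSvmTrainer trainer = solvers.get(name);
    if (trainer == null) {
      throw new IllegalArgumentException("Unknown solver: " + name);
    }
    return trainer;
  }
}
```

With this shape, the loader and the command line would depend only on
LinearSvmTrainer, so adding a new solver means registering one more
implementation.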


> c) Liblinear is in C++.  Mahout is committed to portability and currently
> has no C++ code.  What is your plan?
>
In Java. Benedikt has already re-implemented LIBLINEAR in Java:
http://www.bwaldvogel.de/liblinear-java/
I want to port this code to Mahout, using Mahout Collections, etc.


> On Sat, Feb 20, 2010 at 10:00 AM, zhao zhendong <zhaozhendong@gmail.com> wrote:
>
> > Hi all,
> >
> > Robin told me about this great chance to keep contributing code here (many
> > thanks to Robin). Since I am still working on the sequential SVM
> > (MAHOUT-232) and would like to extend it into a unified framework that
> > incorporates other state-of-the-art linear SVM classifiers, I propose a
> > "Linear Support Vector Machine (SVM) Framework based on Mahout".
> >
> > I would appreciate any comments! :)
> >
> > Cheers,
> > Zhendong
> >
> >
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >
> >  Linear Support Vector Machine (SVM) Framework based on Mahout
> >
> > Proposal for Google Summer of Code 2010
> >
> > Abstract
> >
> > This proposal introduces a linear Support Vector Machine (SVM) framework to
> > Mahout. The framework unifies diverse algorithms, such as Pegasos [2] and
> > LIBLINEAR [1], on top of Mahout. The contribution is twofold: 1) a unified
> > framework for linear SVM classifiers; 2) the introduction of LIBLINEAR to
> > Mahout.
> >
> > 1 Motivation
> >
> > The Support Vector Machine is a powerful machine learning tool that is
> > widely adopted in the data mining, pattern recognition and information
> > retrieval communities. It was recently chosen as one of the top 10
> > algorithms in data mining [3].
> >
> > SVM training is slow, especially when the number of samples is huge.
> > Several recent papers propose linear SVM solvers that can handle
> > large-scale learning problems, for instance LIBLINEAR [1] and Pegasos [2].
> > I have already implemented a prototype linear SVM classifier based on
> > Pegasos [2]. LIBLINEAR [1], as the winner of the ICML 2008 large-scale
> > learning challenge (linear SVM track,
> > http://largescale.first.fraunhofer.de/summary/), should be incorporated
> > into Mahout as well. Currently, the LIBLINEAR package supports:
> >
> > ·       L2-regularized classifiers [L2-loss linear SVM, L1-loss linear SVM,
> > and logistic regression (LR)]
> >
> > ·       L1-regularized classifiers [L2-loss linear SVM and logistic
> > regression (LR)]
> >
> > The main features of LIBLINEAR are the following:
> >
> > ·       Multi-class classification: 1) one-vs-the-rest, 2) Crammer & Singer
> >
> > ·       Cross validation for model selection
> >
> > ·       Probability estimates (logistic regression only)
> >
> > ·       Weights for unbalanced data
> >
> > The Pegasos-based linear SVM classifier package on Mahout
> > <http://issues.apache.org/jira/browse/MAHOUT-232> provides the following
> > functionalities:
> >
> > ·       Sequential binary classification (two-class classification),
> > including sequential training and prediction;
> >
> > ·       Sequential regression;
> >
> > ·       Parallel and sequential multi-class classification, including
> > One-vs.-One and One-vs.-Others schemes.
> >
> > A unified framework for linear SVM classifiers should therefore be
> > introduced into the Mahout platform.
> >
> > 2 Framework
> >
> > In this section I propose a linear SVM classifier framework for Mahout
> > that will incorporate Pegasos and LIBLINEAR.
> > <http://issues.apache.org/jira/browse/MAHOUT-228> The overall framework is
> > illustrated in Figure 1.
> >
> > The framework has two main parts: 1) data access and pre-processing;
> > 2) algorithms. I will introduce them separately.
> >
> > 2.1 Data Processing Layer
> >
> > The dataset can be stored on a personal computer or on a Hadoop cluster.
> > The framework provides a high-performance Random Loader and Sequential
> > Loader for accessing large-scale data. These loaders support the sequential
> > vector format, the Gson format and the raw dataset format (the same as
> > SVMlight and LIBSVM).
> >
> >
> >
> > Figure 1: The framework of linear SVM on Mahout
> >
> > 2.2 Sequential Algorithms
> >
> > The sequential algorithms will include binary classification and regression
> > based on Pegasos and LIBLINEAR, behind a unified interface.
> >
> > 2.3 Parallel Algorithms
> >
> > It is widely accepted that parallelizing a binary SVM classifier is hard.
> > For multi-class classification, however, a coarse-grained scheme (e.g. each
> > Mapper or Reducer runs one independent binary SVM classifier) can achieve a
> > large improvement more easily. Cross validation for model selection can
> > also take advantage of such coarse-grained parallelism. I will introduce a
> > unified interface for all of them.
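
The coarse-grained scheme can be sketched on a single machine, with a thread
pool standing in for Hadoop Mappers: each task trains one independent
one-vs-rest binary classifier. This is only an illustration under stated
assumptions. OneVsRestSketch is a hypothetical name, and the toy perceptron
replaces a real binary solver such as Pegasos or LIBLINEAR:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class OneVsRestSketch {
  /** Toy binary trainer (perceptron passes); a stand-in for a real SVM solver. */
  static double[] trainBinary(double[][] x, int[] y, int epochs) {
    double[] w = new double[x[0].length];
    for (int e = 0; e < epochs; e++) {
      for (int i = 0; i < x.length; i++) {
        if (y[i] * dot(w, x[i]) <= 0) {
          for (int j = 0; j < w.length; j++) {
            w[j] += y[i] * x[i][j];
          }
        }
      }
    }
    return w;
  }

  static double dot(double[] a, double[] b) {
    double sum = 0.0;
    for (int j = 0; j < a.length; j++) {
      sum += a[j] * b[j];
    }
    return sum;
  }

  /** Trains one binary model per class; the tasks share no mutable state. */
  static double[][] trainOneVsRest(final double[][] x, final int[] labels,
                                   int numClasses, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<double[]>> futures = new ArrayList<>();
      for (int c = 0; c < numClasses; c++) {
        final int positive = c;
        Callable<double[]> task = () -> {
          // Relabel: the current class is +1, every other class is -1.
          int[] y = new int[labels.length];
          for (int i = 0; i < labels.length; i++) {
            y[i] = labels[i] == positive ? 1 : -1;
          }
          return trainBinary(x, y, 20);
        };
        futures.add(pool.submit(task));
      }
      double[][] models = new double[numClasses][];
      for (int c = 0; c < numClasses; c++) {
        models[c] = futures.get(c).get();
      }
      return models;
    } catch (InterruptedException | ExecutionException e) {
      throw new RuntimeException(e);
    } finally {
      pool.shutdown();
    }
  }

  /** Predicts the class whose binary model gives the highest score. */
  static int predict(double[][] models, double[] x) {
    int best = 0;
    for (int c = 1; c < models.length; c++) {
      if (dot(models[c], x) > dot(models[best], x)) {
        best = c;
      }
    }
    return best;
  }
}
```

Because the per-class tasks are independent, the same structure would map to
one Mapper per class, and cross-validation folds could be distributed the same
way.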
> >
> > References
> >
> > [1] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen
> > Lin. LIBLINEAR: A library for large linear classification. J. Mach. Learn.
> > Res., 9:1871–1874, 2008.
> >
> > [2] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal
> > estimated sub-gradient solver for SVM. In ICML ’07: Proceedings of the 24th
> > international conference on Machine learning, pages 807–814, New York, NY,
> > USA, 2007. ACM.
> >
> > [3] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang,
> > Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu,
> > Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. Top 10
> > algorithms in data mining. Knowl. Inf. Syst., 14(1):1–37, 2007.
> > --
> > -------------------------------------------------------------
> >
> > Zhen-Dong Zhao (Maxim)
> >
> > <><<><><><><><><><>><><><><><>>>>>>
> >
> > Department of Computer Science
> > School of Computing
> > National University of Singapore
> >
> > >>>>>>><><><><><><><><<><>><><<<<<<
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>



-- 
-------------------------------------------------------------

Zhen-Dong Zhao (Maxim)

<><<><><><><><><><>><><><><><>>>>>>

Department of Computer Science
School of Computing
National University of Singapore

>>>>>>><><><><><><><><<><>><><<<<<<
