Date: Wed, 10 Feb 2010 15:13:31 +0000 (UTC)
From: "zhao zhendong (JIRA)"
To: mahout-dev@lucene.apache.org
Subject: [jira] Commented: (MAHOUT-232) Implementation of sequential SVM solver based on Pegasos

    [ https://issues.apache.org/jira/browse/MAHOUT-232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832022#action_12832022 ]

zhao zhendong commented on MAHOUT-232:
--------------------------------------

Hi Sean,

For MAHOUT-232, I expect to finish the code style checking *by the end of this week (revised based on Robin's comments)*. I don't know whether it can be pushed into 0.3, but I just want you guys to know the progress of this issue.

Cheers,
Zhendong

--
-------------------------------------------------------------
Zhen-Dong Zhao (Maxim)
Department of Computer Science
School of Computing
National University of Singapore


> Implementation of sequential SVM solver based on Pegasos
> ---------------------------------------------------------
>
>                 Key: MAHOUT-232
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-232
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.4
>            Reporter: zhao zhendong
>             Fix For: 0.4
>
>         Attachments: SequentialSVM_0.1.patch, SequentialSVM_0.2.2.patch, SequentialSVM_0.3.patch, SequentialSVM_0.4.patch
>
>
> After discussing with people in this community, I decided to re-implement a sequential SVM solver based on Pegasos for the Mahout platform (Mahout command-line style, SparseMatrix, SparseVector, etc.). Eventually, it will support HDFS.
>
> Sequential SVM based on Pegasos.
> Maxim Zhao (zhaozhendong at gmail dot com)
>
> -------------------------------------------------------------------------------------------
> Currently, this package provides the following features:
> -------------------------------------------------------------------------------------------
> 1. Sequential SVM linear solver, including both training and testing.
> 2. Support for both the general (local) file system and HDFS.
> 3. Support for training on large-scale data sets. Because Pegasos only needs to sample a
>    certain number of examples, the package pre-fetches just that many samples (e.g., the
>    maximum number of iterations) into memory. For example, if the data set has 100,000,000
>    samples and the default maximum number of iterations is 10,000, the package randomly
>    loads only 10,000 samples into memory. (A rough sketch of this sampling-based update
>    appears below, after the training usage examples.)
> 4. Sequential data set testing, so the package supports large-scale data sets for both
>    training and testing.
> 5. Parallel classification (testing phase only) based on the MapReduce framework.
> 6. Multi-class classification based on the MapReduce framework (fully parallelized version).
> 7. Regression.
> -------------------------------------------------------------------------------------------
> TODO:
> -------------------------------------------------------------------------------------------
> 1. Multi-class classification probability prediction
> 2. Performance testing
> -------------------------------------------------------------------------------------------
> Usage:
> -------------------------------------------------------------------------------------------
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Classification:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> @@ Training: @@
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> SVMPegasosTraining.java
> The default arguments are:
> -tr ../examples/src/test/resources/svmdataset/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model
> ~~~~~~~~~~~~~~~~~~~~~~
> @ For the case that the training data set is on HDFS: @
> ~~~~~~~~~~~~~~~~~~~~~~
> 1. Make sure that your training data set has been uploaded to HDFS:
>    hadoop-work-space# bin/hadoop fs -ls path-of-train-dataset
> 2. Revise the arguments:
>    -tr /user/hadoop/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model -hdfs hdfs://localhost:12009
> ~~~~~~~~~~~~~~~~~~~~~~
> @ Multi-class training [based on the MapReduce framework]: @
> ~~~~~~~~~~~~~~~~~~~~~~
> bin/hadoop jar mahout-core-0.3-SNAPSHOT.job org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelMultiClassifierTrainDriver -if /user/maximzhao/dataset/protein -of /user/maximzhao/protein -m /user/maximzhao/proteinmodel -s 1000000 -c 3 -nor 3 -ms 923179 -mhs -Xmx1000M -ttt 1080
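>
> For reference, the sequential training step that Pegasos performs can be sketched as follows.
> This is only a minimal illustrative sketch: the class and method names are hypothetical, it
> uses dense arrays instead of Mahout's SparseVector, and it is not the code in the attached patch.
>
>   import java.util.Random;
>
>   // Minimal, self-contained sketch of the Pegasos update (hypothetical class; not in the patch).
>   public class PegasosSketch {
>
>     // x: training instances (dense for brevity), y: labels in {-1, +1},
>     // lambda: regularization constant, maxIter: number of stochastic iterations.
>     static double[] train(double[][] x, double[] y, double lambda, int maxIter) {
>       int dim = x[0].length;
>       double[] w = new double[dim];
>       Random rnd = new Random();
>       for (int t = 1; t <= maxIter; t++) {
>         int i = rnd.nextInt(x.length);        // sample one example uniformly at random
>         double eta = 1.0 / (lambda * t);      // decaying step size 1/(lambda * t)
>         double margin = y[i] * dot(w, x[i]);  // margin under the current weights
>         for (int d = 0; d < dim; d++) {
>           w[d] *= 1.0 - eta * lambda;         // regularization shrinks w every iteration
>         }
>         if (margin < 1.0) {                   // positive hinge loss: step toward the example
>           for (int d = 0; d < dim; d++) {
>             w[d] += eta * y[i] * x[i][d];
>           }
>         }
>         double radius = 1.0 / Math.sqrt(lambda);
>         double norm = Math.sqrt(dot(w, w));
>         if (norm > radius) {                  // optional projection onto the ball of radius 1/sqrt(lambda)
>           for (int d = 0; d < dim; d++) {
>             w[d] *= radius / norm;
>           }
>         }
>       }
>       return w;
>     }
>
>     private static double dot(double[] a, double[] b) {
>       double s = 0.0;
>       for (int d = 0; d < a.length; d++) {
>         s += a[d] * b[d];
>       }
>       return s;
>     }
>   }
>
> Because each iteration draws only one random example, at most maxIter distinct samples ever
> need to be resident in memory, which is what feature 3 above exploits.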
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> @@ Testing: @@
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> SVMPegasosTesting.java
> The arguments are hard-coded in this file; if you want to customize them yourself, please uncomment the first line of the main function.
> The default arguments are:
> -te ../examples/src/test/resources/svmdataset/test.dat -m ../examples/src/test/resources/svmdataset/SVM.model
> ~~~~~~~~~~~~~~~~~~~~~~
> @ Parallel Testing (Classification): @
> ~~~~~~~~~~~~~~~~~~~~~~
> ParallelClassifierDriver.java
> bin/hadoop jar mahout-core-0.3-SNAPSHOT.job org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelClassifierDriver -if /user/maximzhao/dataset/rcv1_test.binary -of /user/maximzhao/rcv.result -m /user/maximzhao/rcv1.model -nor 1 -ms 241572968 -mhs -Xmx500M -ttt 1080
> ~~~~~~~~~~~~~~~~~~~~~~
> @ Parallel multi-class classification: @
> ~~~~~~~~~~~~~~~~~~~~~~
> bin/hadoop jar mahout-core-0.3-SNAPSHOT.job org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelMultiClassPredictionDriver -if /user/maximzhao/dataset/protein.t -of /user/maximzhao/proteinpredictionResult -m /user/maximzhao/proteinmodel -c 3 -nor 1 -ms 2226917 -mhs -Xmx1000M -ttt 1080
> Note: the parameter -ms 241572968 is obtained from the equation: ms = input file size / number of mappers.
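>
> For illustration, the per-record work in the parallel testing phase can be sketched as a Hadoop
> mapper that scores each input vector against a linear model. This is a hypothetical, minimal
> sketch using the org.apache.hadoop.mapreduce API; the class name, the dense comma-separated
> input format, and the configuration key are made up for the sketch and are not the classes or
> formats used by ParallelClassifierDriver above.
>
>   import java.io.IOException;
>   import org.apache.hadoop.io.LongWritable;
>   import org.apache.hadoop.io.Text;
>   import org.apache.hadoop.mapreduce.Mapper;
>
>   // Hypothetical map-side classification sketch; not the internals of ParallelClassifierDriver.
>   public class ClassifySketchMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
>
>     private double[] w;  // linear model, loaded once per mapper
>
>     @Override
>     protected void setup(Context context) {
>       // For this sketch the weight vector is passed as a comma-separated string in the job
>       // configuration under a made-up key; the real driver distributes the -m model file instead.
>       String[] parts = context.getConfiguration().get("sketch.svm.weights").split(",");
>       w = new double[parts.length];
>       for (int d = 0; d < parts.length; d++) {
>         w[d] = Double.parseDouble(parts[d]);
>       }
>     }
>
>     @Override
>     protected void map(LongWritable offset, Text line, Context context)
>         throws IOException, InterruptedException {
>       // Assume one dense, comma-separated feature vector per input line.
>       String[] parts = line.toString().split(",");
>       double score = 0.0;
>       for (int d = 0; d < parts.length && d < w.length; d++) {
>         score += w[d] * Double.parseDouble(parts[d]);
>       }
>       // The sign of <w, x> gives the predicted binary label.
>       context.write(offset, new Text(score >= 0.0 ? "+1" : "-1"));
>     }
>   }
>
> In the sketch, the predicted label is simply the sign of <w, x>; the actual drivers above also
> handle the model path (-m), the multi-class case (-c), and the remaining job parameters.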
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Regression:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> SVMPegasosTraining.java
> -tr ../examples/src/test/resources/svmdataset/abalone_scale -m ../examples/src/test/resources/svmdataset/SVMregression.model -s 1
> -------------------------------------------------------------------------------------------
> Experimental Results:
> -------------------------------------------------------------------------------------------
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Classification:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set:
> name           | source   | type           | class | training size | testing size | feature
> -----------------------------------------------------------------------------------------------
> rcv1.binary    | [DL04b]  | classification | 2     | 20,242        | 677,399      | 47,236
> covtype.binary | UCI      | classification | 2     | 581,012       |              | 54
> a9a            | UCI      | classification | 2     | 32,561        | 16,281       | 123
> w8a            | [JP98a]  | classification | 2     | 49,749        | 14,951       | 300
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set       | Accuracy | Training Time | Testing Time  |
> rcv1.binary    | 94.67%   | 19 Sec        | 2 min 25 Sec  |
> covtype.binary |          | 19 Sec        |               |
> a9a            | 84.72%   | 14 Sec        | 12 Sec        |
> w8a            | 89.8%    | 14 Sec        | 8 Sec         |
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Parallel Classification (Testing)
> Data set       | Accuracy | Training Time | Testing Time            |
> rcv1.binary    | 94.98%   | 19 Sec        | 3 min 29 Sec (one node) |
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Parallel Multi-class Classification Based on the MapReduce Framework:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set:
> name    | source   | type           | class | training size | testing size | feature
> -----------------------------------------------------------------------------------------------
> poker   | UCI      | classification | 10    | 25,010        | 1,000,000    | 10
> protein | [JYW02a] | classification | 3     | 17,766        | 6,621        | 357
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set | Accuracy vs. (Libsvm with linear kernel)
> poker    | 50.14% vs. (49.952%)
> protein  | 68.14% vs. (64.93%)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Regression:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set:
> name      | source  | type       | training size | testing size | feature
> -----------------------------------------------------------------------------------------------
> abalone   | UCI     | regression | 4,177         |              | 8
> triazines | UCI     | regression | 186           |              | 60
> cadata    | StatLib | regression | 20,640        |              | 8
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set  | Mean Squared Error vs. (Libsvm with linear kernel) | Training Time | Test Time |
> abalone   | 6.01 vs. (5.25)                                     | 13 Sec        |           |
> triazines | 0.031 vs. (0.0276)                                  | 14 Sec        |           |
> cadata    | 5.61e+10 vs. (1.40e+10)                             | 20 Sec        |           |

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.