From: Robin Anil
Date: Tue, 2 Mar 2010 03:03:49 +0530
Message-ID: <7d7600c51003011333k5d503164q3d07bb894e15089@mail.gmail.com>
Subject: Classifier Architecture
To: mahout-dev

I am kicking off this discussion on how we are going to integrate RF, NB, CNB, WINNOW, SGD and SVM. Phew! Since I wrote NB and CNB, I will list (in subsequent emails, in blocks) what my assumptions were and how it integrates with HDFS and HBase, and how training, testing, online and batch classification are done.

From what I think right now:

- Apart from SGD, everything else is a batch trainer. SGD is the only pure online trainer cum classifier.
- NB/CNB was designed as a binary feature classifier (as per the paper, it was on text) and does multi-label classification as a simple score comparison.
- NB/CNB uses only tokens as features, so there is no need to convert text features to integers back and forth. Randomizers remove this limitation, if we go ahead with only that.
- NB/CNB uses the tokens as the row and column bytes when looking up in the HBase table.
- SVM and others use threshold values for each feature to decide the cutting plane.
- RF uses vectors to store labels and does not use the multi-label vector at the moment.

Questions:

- Interfaces (what are they going to look like?)
  - Trainer
  - Classifier (binary) (multi-label classification)
  - Test
  - Ensemble (bagging, boosting?)
- What is the basic storage interface everyone should use? Matrix? Then we can have an HDFS-backed matrix, an HBase-backed matrix and an in-memory matrix.
- If the basic storage could be different (I mean, a decision tree is not a matrix), what is the fixed input/output format?
- How can we extend the test setup, like the confusion matrix, to capture info from all classifiers?
- If we make some assumptions now, what will we do when classifiers like HMM and CRF come into the picture? They need more than just vectors; they also need the order of features.

Robin
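
To make the interface questions above concrete, here is a minimal Java sketch of how a Trainer / Classifier split could look, with batch and online training kept separate. It is only one possible shape under stated assumptions: every name in it (`Example`, `Classifier`, `BatchTrainer`, `OnlineTrainer`, `TokenCountTrainer`) is hypothetical and not an existing Mahout API, features are assumed to be raw tokens as in NB/CNB, and the toy trainer merely counts token/label co-occurrences rather than implementing Naive Bayes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; none of these are existing Mahout APIs.

/** A labelled example whose features are raw tokens (as NB/CNB use them). */
final class Example {
  final String label;
  final List<String> tokens;

  Example(String label, List<String> tokens) {
    this.label = label;
    this.tokens = tokens;
  }
}

/** Scores every known label for an input. */
interface Classifier {
  Map<String, Double> score(List<String> tokens);

  /** Picks a single label by simple score comparison. */
  default String classify(List<String> tokens) {
    String best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (Map.Entry<String, Double> e : score(tokens).entrySet()) {
      if (e.getValue() > bestScore) {
        best = e.getKey();
        bestScore = e.getValue();
      }
    }
    return best;
  }
}

/** Batch trainer: sees the whole training set, then emits a model. */
interface BatchTrainer {
  Classifier train(Iterable<Example> examples);
}

/** Online trainer: updated one example at a time, usable as a classifier throughout. */
interface OnlineTrainer extends Classifier {
  void update(Example example);
}

/** Toy batch trainer that counts token/label co-occurrences (not real Naive Bayes). */
final class TokenCountTrainer implements BatchTrainer {
  @Override
  public Classifier train(Iterable<Example> examples) {
    // label -> (token -> number of times the token was seen with that label)
    Map<String, Map<String, Integer>> counts = new HashMap<>();
    for (Example ex : examples) {
      Map<String, Integer> perLabel = counts.computeIfAbsent(ex.label, k -> new HashMap<>());
      for (String token : ex.tokens) {
        perLabel.merge(token, 1, Integer::sum);
      }
    }
    // The returned model sums, per label, the counts of the input tokens.
    return tokens -> {
      Map<String, Double> scores = new HashMap<>();
      for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
        double s = 0.0;
        for (String token : tokens) {
          s += e.getValue().getOrDefault(token, 0);
        }
        scores.put(e.getKey(), s);
      }
      return scores;
    };
  }
}
```

Keeping `score` as the only abstract method means a test harness such as a confusion matrix would depend on `Classifier` alone, whichever trainer produced the model. The sketch does not address the storage question or sequence models such as HMM and CRF, which would need ordered input rather than a bag of tokens.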