From: "Hao Zheng"
To: mahout-dev@lucene.apache.org
Subject: Re: application of GSoC
Date: Fri, 21 Mar 2008 17:08:18 +0800

I understand. Actually, I meant something else. "Feature selection" may
not have been the precise term, so let me restate my question.

In general, whether for image recognition or text classification, we have
to turn the original material into a feature vector. This step is called
"feature extraction" or something like that. My question is: will this
step be part of the Mahout project? If yes, we have to care about the
transformation step; if not, all we need to process are the numbers,
which will make things easier.

On Fri, Mar 21, 2008 at 9:02 AM, Ted Dunning wrote:
>
> I think a better description is that this project is about ML algorithms
> that need large scale.
>
> If you have very inexpensive feature selection that can run sequentially,
> then it probably doesn't matter to use hadoop/mahout for that. Some forms
> of feature extraction are very expensive, however, and could definitely
> benefit from parallelism. For instance, you could imagine that the feature
> extraction step involves a large scale non-deterministic clustering. It
> might even be that the feature extraction requires parallel processing,
> but the actual learning algorithm does not.
>
> On 3/20/08 5:57 PM, "Hao Zheng" wrote:
>
> > Another question, this project is all about the ML algorithm itself?
> > all we will deal with is feature vectors/matrix constructed already?
> > that is, the project will not include feature selection part of ML,
> > e.g. extracting feature vector from a document collection?