Mailing-List: contact mahout-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mahout-dev@lucene.apache.org
Message-ID: <abec10af0803210208x7c87c1d2g4c1dafa95d1ecbdd@mail.gmail.com>
Date: Fri, 21 Mar 2008 17:08:18 +0800
From: "Hao Zheng" <volezheng@gmail.com>
To: mahout-dev@lucene.apache.org
Subject: Re: application of GSoC
In-Reply-To: <C4085734.3AC76%tdunning@veoh.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <abec10af0803201757r3f53bb12h240dad1ab50b1910@mail.gmail.com>
	 <C4085734.3AC76%tdunning@veoh.com>

I understand. Actually, I meant something else. "Feature selection"
was probably not the precise term, so let me restate my question.

Generally, whether for image recognition or text classification, we
have to turn the original material into a feature vector. This step is
called "feature extraction" or something like that. My question is:
will this step be part of the Mahout project? If yes, we have to care
about the transformation step; if not, all we need to process are the
numbers, which will make things easier.
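
To make the step I am asking about concrete, here is a minimal sketch of
feature extraction for text, assuming a plain bag-of-words model. The
function name extract_features, the sample documents, and the tokenizing
rule are all made up for illustration; none of this is Mahout code.

```python
import re
from collections import Counter


def extract_features(text, vocabulary):
    """Map a raw document to a fixed-length term-count vector.

    Hypothetical helper for illustration only: one vector slot per
    vocabulary term, holding how often that term occurs in the text.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


# Made-up sample documents (the "original material").
docs = ["Mahout runs on Hadoop", "Hadoop scales; Mahout learns"]

# Build the vocabulary from every token seen in the collection.
vocabulary = sorted(
    {tok for d in docs for tok in re.findall(r"[a-z]+", d.lower())}
)

# After this point the learner only ever sees numbers.
vectors = [extract_features(d, vocabulary) for d in docs]
```

If this step is outside the project, the input to the algorithms would
be something like the vectors list above, not the docs list.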

On Fri, Mar 21, 2008 at 9:02 AM, Ted Dunning <tdunning@veoh.com> wrote:
>
>  I think a better description is that this project is about ML algorithms
>  that need large scale.
>
>  If you have very inexpensive feature selection that can run sequentially,
>  then it probably doesn't matter to use hadoop/mahout for that.  Some forms
>  of feature extraction are very expensive, however, and could definitely
>  benefit from parallelism.  For instance, you could imagine that the feature
>  extraction step involves a large scale non-deterministic clustering.  It
>  might even be that the feature extraction requires parallel processing,
>  but the actual learning algorithm does not.
>
>
>
>
>  On 3/20/08 5:57 PM, "Hao Zheng" <volezheng@gmail.com> wrote:
>
>  > Another question: is this project all about the ML algorithms themselves?
>  > Will we deal only with feature vectors/matrices that are already constructed?
>  > That is, will the project not include the feature selection part of ML,
>  > e.g. extracting feature vectors from a document collection?
>
>