Date: Wed, 24 Aug 2011 15:35:14 -0700
Subject: Re: discussion of input conversions
From: Dmitriy Lyubimov
To: dev@mahout.apache.org
Cc: praneet mhatre

somewhat -1 too. Just because :)

as far as i understand, arff just contains a way to name attributes and
present types other than double, which is why it is not DRM and DRM is not
ARFF. I'd rather re-engineer ARFF parser if needs be.

On Wed, Aug 24, 2011 at 3:16 PM, Jake Mannix wrote:

> My initial inclination is -1 on adding a GPL dependency.
>
> Can you spell out exactly what is meant by needing a "general input format"
> and "general transfer format".  We currently take in raw text, and then
> vectorize it.  Are Vectors (with either hashed encoding, or with a
> dictionary file) not suitable as a format for some reason?
>
>  -jake
>
> On Wed, Aug 24, 2011 at 3:09 PM, Ted Dunning wrote:
>
>> Praneet and I were just talking about a project he is working on to do with
>> higher-order learning methods such as boosting and feature sharding.  This
>> is all pretty much in the context of classification and possibly
>> clustering.
>>
>> The problems are:
>>
>> a) mahout doesn't have a general input format for classifiable data (this
>> has been discussed recently)
>>
>> b) hashed vector representations are not suitable for feature sharding
>> since individual features may be redundantly represented in many locations.
>>
>> c) mahout doesn't have a reasonable data structure for general data
>> transfer (related to -a-)
>>
>> One possible thought is that Mahout could introduce Weka as a dependency.
>>
>> The virtues would be:
>>
>> 1) Weka has ARFF as a data format and Instance as an object to satisfy (a)
>> and (c)
>>
>> 2) Weka provides a bunch of simple classifier algorithms which are not
>> individually scalable, but might be made to be so by model averaging or
>> feature sharding.
>>
>> 3) Praneet could finish his project very quickly.
>>
>> Any thoughts about this?
>>
>> The problems that I see with this include:
>>
>> A) Weka is GPL which might slow adoption of Mahout and would certainly
>> inhibit direct incorporation of any piece of Weka
>>
>> B) Weka appears to have not caught the maven bug which makes it harder to
>> add as a dependency without actually distributing the weka jar.
>>
>> One possible work-around might be to reverse engineer something like
>> Instance and an ARFF reader/writer.
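
For illustration only, the work-around mentioned at the end of the thread (an Instance-like object plus an ARFF reader, written without any Weka code) might look roughly like the sketch below. Every name in it (ArffSketch, Attribute, Instance, Dataset, parse) is hypothetical and is neither a Weka nor a Mahout API. It assumes dense rows with only numeric and nominal attributes, and does not handle quoted names, missing values ("?"), sparse rows, or string/date types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of a tiny ARFF reader; not Weka's or Mahout's API. */
final class ArffSketch {

  /** A named attribute; nominalValues is null for numeric attributes. */
  static final class Attribute {
    final String name;
    final List<String> nominalValues;

    Attribute(String name, List<String> nominalValues) {
      this.name = name;
      this.nominalValues = nominalValues;
    }

    boolean isNumeric() {
      return nominalValues == null;
    }
  }

  /** One data row; a nominal cell is stored as its index in the value list. */
  static final class Instance {
    final double[] values;

    Instance(double[] values) {
      this.values = values;
    }
  }

  /** Header (attributes) plus rows (instances). */
  static final class Dataset {
    final List<Attribute> attributes = new ArrayList<>();
    final List<Instance> instances = new ArrayList<>();
  }

  static Dataset parse(String arff) {
    Dataset ds = new Dataset();
    boolean inData = false;
    for (String raw : arff.split("\\r?\\n")) {
      String line = raw.trim();
      if (line.isEmpty() || line.startsWith("%")) {
        continue; // skip blank lines and ARFF comments
      }
      String lower = line.toLowerCase(Locale.ROOT);
      if (inData) {
        ds.instances.add(parseRow(line, ds.attributes));
      } else if (lower.startsWith("@attribute")) {
        ds.attributes.add(parseAttribute(line));
      } else if (lower.startsWith("@data")) {
        inData = true;
      }
      // Other header lines such as @relation are ignored in this sketch.
    }
    return ds;
  }

  private static Attribute parseAttribute(String line) {
    String[] parts = line.split("\\s+", 3); // keyword, name, type
    if (parts.length < 3) {
      throw new IllegalArgumentException("Malformed attribute: " + line);
    }
    String type = parts[2].trim();
    if (type.startsWith("{") && type.endsWith("}")) {
      List<String> values = new ArrayList<>();
      for (String v : type.substring(1, type.length() - 1).split(",")) {
        values.add(v.trim());
      }
      return new Attribute(parts[1], values);
    }
    String t = type.toLowerCase(Locale.ROOT);
    if (t.equals("numeric") || t.equals("real") || t.equals("integer")) {
      return new Attribute(parts[1], null);
    }
    throw new IllegalArgumentException("Unsupported attribute type: " + type);
  }

  private static Instance parseRow(String line, List<Attribute> attributes) {
    String[] cells = line.split(",");
    if (cells.length != attributes.size()) {
      throw new IllegalArgumentException("Wrong number of values: " + line);
    }
    double[] values = new double[cells.length];
    for (int i = 0; i < cells.length; i++) {
      String cell = cells[i].trim();
      Attribute a = attributes.get(i);
      if (a.isNumeric()) {
        values[i] = Double.parseDouble(cell);
      } else {
        int index = a.nominalValues.indexOf(cell);
        if (index < 0) {
          throw new IllegalArgumentException("Unknown nominal value: " + cell);
        }
        values[i] = index;
      }
    }
    return new Instance(values);
  }

  public static void main(String[] args) {
    Dataset ds = parse(
        "@relation weather\n"
            + "@attribute outlook {sunny, overcast, rainy}\n"
            + "@attribute temperature numeric\n"
            + "@data\n"
            + "sunny,85\n"
            + "rainy,70.5\n");
    System.out.println(ds.attributes.size() + " attributes, "
        + ds.instances.size() + " instances");
  }
}
```

Storing nominal values as indices keeps each row a plain array of doubles, while the attribute names and non-double types live in the header. That split is the difference between ARFF and a matrix of doubles pointed out at the top of the thread.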