mahout-dev mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: discussion of input conversions
Date Wed, 24 Aug 2011 22:35:14 GMT
somewhat -1 too. Just because :)

as far as i understand, arff just contains a way to name attributes
and present types other than double, which is why it is not DRM and
DRM is not ARFF.

I'd rather re-engineer ARFF parser if needs be.
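
To make the point about named attributes and non-double types concrete, here is a minimal sketch of reading an ARFF header. It is not Mahout's or Weka's parser; the function name `parse_arff_header` and the toy `weather` relation are made up for illustration, and it assumes attribute names contain no spaces or quotes.

```python
def parse_arff_header(lines):
    """Return (relation, [(name, (type, nominal_values))]) from ARFF header lines."""
    relation = None
    attributes = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and % comments
            continue
        keyword = line.split(None, 1)[0].lower()  # ARFF keywords are case-insensitive
        if keyword == "@relation":
            relation = line.split(None, 1)[1].strip()
        elif keyword == "@attribute":
            # Assumption: unquoted attribute names without whitespace.
            _, name, type_spec = line.split(None, 2)
            if type_spec.startswith("{"):
                # Nominal attribute: an explicit list of allowed values.
                values = [v.strip() for v in type_spec.strip("{} ").split(",")]
                attributes.append((name, ("nominal", values)))
            else:
                attributes.append((name, (type_spec.lower(), None)))
        elif keyword == "@data":
            break  # header ends where the data section starts
    return relation, attributes


# Hypothetical toy file, not taken from the thread.
header = """\
% toy example
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute note string
@data
sunny,85,'hot day'
""".splitlines()

relation, attrs = parse_arff_header(header)
```

The header carries exactly the things a plain matrix of doubles does not: a name per column, and nominal and string types alongside numeric ones.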



On Wed, Aug 24, 2011 at 3:16 PM, Jake Mannix <jake.mannix@gmail.com> wrote:
> My initial inclination is -1 on adding a GPL dependency.
>
> Can you spell out exactly what is meant by needing a "general input format"
> and "general transfer format"?  We currently take in raw text, and then
> vectorize it.  Are Vectors (with either hashed encoding, or with a
> dictionary file) not suitable as a format for some reason?
>
>  -jake
>
> On Wed, Aug 24, 2011 at 3:09 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
>> Praneet and I were just talking about a project he is working on to do with
>> higher-order learning methods such as boosting and feature sharding.  This
>> is all pretty much in the context of classification and possibly
>> clustering.
>>
>> The problems are:
>>
>> a) mahout doesn't have a general input format for classifiable data (this
>> has been discussed recently)
>>
>> b) hashed vector representations are not suitable for feature sharding
>> since individual features may be redundantly represented in many locations.
>>
>> c) mahout doesn't have a reasonable data structure for general data
>> transfer (related to -a-)
>>
>> One possible thought is that Mahout could introduce Weka as a dependency.
>>
>> The virtues would be:
>>
>> 1) Weka has ARFF as a data format and Instance as an object to satisfy (a)
>> and (c)
>>
>> 2) Weka provides a bunch of simple classifier algorithms which are not
>> individually scalable, but might be made to be so by model averaging or
>> feature sharding.
>>
>> 3) Praneet could finish his project very quickly.
>>
>> Any thoughts about this?
>>
>> The problems that I see with this include:
>>
>> A) Weka is GPL which might slow adoption of Mahout and would certainly
>> inhibit direct incorporation of any piece of Weka
>>
>> B) Weka appears to have not caught the maven bug which makes it harder to
>> add as a dependency without actually distributing the weka jar.
>>
>> One possible work-around might be to reverse engineer something like
>> Instance and an ARFF reader/writer.
>>
>
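
Point (b) in the quoted message can be illustrated with a small sketch of hashed encoding with more than one probe per feature. This is not Mahout's encoder; `hashed_positions`, `encode`, the MD5-based hash, and the example feature names are all assumptions chosen only to show one feature landing in several slots.

```python
import hashlib


def hashed_positions(feature, size, probes=2):
    """Indices a feature name is hashed to, one index per probe."""
    positions = []
    for probe in range(probes):
        # Assumption: salt the name with the probe number to get distinct hashes.
        digest = hashlib.md5(f"{feature}#{probe}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:4], "big") % size)
    return positions


def encode(features, size, probes=2):
    """Add each feature's weight at every one of its hashed positions."""
    vector = [0.0] * size
    for name, weight in features.items():
        for index in hashed_positions(name, size, probes):
            vector[index] += weight
    return vector


# Hypothetical features; each one contributes to `probes` slots of the vector.
features = {"color=red": 1.0, "size=large": 2.0}
vector = encode(features, size=16, probes=2)
slots = {name: hashed_positions(name, 16) for name in features}
```

Because a single feature can occupy several indices, and several features can share one index, splitting the vector by index range does not split it cleanly by feature, which is the difficulty for feature sharding.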