mahout-dev mailing list archives

From Karl Wettin <>
Subject Re: Cross validation
Date Tue, 01 Apr 2008 18:02:55 GMT
By a "variable set", do you mean a subset of features (attributes, 
columns) to be evaluated with cross validation?

I really know too little and have to read up a bit in the Weka book (Data 
Mining, 2nd edition, pages 420-425) and see whether some algorithms 
make more sense than others.

It would be nice to feature-select and rank a 100K-feature-wide 
ngram Lucene index with various classes.
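To make the idea concrete, here is a minimal sketch of the fold loop that any such feature-subset scoring would repeat. All names are illustrative, not a real Mahout or Weka API, and the majority-label "classifier" is a deliberate stand-in for whatever learner would actually score a feature subset:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: score data with K-fold cross validation. In a
// real feature-selection loop, this score would be computed once per
// candidate feature subset and the best-scoring subsets kept.
public class SubsetSelectionSketch {

    // Trivial stand-in classifier: predict the majority label of the
    // training fold. A real learner would also receive the feature
    // subset under evaluation.
    static int majorityLabel(List<int[]> train) {
        int ones = 0;
        for (int[] row : train) {
            if (row[row.length - 1] == 1) ones++;
        }
        return ones * 2 >= train.size() ? 1 : 0;
    }

    // K-fold cross-validated accuracy: each record lands in exactly
    // one test fold; the other K-1 folds form the training set.
    static double crossValidate(List<int[]> data, int k) {
        int correct = 0;
        for (int fold = 0; fold < k; fold++) {
            List<int[]> train = new ArrayList<>();
            List<int[]> test = new ArrayList<>();
            for (int i = 0; i < data.size(); i++) {
                (i % k == fold ? test : train).add(data.get(i));
            }
            int prediction = majorityLabel(train);
            for (int[] row : test) {
                if (row[row.length - 1] == prediction) correct++;
            }
        }
        return correct / (double) data.size();
    }

    public static void main(String[] args) {
        // Toy data: last column is the class label.
        List<int[]> data = Arrays.asList(
            new int[]{1, 0, 1}, new int[]{0, 1, 1},
            new int[]{1, 1, 1}, new int[]{0, 0, 0},
            new int[]{1, 0, 0}, new int[]{0, 1, 1});
        System.out.println(crossValidate(data, 3));
    }
}
```

Since the K folds are independent given the data, the outer loop is exactly the part that parallelizes, as discussed below.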


Ted Dunning skrev:
> Yes.  This is a form of model selection.  It would be plausible to run the
> cross-folds and learning in parallel.  Cross validation would only give
> small parallelism, but if you have several hundred variable sets, that
> becomes plausible. 
> This raises the question of what the right map-reduce architecture would be
> for this sort of problem.  Should there be a special input format that
> reads input records with a test/train/fold# key or column?  The thought
> would be that normal sequential learning could be done in the reducer, or
> the folded data could be passed to separate learning algorithms.
> On 3/31/08 9:08 AM, "Karl Wettin" <> wrote:
>> Paul Elschot skrev:
>>> Op Monday 31 March 2008 15:43:03 schreef Karl Wettin:
>>>> Paul Elschot skrev:
>>>>> Parallelizing cross validation may also be trivial, but it would be
>>>>> quite useful.
>>>> I know it can be used for feature selection. What else is there?
>>> Actually, I meant no more than K-fold cross validation:
>>> It nicely parallelizes to a factor of K.
>> Ah, OK.
>> I mean that many feature selection algorithms are more or less a series
>> of cross-fold validations using some classifier on either a single
>> attribute or a subset of the available attributes.
>>     karl
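
The fold-keyed record idea Ted raises could be sketched like this: a deterministic hash of a record id assigns each record to one of K folds, so a mapper can tag records with a fold number and a reducer can train on K-1 folds while holding out the Kth. This is an assumption-laden illustration, not a real Hadoop InputFormat; only the key assignment is shown:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch of deterministic fold assignment for a map-reduce
// cross-validation job: the same record id always hashes to the
// same fold, so no fold-membership table needs to be shared.
public class FoldKeySketch {

    static int fold(String recordId, int k) {
        CRC32 crc = new CRC32();
        crc.update(recordId.getBytes(StandardCharsets.UTF_8));
        // CRC32 values are non-negative, so the modulus is a valid
        // fold index in [0, k).
        return (int) (crc.getValue() % k);
    }

    public static void main(String[] args) {
        int k = 10;
        for (String id : new String[]{"doc-1", "doc-2", "doc-3"}) {
            System.out.println(id + " -> fold " + fold(id, k));
        }
    }
}
```

Emitting the fold number as (part of) the map output key would then let each of the K training jobs, or several hundred variable-set evaluations, run as independent reduce tasks.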
