mahout-dev mailing list archives

From "Raphael Cendrillon (Commented) (JIRA)" <>
Subject [jira] [Commented] (MAHOUT-904) SplitInput should support randomizing the input
Date Thu, 22 Dec 2011 15:33:31 GMT


Raphael Cendrillon commented on MAHOUT-904:

Thanks Grant. I was wondering the same thing, for example supporting randomSelectionSize in
addition to randomSelectionPct. However, supporting size-based splits may not be quite so straightforward,
since the total size is generally unknown if the SequenceFile is large, and it's also split across mappers.
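
To illustrate why the percentage-based option is the easy case, here is a minimal sketch (not the code in the attached patch) of a per-record random split. The class and method names (PctSplitSketch, splitByPct) and the seed parameter are hypothetical; only the idea of a randomSelectionPct-style percentage comes from the discussion above. Each record is assigned independently, so nothing needs to know the total record count, which is what would let the same decision be made inside separate mappers. An exact size-based selection would instead need the count up front or some form of coordinated sampling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class PctSplitSketch {

  /** Holds the two partitions produced by a split. */
  public static final class Split<T> {
    public final List<T> training = new ArrayList<>();
    public final List<T> test = new ArrayList<>();
  }

  /**
   * Assigns each record to the test set with probability testPct/100,
   * otherwise to the training set. The decision for a record depends only
   * on the random draw, never on how many records exist in total.
   */
  public static <T> Split<T> splitByPct(Iterable<T> records, int testPct, long seed) {
    if (testPct < 0 || testPct > 100) {
      throw new IllegalArgumentException("testPct must be in [0, 100]: " + testPct);
    }
    Random rng = new Random(seed); // seeded so the sketch is reproducible
    Split<T> out = new Split<>();
    for (T record : records) {
      if (rng.nextInt(100) < testPct) {
        out.test.add(record);
      } else {
        out.training.add(record);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Integer> records = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      records.add(i);
    }
    Split<Integer> split = splitByPct(records, 20, 42L);
    // The test set size is only approximately 20% of the input.
    System.out.println("training=" + split.training.size() + " test=" + split.test.size());
  }
}
```

Note that the resulting test set is only approximately the requested percentage of the input, which is the trade-off for not needing the total size.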

I also would have liked to have the training and test outputs go to different directories
(instead of just using different filename prefixes), but this is not quite so straightforward
due to issues with the new API (unless I just write to the SequenceFile by hand in the reducer,
which raises its own issues).  I think this can be made a little neater once we move to Hadoop

Is there something else that you had in mind?

> SplitInput should support randomizing the input
> -----------------------------------------------
>                 Key: MAHOUT-904
>                 URL:
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Grant Ingersoll
>            Assignee: Raphael Cendrillon
>              Labels: MAHOUT_INTRO_CONTRIBUTE
>         Attachments: MAHOUT-904.patch, MAHOUT-904.patch, MAHOUT-904.patch, MAHOUT-904.patch,
> For some learning tasks, we need the input to be randomized (SGD) instead of blocks of
> labels all at once.  SplitInput is a useful tool for setting up train/test files but it currently
> doesn't support randomizing the input.
