mahout-dev mailing list archives

From "jiraposter@reviews.apache.org (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAHOUT-904) SplitInput should support randomizing the input
Date Tue, 13 Dec 2011 13:19:32 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168367#comment-13168367 ]

jiraposter@reviews.apache.org commented on MAHOUT-904:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3092/#review3876
-----------------------------------------------------------


Thoughts:
This class is often run from the command line, so we should add a CLI option that tells it
to randomly permute the input.

I wonder if we should make this a map-reduce job.  Perhaps we split out the existing version
and leave it as is, then add a new MR one that can do the permutation.  One idea there would
be to generate random keys (by appending a random part onto the existing key) and let the
shuffle effectively do the permutation.  Then, during the reduce phase, we simply strip the
random part off the key and output.  I don't know how badly this would hurt the shuffle, but
it seems like it would work functionally anyway.
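
A rough sketch of the random-key idea, for illustration only.  It is plain Java that
simulates the three phases in memory; it is not a Hadoop job and not the patch under review.
The class and method names (RandomKeyShuffle, map, reduce, permute) are hypothetical.  One
assumption worth flagging: the random part is put in front of the existing key rather than
after it, so that it dominates the sort order when keys are compared as whole strings.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical in-memory simulation of "permute via the shuffle"; not Mahout code.
class RandomKeyShuffle {

    // "Map" phase: tag the existing key with a fixed-width random hex prefix.
    static Map.Entry<String, String> map(String key, String value, Random random) {
        String composite = String.format("%016x_%s", random.nextLong(), key);
        return new SimpleEntry<>(composite, value);
    }

    // "Reduce" phase: strip the random part and emit the original key and value.
    static Map.Entry<String, String> reduce(Map.Entry<String, String> tagged) {
        String composite = tagged.getKey();
        // The hex prefix never contains '_', so the first '_' is the separator.
        String originalKey = composite.substring(composite.indexOf('_') + 1);
        return new SimpleEntry<>(originalKey, tagged.getValue());
    }

    static List<Map.Entry<String, String>> permute(
            List<Map.Entry<String, String>> input, long seed) {
        Random random = new Random(seed);
        List<Map.Entry<String, String>> tagged = new ArrayList<>();
        for (Map.Entry<String, String> record : input) {
            tagged.add(map(record.getKey(), record.getValue(), random));
        }
        // Stand-in for the shuffle: sorting by the composite key orders the
        // records by their random prefixes.
        tagged.sort(Map.Entry.comparingByKey());
        List<Map.Entry<String, String>> output = new ArrayList<>();
        for (Map.Entry<String, String> record : tagged) {
            output.add(reduce(record));
        }
        return output;
    }
}
```

Calling RandomKeyShuffle.permute(records, seed) returns the same records in an order
determined by the seed.  In a real job the sort would be done by the framework, and every
record would pass through the shuffle, which is the cost in question.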

Otherwise, the approach seems reasonable.  I don't know offhand if there is a better way
of doing it (even though I wish there were).

- Grant


On 2011-12-09 08:57:18, Raphael Cendrillon wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/3092/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-12-09 08:57:18)
bq.  
bq.  
bq.  Review request for Grant Ingersoll.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Early support for randomizing input in the SplitInput class. This is an early start, but I've
bq.  posted it just to check whether I'm on the right track.  A couple of comments:
bq.  
bq.    - currently the code runs through the entire file looking for the line corresponding
bq.      to the random index. This has to be repeated for every line, which is slow and somewhat ugly.
bq.    - the permutation indices are stored in an array. This could lead to scaling issues
bq.      if the number of input lines is large. This problem may also exist with ridx in the existing
bq.      code. One option is to use a linear feedback shift register to generate a permutation sequence
bq.      on the fly.
bq.  
bq.  Any suggestions would be very welcome!
bq.  
bq.  
bq.  This addresses bug MAHOUT-904.
bq.      https://issues.apache.org/jira/browse/MAHOUT-904
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    /trunk/integration/src/main/java/org/apache/mahout/utils/SplitInput.java 1212249 
bq.    /trunk/examples/src/test/java/org/apache/mahout/classifier/bayes/SplitBayesInputTest.java 1212249 
bq.  
bq.  Diff: https://reviews.apache.org/r/3092/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Raphael
bq.  
bq.
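
For reference, the linear feedback shift register option mentioned in the review request
summary could look roughly like the sketch below.  It is an illustration under stated
assumptions, not code from the patch: the class name LfsrPermutation is hypothetical, the
register is fixed at 16 bits (so it handles at most 65535 lines), and the tap mask 0xB400 is
the commonly cited maximal-length mask for a 16-bit Galois LFSR.  Register values that fall
outside the line count are skipped, so each index in [0, n) comes out exactly once without
an index array being stored.

```java
import java.util.NoSuchElementException;

// Hypothetical sketch: yields each index in [0, n) exactly once using O(1) memory.
class LfsrPermutation {
    private static final int TAPS = 0xB400;   // 16-bit maximal-length Galois taps
    private static final int PERIOD = 0xFFFF; // the register visits 1..65535 once per cycle

    private final int n;
    private int state;
    private int emitted = 0;

    LfsrPermutation(int n, int seed) {
        if (n < 0 || n > PERIOD) {
            throw new IllegalArgumentException("n must be between 0 and " + PERIOD);
        }
        this.n = n;
        int s = seed & 0xFFFF;
        this.state = (s == 0) ? 1 : s; // the all-zero state is a fixed point, so avoid it
    }

    // One Galois LFSR step: shift right, then XOR in the taps if a 1 bit fell off.
    private static int step(int state) {
        int lsb = state & 1;
        state >>>= 1;
        return (lsb != 0) ? state ^ TAPS : state;
    }

    boolean hasNext() {
        return emitted < n;
    }

    int next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        while (true) {
            int index = state - 1; // map register values 1..65535 to 0..65534
            state = step(state);
            if (index < n) {       // skip values that are not valid line indices
                emitted++;
                return index;
            }
        }
    }
}
```

Note that the seed only picks the starting point in one fixed cycle, so the orderings it can
produce are limited compared with a true shuffle, and small inputs spend most of their steps
skipping out-of-range values.  Whether that is random enough for SGD is an open question.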


                
> SplitInput should support randomizing the input
> -----------------------------------------------
>
>                 Key: MAHOUT-904
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-904
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>              Labels: MAHOUT_INTRO_CONTRIBUTE
>         Attachments: MAHOUT-904.patch
>
>
> For some learning tasks, we need the input to be randomized (SGD) instead of blocks of
> labels all at once.  SplitInput is a useful tool for setting up train/test files but it currently
> doesn't support randomizing the input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
