mahout-dev mailing list archives

From "Raphael Cendrillon" <cendrillon1...@gmail.com>
Subject Re: Review Request: Support for Randomizing Input in SplitInput Class
Date Thu, 22 Dec 2011 16:03:47 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3092/
-----------------------------------------------------------

(Updated 2011-12-22 16:03:47.932528)


Review request for mahout, Ted Dunning, lancenorskog, and Grant Ingersoll.


Changes
-------

Grant's changes


Summary
-------

Early support for randomizing input in the SplitInput class. This is only a first pass, but
I've posted it to check whether I'm on the right track. A couple of comments:

  - Currently the code scans the entire file to find the line corresponding to each random
    index. This has to be repeated for every line, which is slow and somewhat ugly.
  - The permutation indices are stored in an array. This could lead to scaling issues if the
    number of input lines is large. This problem may also exist with ridx in the existing
    code. One option is to use a linear feedback shift register to generate a permutation
    sequence on the fly.
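
To make the linear feedback shift register option concrete, here is a minimal sketch of the idea; it is not the code in this patch. The class name LfsrPermutation and its methods are hypothetical and are not part of Mahout. It assumes a Galois LFSR using the commonly published maximal-length feedback masks for 4, 8 and 16 bits, so it only handles up to 65535 lines; larger inputs would need a wider register with its own known mask. States that fall outside the requested range are skipped, so every index in [0, size) comes out exactly once while only a single int of state is kept.

```java
import java.util.NoSuchElementException;

/** Hypothetical helper (not part of Mahout): visits 0..size-1 once each in a scrambled order. */
public class LfsrPermutation {
    // Register widths and matching Galois feedback masks for maximal-length LFSRs.
    // Assumption: only these three widths are supported in this sketch.
    private static final int[] WIDTHS = {4, 8, 16};
    private static final int[] MASKS = {0xC, 0xB8, 0xB400};

    private final int size;
    private final int mask;
    private int state;    // current register contents, always in 1..period
    private int emitted;  // how many indices have been returned so far

    public LfsrPermutation(int size, int seed) {
        if (size < 0) {
            throw new IllegalArgumentException("size must be non-negative");
        }
        int chosen = -1;
        for (int i = 0; i < WIDTHS.length; i++) {
            int period = (1 << WIDTHS[i]) - 1;  // a maximal LFSR visits 1..period
            if (period >= size) {
                chosen = i;
                break;
            }
        }
        if (chosen < 0) {
            throw new IllegalArgumentException("size too large for this sketch: " + size);
        }
        this.size = size;
        this.mask = MASKS[chosen];
        int period = (1 << WIDTHS[chosen]) - 1;
        // The seed only selects where in the fixed cycle we start; the state must never be 0.
        this.state = Math.floorMod(seed, period) + 1;
    }

    public boolean hasNext() {
        return emitted < size;
    }

    public int next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        int value;
        do {
            value = state - 1;       // map states 1..period onto 0..period-1
            int lsb = state & 1;
            state >>>= 1;            // one Galois LFSR step
            if (lsb != 0) {
                state ^= mask;
            }
        } while (value >= size);     // skip states outside the requested range
        emitted++;
        return value;
    }
}
```

A caller would construct new LfsrPermutation(numLines, seed) and pull indices with next() until hasNext() returns false. Note that the ordering is one fixed cycle per register width and the seed only changes the starting point, so this trades shuffle quality for constant memory. It also does nothing about the cost of locating each line in the file.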

Any suggestions would be very welcome!


This addresses bug MAHOUT-904.
    https://issues.apache.org/jira/browse/MAHOUT-904


Diffs (updated)
-----

  /trunk/core/src/main/java/org/apache/mahout/common/AbstractJob.java 1222286 
  /trunk/integration/src/main/java/org/apache/mahout/utils/SplitInput.java 1222286 
  /trunk/integration/src/main/java/org/apache/mahout/utils/SplitInputJob.java PRE-CREATION
  /trunk/integration/src/test/java/org/apache/mahout/utils/SplitInputTest.java 1222286 

Diff: https://reviews.apache.org/r/3092/diff


Testing
-------


Thanks,

Raphael

