mahout-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: [jira] [Commented] (MAHOUT-904) SplitInput should support randomizing the input
Date Wed, 14 Dec 2011 18:27:18 GMT
On Wed, Dec 14, 2011 at 1:01 PM, Raphael Cendrillon <
cendrillon1978@gmail.com> wrote:

> Thanks Lance. If I understand you correctly you're proposing the following:
>
> Map: (K1,V1) -> (K2,V2)
>  V2 = V1
>  K2 = hashcode(K1)
>

Preserving K1 may be important.  In that case you may prefer


>  emit(K2,V2)
>

emit(K2, [K1, V1])


>
> Combine: (K2,V2) -> (K3,V3)
> (e.g. if we want to keep 10% of samples)
>  if (K2 % 10 == 0) {
>

Why not keep this in the mapper?

>
> Also I'm wondering if we can do downsampling at the mapper? Would that be
> more efficient?
>

Yes.  It would be.
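
The two suggestions above (emit the original key alongside the value, and sample in the mapper rather than the combiner) can be sketched roughly as follows. This is only an illustration in plain Java with no Hadoop dependency, not the MAHOUT-904 patch. The class `MapperSideSampler` and its methods `shuffleKey`, `keep`, and `map` are hypothetical names, records are assumed to be `String` key/value pairs, and `String.hashCode()` stands in for `hashcode(K1)`. A stronger hash would likely scatter similar keys better.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of map-side downsampling; not part of Mahout.
final class MapperSideSampler {

    // Stand-in for hashcode(K1). Assumption: String.hashCode() is good enough here.
    static int shuffleKey(String k1) {
        return k1.hashCode();
    }

    // Keep roughly one record in keepOneIn, e.g. keepOneIn = 10 for a 10% sample.
    // floorMod keeps the test well defined for negative hash codes.
    static boolean keep(int k2, int keepOneIn) {
        return Math.floorMod(k2, keepOneIn) == 0;
    }

    // Map: (K1, V1) -> (K2, [K1, V1]), dropping unsampled records before they are emitted.
    // Each input record is a two-element array {K1, V1}.
    static List<Map.Entry<Integer, String[]>> map(List<String[]> records, int keepOneIn) {
        List<Map.Entry<Integer, String[]>> out = new ArrayList<>();
        for (String[] kv : records) {
            String k1 = kv[0];
            String v1 = kv[1];
            int k2 = shuffleKey(k1);
            if (keep(k2, keepOneIn)) {
                // Carry K1 along with V1 so the original key survives the shuffle.
                out.add(new SimpleImmutableEntry<>(k2, new String[] {k1, v1}));
            }
        }
        return out;
    }
}
```

Filtering before the emit means discarded records are never serialized or handed to a combiner, which is where the efficiency gain would come from.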
