mahout-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: How to convert SequenceFile<LongWritable,VectorWritable> to SequenceFile<IntWritable,VectorWritable>?
Date Wed, 25 May 2011 20:52:01 GMT
If keys are evenly distributed across the keyspace, then yes, it is a net
loss to try variable-length encoding. However, it's my impression that in
many contexts they aren't. (I actually haven't thought about this one hard.)

But in recommender-land, for example, where keys are product IDs, it's more
common to have millions of keys whose values range up to, well, a few
million than to have keys spread across the whole key space.
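
To make the byte counts concrete, here is a minimal sketch that computes how many bytes an int key would occupy under a variable-length encoding. It assumes a protobuf-style scheme (7 payload bits per byte, with zig-zag mapping for signed values); that VarIntWritable uses exactly this layout is an assumption here, and the class and method names are hypothetical, not part of Mahout or Hadoop.

```java
// Hypothetical helper, not a Mahout or Hadoop class.
// Assumes a protobuf-style varint: 7 payload bits per byte, zig-zag for signed ints.
final class VarIntSize {

    // Zig-zag maps small-magnitude signed ints to small unsigned ints
    // (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...).
    static int zigZag(int n) {
        return (n << 1) ^ (n >> 31);
    }

    // Number of bytes needed when each byte carries 7 bits of the value.
    static int unsignedVarIntLength(int value) {
        int length = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7; // unsigned shift so the loop terminates for large values
            length++;
        }
        return length;
    }

    // Encoded size of a signed int under the assumed scheme.
    static int signedVarIntLength(int n) {
        return unsignedVarIntLength(zigZag(n));
    }

    public static void main(String[] args) {
        int[] sampleIds = {50, 5_000, 1_000_000, 3_000_000, 200_000_000};
        for (int id : sampleIds) {
            System.out.println(id + " -> " + signedVarIntLength(id)
                + " byte(s) as varint, 4 as a fixed int, 8 as a fixed long");
        }
    }
}
```

Under these assumptions an ID of 1,000,000 takes 3 bytes and an ID of 3,000,000 takes 4, so the saving over a fixed 4-byte int depends on how many keys fall below roughly one million, while the saving over a fixed 8-byte long is larger.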

On Wed, May 25, 2011 at 9:37 PM, Jake Mannix <jake.mannix@gmail.com> wrote:

> On Wed, May 25, 2011 at 1:33 PM, Sean Owen <srowen@gmail.com> wrote:
>
> > (I suggest we not use IntWritable or LongWritable, but favor
> > VarIntWritable and VarLongWritable, which are variable length encoding
> > versions, where possible. Saving a couple bytes per key adds up.)
> >
>
> If you have millions to hundreds of millions of keys, how many of them are
> going to be low enough to fit in less than 4 bytes?  As soon as you have
> more than 16 million, "most" numbers take up the full 4 bytes, right?
>
>  -jake
>
