mahout-dev mailing list archives

From "Ted Dunning (JIRA)" <>
Subject [jira] [Commented] (MAHOUT-1055) Change id fields to use LongWritable instead of IntWritable
Date Mon, 13 Aug 2012 14:16:38 GMT


Ted Dunning commented on MAHOUT-1055:

The major problem with longs is that you can't index arrays with a long.

The standard work-around is to use a hash of the long as the integer sized ID.
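
As a rough illustration of that work-around, here is a minimal sketch. It is not Mahout code: the class name LongIdHash and the helpers toIntId and toIndex are hypothetical, and it assumes Java 8 or later for Long.hashCode(long) and Math.floorMod. It folds a long id into an int and then reduces it to a valid array index.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Mahout.
final class LongIdHash {
    // Fold a 64-bit id into a 32-bit hash; Long.hashCode XORs the high and low halves.
    static int toIntId(long id) {
        return Long.hashCode(id);
    }

    // Map the hash to an array index in [0, size); floorMod keeps the result non-negative.
    static int toIndex(long id, int size) {
        return Math.floorMod(toIntId(id), size);
    }

    public static void main(String[] args) {
        double[] weights = new double[16];
        long docId = 1L << 32; // does not fit in an int
        weights[toIndex(docId, weights.length)] += 1.0;
        // Distinct longs can collide: 1L hashes to the same slot as 1L << 32.
        weights[toIndex(1L, weights.length)] += 1.0;
        System.out.println(Arrays.toString(weights));
    }
}
```

The trade-off is that hashing is lossy: two different long ids can map to the same int, so code relying on this needs to tolerate or detect collisions.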

> Change id fields to use LongWritable instead of IntWritable
> -----------------------------------------------------------
>                 Key: MAHOUT-1055
>                 URL:
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.7
>            Reporter: Markus Paaso
> Why is IntWritable used as id field type in Mahout CVB? (org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper)
> Does Long have that significant impact on performance?
> Long is much more usable as id type and int causes compatibility issues like the one
> In method org.apache.mahout.utils.vectors.lucene.Driver.getSeqFileWriter() LongWritable is used correctly as id field type.
> I suggest that every IntWritable id should be changed to LongWritable.
> Sequencefile produced by command 'mahout lucene.vector' cannot be handled by command 'mahout cvb' due to this id type incompatibility issue.
> see

This message is automatically generated by JIRA.

