mahout-dev mailing list archives

From "Jeff Eastman (JIRA)" <>
Subject [jira] Commented: (MAHOUT-136) Change Canopy MR Implementation to use Vector Writable
Date Sat, 20 Jun 2009 00:54:07 GMT


Jeff Eastman commented on MAHOUT-136:

r786738 committed the following changes.
- Modified CanopyMapper and CanopyReducer to produce and consume Canopy centroids as Writable
  values instead of the previous format strings
- Modified CanopyMapper to specify SparseVector output from mapper
- Fixed null name hash() bug in SparseVector
- Modified Canopy.emitPointToExistingCanopies to emit only canopy id and not full serialized
- This eliminates the need for the OutputDriver and OutputMapper in the synthetic control
  example, so they are deleted.
- Updated unit tests; all tests run
- Synthetic control example runs
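
For readers unfamiliar with the Writable pattern mentioned in the changes above, here is a minimal
sketch of a binary round trip. It is not the committed code: Centroid, toBytes and fromBytes are
hypothetical names, and the class only mirrors the write(DataOutput) / readFields(DataInput) shape
of Hadoop's Writable interface using plain java.io, so it runs without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class CentroidRoundTrip {

    /** Hypothetical centroid record; an illustration, not Mahout's Canopy class. */
    static final class Centroid {
        int id;
        double[] center = new double[0];

        // Same shape as Writable.write(DataOutput): id, length, then raw doubles.
        void write(DataOutput out) throws IOException {
            out.writeInt(id);
            out.writeInt(center.length);
            for (double v : center) {
                out.writeDouble(v);
            }
        }

        // Same shape as Writable.readFields(DataInput): overwrite this instance's state.
        void readFields(DataInput in) throws IOException {
            id = in.readInt();
            center = new double[in.readInt()];
            for (int i = 0; i < center.length; i++) {
                center[i] = in.readDouble();
            }
        }
    }

    /** Serializes a centroid to its binary form. */
    static byte[] toBytes(Centroid c) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            c.write(new DataOutputStream(bytes));
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Rebuilds a centroid from its binary form. */
    static Centroid fromBytes(byte[] data) {
        try {
            Centroid c = new Centroid();
            c.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            return c;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Centroid c = new Centroid();
        c.id = 7;
        c.center = new double[] {0.123456789, 1234.56789, -3.0};

        byte[] binary = toBytes(c);
        Centroid back = fromBytes(binary);

        System.out.println(back.id + " " + Arrays.toString(back.center));
        // Two 4-byte ints plus three 8-byte doubles.
        System.out.println("binary bytes = " + binary.length); // prints "binary bytes = 32"
    }
}
```

In this sketch each double always costs 8 bytes and needs no parsing on the way back in, whereas
a string rendering varies in length with the number of digits printed.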

NOTE: When Vectors are passed between the Map and Reduce steps in Writable format, Hadoop uses
the *same instance* for all of the deserializations. I had to change the Canopy constructors
to clone() their center arguments so that the same instance would not be reused for multiple
canopies.
> Change Canopy MR Implementation to use Vector Writable
> ------------------------------------------------------
>                 Key: MAHOUT-136
>                 URL:
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.1
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>             Fix For: 0.1
> Internal serialization of Canopy currently uses asFormatString rather than just making
> the Canopy writable. This is storage inefficient.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
