mahout-dev mailing list archives

From "Robin Anil (JIRA)" <j...@apache.org>
Subject [jira] Created: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer
Date Tue, 05 Jan 2010 02:46:55 GMT
Map/Reduce Implementation of Document Vectorizer
------------------------------------------------

                 Key: MAHOUT-237
                 URL: https://issues.apache.org/jira/browse/MAHOUT-237
             Project: Mahout
          Issue Type: New Feature
    Affects Versions: 0.3
            Reporter: Robin Anil
            Assignee: Robin Anil
             Fix For: 0.3


The current Vectorizer uses a Lucene index to convert documents into SparseVectors.
Ted is working on a hash-based Vectorizer that maps features into Vectors of a fixed size
and sums them to get the document Vector.
This issue proposes a pure bag-of-words Vectorizer written in Map/Reduce.
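
For contrast with the dictionary-based approach proposed here, the hash-based idea mentioned above can be sketched roughly as follows. This is only an illustration in Python, not Ted's implementation: the function name hashed_vector, the CRC32 hash, whitespace tokenization and the default vector size are all assumptions.

```python
import zlib


def hashed_vector(content, size=16):
    """Map each token to a slot of a fixed-size vector and sum the counts.

    Hypothetical sketch: CRC32 and whitespace tokenization are assumptions,
    chosen only because they are deterministic and need no dictionary.
    """
    vec = [0.0] * size
    for token in content.lower().split():
        # Different tokens may collide in the same slot; that is the
        # trade-off for not having to build a feature dictionary first.
        index = zlib.crc32(token.encode("utf-8")) % size
        vec[index] += 1.0
    return vec
```

Because no feature dictionary is needed, this takes a single pass over each document, at the cost of possible collisions between features.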

The input is a SequenceFile<Text,Text> with key = docid, value = content.
First: Map/Reduce over the document collection to generate the feature counts.
Second: a sequential pass reads the output of the first Map/Reduce and converts it to
SequenceFile<Text,LongWritable>, where key = feature, value = unique id.
    This stage should also split the features into shards of a given split size.
Third: Map/Reduce over the document collection once per shard, creating partial SparseVectors
(containing only the features of the given shard).
Fourth: Map/Reduce over the partial vectors, grouping by docid, to create the full document Vector.
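
The four stages above can be sketched in memory as follows. This is a minimal single-process illustration in Python, not the proposed Hadoop implementation: plain dicts stand in for the SequenceFiles, and every function name, the whitespace tokenizer and the alphabetical id assignment are assumptions made for the example.

```python
from collections import Counter, defaultdict


def count_features(docs):
    """Stage 1: feature counts over the whole collection.
    (In Map/Reduce terms: map emits (token, 1), reduce sums.)"""
    counts = Counter()
    for content in docs.values():
        counts.update(content.lower().split())
    return counts


def build_shards(counts, shard_size):
    """Stage 2: sequential pass that gives each feature a unique id and
    splits the feature -> id dictionary into shards of shard_size entries.
    Assigning ids in alphabetical order is an assumption of this sketch."""
    shards = []
    for feature_id, feature in enumerate(sorted(counts)):
        if feature_id % shard_size == 0:
            shards.append({})
        shards[-1][feature] = feature_id
    return shards


def partial_vectors(docs, shard):
    """Stage 3: for one shard, emit (docid, partial sparse vector) holding
    only the features that belong to that shard."""
    out = []
    for docid, content in docs.items():
        vec = Counter(shard[t] for t in content.lower().split() if t in shard)
        if vec:
            out.append((docid, dict(vec)))
    return out


def merge_partials(partials):
    """Stage 4: group partial vectors by docid and merge them.
    Shards hold disjoint feature ids, so a plain update is enough."""
    full = defaultdict(dict)
    for docid, vec in partials:
        full[docid].update(vec)
    return dict(full)


def vectorize(docs, shard_size):
    counts = count_features(docs)
    shards = build_shards(counts, shard_size)
    partials = [p for shard in shards for p in partial_vectors(docs, shard)]
    return merge_partials(partials)
```

With docs = {"d1": "a b a", "d2": "b c"} and a shard size of 2, the dictionary is split into two shards, and the partial vectors for d2 come from both shards before being merged in the last stage. Sharding keeps the dictionary held in memory during the third stage bounded by the split size.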

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

