mahout-user mailing list archives

From Chris Schilling
Subject feature vector encoding in Mahout
Date Tue, 14 Dec 2010 23:40:31 GMT

After going through the newest chapters in MIA (very helpful, btw), I have a few questions
that I think I know the answers to, but I wanted to get some confirmation.

Let's say that I have a list of documents and my own pipeline for feature extraction.  So,
for each document I have a list of keywords (and multi-keyword phrases) and corresponding
weights.  Each document is now just a list of keyword phrases and weights, i.e.

phrase1   wt1
phrase2   wt2
phrase3   wt3

I would like to use Mahout to train document classifiers using the phrases and weights in
these files.
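
To make the input format concrete, here is a minimal sketch of how I would read one of these
files into a phrase -> weight map.  The class name PhraseWeightParser is just something I made
up for illustration, and it assumes the weight is always the last whitespace-separated token
on the line (since phrases can contain spaces) and that repeated phrases should have their
weights summed:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout.
public class PhraseWeightParser {

    // Parses lines of the form "<phrase><whitespace><weight>".
    // The phrase may contain spaces, so the weight is taken to be the last token.
    public static Map<String, Double> parse(List<String> lines) {
        Map<String, Double> features = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            int split = lastWhitespace(trimmed);
            if (split < 0) {
                throw new IllegalArgumentException("No weight on line: " + line);
            }
            String phrase = trimmed.substring(0, split).trim();
            double weight = Double.parseDouble(trimmed.substring(split + 1));
            // Assumption: a phrase listed twice gets the sum of its weights.
            features.merge(phrase, weight, Double::sum);
        }
        return features;
    }

    // Index of the last whitespace character in s, or -1 if there is none.
    private static int lastWhitespace(String s) {
        for (int i = s.length() - 1; i >= 0; i--) {
            if (Character.isWhitespace(s.charAt(i))) {
                return i;
            }
        }
        return -1;
    }
}
```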

Looking at the TrainNewsGroups code in o.a.m.classifier.sgd, it looks like I can just use
the encoder class for these phrases and weights.  Something like this:

RecordValueEncoder encoder =
	new StaticWordValueEncoder("variable-name");
for (DataRecord ex : trainingData) {
	Vector v = new RandomAccessSparseVector(10000);
	String word = ex.get("variable-name");
	// hash the phrase into the sparse vector
	encoder.addToVector(word, v);
}

Does this make sense?
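
In case it helps to show what I think the encoder is doing with my phrases and weights, here
is a plain-Java sketch of the hashing idea as I understand it.  This is not Mahout code: the
class HashedEncoderSketch and its methods are made up, it uses String.hashCode() where Mahout
presumably uses its own hash function, and it uses a dense array instead of a sparse vector.
Each phrase is hashed to a slot in a fixed-size vector and its weight is added there, so
colliding phrases simply share a slot:

```java
import java.util.Map;

// Hypothetical illustration of hashed feature encoding; not Mahout's implementation.
public class HashedEncoderSketch {

    // Maps a (variable name, phrase) pair to a slot in [0, cardinality).
    public static int indexFor(String name, String phrase, int cardinality) {
        int h = (name + ":" + phrase).hashCode();
        return Math.floorMod(h, cardinality); // floorMod keeps the index non-negative
    }

    // Adds each phrase's weight at its hashed slot; collisions accumulate.
    public static double[] encode(String name, Map<String, Double> features, int cardinality) {
        double[] v = new double[cardinality];
        for (Map.Entry<String, Double> e : features.entrySet()) {
            v[indexFor(name, e.getKey(), cardinality)] += e.getValue();
        }
        return v;
    }
}
```

If this picture is right, then what I need from the real encoder is a way to pass my own
weight along with each phrase instead of a default weight.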

I would like to compare the results of SGD and Naive Bayes classification using this data.
However, I am unclear about the vector formation process in Naive Bayes.  I have prepared some
input for the Bayes classifier using the prepare20newsgroups "macro" - I was able to get my data
into a format similar to the 20 newsgroups dataset.  I guess my main question is: can I use
Naive Bayes if I already have the features (the phrases above) and weights that I want to use
for training?
