mahout-user mailing list archives

From Andrew Palumbo <ap....@outlook.com>
Subject RE: Insights to Naive Bayes classifier example - 20news groups
Date Mon, 01 Dec 2014 16:43:24 GMT
Hi Jakub,

The step that you are missing is `$mahout seqdir ...`. In this step, each file in each directory
(where the directory name is the category) is converted into a sequence file of the form <Text,Text>,
where the Text key is /Category/doc_id.
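
As a rough illustration of that key naming only (this sketch does not call Mahout or write a sequence file), the following builds a tiny corpus and prints the /Category/doc_id key for each file. The category directories and document ids are made up for the example:

```shell
# Hypothetical miniature corpus: one directory per category, one file per document.
corpus="$(mktemp -d)"
mkdir -p "$corpus/comp.graphics" "$corpus/sci.space"
echo "text about rendering" > "$corpus/comp.graphics/38001"
echo "text about orbits" > "$corpus/sci.space/60001"

# Print each file's path relative to the corpus root with a leading slash,
# i.e. the /Category/doc_id form of the key described above.
keys="$(cd "$corpus" && find . -type f | sed 's|^\.||' | LC_ALL=C sort)"
echo "$keys"

# Clean up the temporary corpus.
rm -rf "$corpus"
```

The point is that the category survives in the key as long as the documents stay inside their category directories.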

`$mahout seq2sparse ...` vectorizes the output of `$mahout seqdir ...` into a sequence file
of the form <Text, VectorWritable>, leaving the keys unchanged.

`$mahout trainnb ... -el ...` then extracts the label from the keys of the training data, i.e.
the "Category" from /Category/doc_id.
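
A minimal sketch of that extraction, assuming the label is simply the first path component of the key. The helper `label_from_key` and the sample key are hypothetical; this mimics the idea and is not Mahout's own code:

```shell
# Hypothetical helper: return the category part of a /Category/doc_id key.
label_from_key() {
  key="${1#/}"                 # drop the leading slash
  printf '%s\n' "${key%%/*}"   # keep everything before the next slash
}

# Example key (made up for illustration).
label="$(label_from_key "/comp.os.ms-windows.misc/9141")"
echo "$label"
```

So the vectors themselves never carry a category field; the label travels with each record in its key.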

Please see http://mahout.apache.org/users/classification/twenty-newsgroups.html
and http://mahout.apache.org/users/classification/bayesian.html
for more information.

> Date: Mon, 1 Dec 2014 17:09:55 +0100
> Subject: Insights to Naive Bayes classifier example - 20news groups
> From: stransky.ja@gmail.com
> To: user@mahout.apache.org
> 
> Hello Mahout experts,
> 
> I am trying to follow some examples provided with Mahout and some features
> are not clear to me. It would be great if someone could clarify a bit more.
> 
> To prepare the data (train and test), the following sequence of steps is
> performed (taken from the Mahout cookbook):
> 
> All input is merged into single dir:
> *cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all*
> 
> Converted to hadoop sequence file and then vectorized:
> *./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors
> -lnorm -nv -wt tfidf*
> 
> Divided into test and train data:
> *./mahout split*
> *-i ${WORK_DIR}/20news-vectors/tfidf-vectors*
> *--trainingOutput ${WORK_DIR}/20news-train-vectors*
> *--testOutput ${WORK_DIR}/20news-test-vectors*
> *--randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential*
> 
> Model is trained:
> *./mahout trainnb*
> *-i ${WORK_DIR}/20news-train-vectors -el*
> *-o ${WORK_DIR}/model*
> *-li ${WORK_DIR}/labelindex*
> *-ow*
> 
> 
> What I am missing here, and the subject of my question, is: where is the
> category assigned to the testing data to train the categorization? What I
> would expect is that there will be a vector which says that this document
> belongs to a particular category. This seems to me to have been erased by
> the first step, where we mixed all the data to create our corpus. I would still
> expect this information to be retained somewhere. Instead, the
> messages look as follows:
> 
> From: yeoy@a.cs.okstate.edu (YEO YEK CHONG)
> Subject: Re: Is "Kermit" available for Windows 3.0/3.1?
> Organization: Oklahoma State University
> Lines: 7
> 
> From article <a4Fm3B1w165w@vicuna.ocunix.on.ca>, by Steve Frampton <
> frampton@vicuna.ocunix.on.ca>:
> > I was wondering, is the "Kermit" package (the actual package, not a
> 
> Yes!  In the usual ftp sites.
> 
> Yek CHong
> 
> 
> There is no indication of which group this text belongs to. What the heck!
> 
> Could someone please clarify a bit what's going on, since when cross-validation
> is performed the confusion matrix takes those categories into consideration?
> 
> Thanks a lot for helping me out
> Jakub
 		 	   		  