From Loek Cleophas
Subject 20newsgroups example/TestClassifier code - bug/oddity?
Date Thu, 18 Feb 2010 10:12:22 GMT

While playing around some more with the 20newsgroups example code for
the Bayes classifiers, I ran into an oddity and what I presume is a bug:

Instead of using (parts of) the 20 newsgroups data set, which is
split nicely into one file per newsgroup with the 'category, tab,
tokens' line format, I generated files in that format from our company
data set. However, I generated just one file to train with and one to
test with, so each file contains lines belonging to several different
categories, e.g.

cars	Ferrari red ....
animals	cow cat dog ....

In training, this works fine. In testing, it crashes TestClassifier
with a null pointer exception. I presume that is either because the
file name does not match category.txt for any category name, or
because multiple categories are used inside the single file, but I
also presume that neither should crash the thing :) It also raises
the question: if the line format in the data files already includes
the category, then why are the file names relevant at all? That seems
redundant to me. Shouldn't TestClassifier simply take all .txt files
in the input data directory and process their contents?
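
To illustrate what I mean, here is a minimal sketch of that behaviour. It is written in Python purely for brevity and is not Mahout code: read_labeled_lines is a hypothetical helper, and test.txt is an arbitrary file name. It assumes only the 'category, tab, tokens' line format described above, takes the category from each line, and ignores the file names entirely.

```python
from pathlib import Path
import tempfile


def read_labeled_lines(data_dir):
    """Yield (category, tokens) for every line of every .txt file in data_dir.

    Hypothetical helper: the category is read from the line itself,
    so the file name plays no role.
    """
    for path in sorted(Path(data_dir).glob("*.txt")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line.strip():
                    continue  # skip blank lines
                category, sep, text = line.partition("\t")
                if not sep:
                    continue  # no tab found: skip the malformed line rather than crash
                yield category, text.split()


# Usage: a single, arbitrarily named file holding lines from two categories.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "test.txt").write_text(
        "cars\tFerrari red\nanimals\tcow cat dog\n", encoding="utf-8"
    )
    examples = list(read_labeled_lines(tmp))
```

With input handled this way, a mixed-category file named anything at all would simply produce one labelled example per line, and a malformed line would be skipped instead of raising an exception.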


