mahout-user mailing list archives

From Stuart Smith <stu24m...@yahoo.com>
Subject Re: Diagnosing naive bayes results - now I'm really stumped
Date Sun, 29 Jan 2012 23:43:32 GMT
Hello,

So I eliminated the feature that was basically a document ID, and I'm still getting the
same results.

Based on what's been said on this thread, this should not happen, because every document
should always be classified into some category:
12/01/29 15:30:25 INFO bayes.TestClassifier: Loading model from: {basePath=/user/stu/machine_learning/bayes/model, classifierType=bayes, alpha_i=1.0, dataSource=hdfs, gramSize=1, verbose=false, confusionMatrix=null, encoding=UTF-8, defaultCat=unknown, testDirPath=/user/stu/machine_learning/bayes/category-test-data}
12/01/29 15:30:25 INFO bayes.TestClassifier: Testing Bayes Classifier
12/01/29 15:30:27 INFO bayes.SequenceFileModelReader: Read 50000 feature weights
12/01/29 15:30:27 INFO bayes.SequenceFileModelReader: Read 100000 feature weights
12/01/29 15:30:27 INFO bayes.SequenceFileModelReader: Read 150000 feature weights
12/01/29 15:30:28 INFO bayes.SequenceFileModelReader: Read 200000 feature weights
12/01/29 15:30:28 INFO bayes.SequenceFileModelReader: 1069718.2183796456
12/01/29 15:30:30 INFO bayes.InMemoryBayesDatastore: good -1537123.539470884 1845854.5550999944 -0.8327435849286697
12/01/29 15:30:30 INFO bayes.InMemoryBayesDatastore: bad -1845854.5550999944 1845854.5550999944 -1.0
12/01/29 15:30:30 INFO bayes.TestClassifier: =======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :          0	         �%

Yet this is what I get from a 90/10 split of the data, made with the splitBayesInput class
from Taming Text.


So I'm stumped. I don't even really know where to begin debugging this.


And just to rule out the most obvious bonehead mistake:

hadoop dfs -ls /user/stu/machine_learning/bayes/category-test-data/
Found 2 items
-rw-r--r--   3 stu supergroup  108810564 2012-01-29 14:50 /user/stu/machine_learning/bayes/category-test-data/bad
-rw-r--r--   3 stu supergroup   38614032 2012-01-29 14:50 /user/stu/machine_learning/bayes/category-test-data/good

Here are a couple of snippets from my seqdump:

Key class: class org.apache.mahout.common.StringTuple Value Class: class org.apache.hadoop.io.DoubleWritable
Key: [__WT, bad, 0_lockit]: Value: 42.99318841395318
Key: [__WT, bad, 0_winit]: Value: 49.148550010941364
Key: [__WT, bad, 0x10,0x12,0x13,0x17]: Value: 52.495103287942825
Key: [__WT, bad, 0x10,0x13a]: Value: 11.538787093822286
Key: [__WT, bad, 0x1000040]: Value: 0.07495396643707189
Key: [__WT, bad, 0x1001c]: Value: 0.12800826729901066

Key: [__WT, good, 0array]: Value: 10.481077499671203
Key: [__WT, good, 0cudvdcapturework]: Value: 0.10344809179965245
Key: [__WT, good, 0pav1]: Value: 0.23050782000541226
Key: [__WT, good, 0x1]: Value: 1342.2191134942075
Key: [__WT, good, 0x10000]: Value: 243.74351518918098

Ted,
If you're interested, I can send the whole seqdump file to you privately, but I'm a little
wary of posting it to the whole list at this point. Once I understand the problem better,
I may find that sharing the information won't hurt anything.


Thoughts?

Take care,
  -stu




________________________________
 From: Ted Dunning <ted.dunning@gmail.com>
To: user@mahout.apache.org; Stuart Smith <stu24mail@yahoo.com> 
Sent: Saturday, January 28, 2012 12:36 PM
Subject: Re: Diagnosing naive bayes results
 
It always tells you the most likely category, but you can redefine the
output to only trigger if the most likely category really dominates the
results.

With two categories, this is reasonable.  For a dozen it is much more
debatable.

This works with the SGD classifiers as well and I have seen this used in a
multi-level classifier.
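
A minimal sketch of the thresholding idea described above, not Mahout's API: it assumes you already have one score per category where higher means more likely (for example log-probabilities). The function name classify_with_margin and the min_margin parameter are hypothetical; the "unknown" fallback mirrors the defaultCat=unknown setting shown in the log earlier in this thread.

```python
def classify_with_margin(scores, min_margin, default="unknown"):
    """Return the top-scoring label only if it clearly dominates.

    scores:     dict mapping label -> score, higher is better (assumption).
    min_margin: hypothetical threshold; the best score must exceed the
                runner-up by at least this much.
    default:    label returned when no category dominates.
    """
    # Rank labels from most to least likely.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return default
    if len(ranked) == 1:
        # Only one category exists, so nothing competes with it.
        return ranked[0][0]
    (best_label, best_score), (_, second_score) = ranked[0], ranked[1]
    if best_score - second_score >= min_margin:
        return best_label
    return default


# Clear winner: the margin is 4.7, above the threshold of 2.0.
print(classify_with_margin({"good": -10.2, "bad": -14.9}, min_margin=2.0))  # good
# Near tie: the margin is only 0.3, so fall back to the default label.
print(classify_with_margin({"good": -10.2, "bad": -10.5}, min_margin=2.0))  # unknown
```

With two categories the runner-up margin is the natural test. With a dozen, a single margin against the second-best score is a cruder criterion, which is one way to read the caveat above.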

On Fri, Jan 27, 2012 at 8:06 PM, Stuart Smith <stu24mail@yahoo.com> wrote:

> Hello,
>
> Does naive bayes always classify a document into a category?
> Or will it refuse to classify something it cannot?
>